How to Keep the First Value of Column After Explode in Pandas?

In this quick tutorial, we're going to look at how to keep the first value of column after explode in Pandas?

Suppose we have a DataFrame which has a column with nested data - list or JSON. Let's work with the following DataFrame:

import pandas as pd

data = {'ID': [1, 2, 3],
        'Items': [['A', 'B'], ['C', 'D'], ['E', 'F', 'G']]}

df = pd.DataFrame(data)

Data looks like:

ID Items
0 1 [A, B]
1 2 [C, D]
2 3 [E, F, G]

Explode the column items will return mulitple items per row:


result:

ID Items
0 1 A
0 1 B
1 2 C
1 2 D
2 3 E
2 3 F
2 3 G

Notice that index contains duplicates for each item present in the original column

Explode List Column Keep First Item

To explode column Items and keep only the first item per each row we can drop duplicates:

d = df.explode('Items')
d[~d.index.duplicated()]

result:

ID Items
0 1 A
1 2 C
2 3 E

Alternative simpple solution for List columns is the following:

df['Items'].apply(lambda x: x[0] if isinstance(x, list) else x)

The result is the same

Explode JSON column - keep first item

For JSON columns we can use the following code. Let say that we work with column 'tags' which contains JSON data like: [{'id': '62b5ac97cdb6600403c69f0f', 'name': 'Cheat Sheet', 'slug': '108-cheat-sheet', 'created_at': '2022-06-24T12:22:47.000Z'...}]

This column can be exploded and set to the original DataFrame by:

dd = df.explode('tags')['tags']
dd = pd.json_normalize(dd[~dd.index.duplicated()])
df[['tag', 'tag_published_at']] = dd[['slug', 'created_at']]

This will extract the complex structure and get only specfic columns and values.

If you need to explode given column and set default value you can use mask:

import pandas as pd

data = {'ID': [1, 2, 3],
        'Items': [['A', 'B'], ['C', 'D'], ['E', 'F', 'G']],
        'val': [5, 10, 7]}

df = pd.DataFrame(data)

data:

ID Items val
0 1 [A, B] 5
1 2 [C, D] 10
2 3 [E, F, G] 7

Explode and Mask

Expanding with setting value for another column we can do:

d = df.explode('Items')
d['val'] = d['val'].mask(d['ID'].duplicated(), 0)
d

results in:

ID Items val
0 1 A 5
0 1 B 0
1 2 C 10
1 2 D 0
2 3 E 7
2 3 F 0
2 3 G 0

Now we can keep first or different item by:

d[d['val'] != 0]

Conclusion

By using a lambda function or Pandas functions, you can preserve the first value of the specified column even after exploding it.

This ensure that your data maintains its original structure while benefitting from the exploded format.