In this quick tutorial, we're going to look at how to keep the first value of a column after explode in Pandas.
Suppose we have a DataFrame which has a column with nested data - list or JSON. Let's work with the following DataFrame:
import pandas as pd
data = {'ID': [1, 2, 3],
'Items': [['A', 'B'], ['C', 'D'], ['E', 'F', 'G']]}
df = pd.DataFrame(data)
Data looks like:
| | ID | Items |
|---|---|---|
| 0 | 1 | [A, B] |
| 1 | 2 | [C, D] |
| 2 | 3 | [E, F, G] |
Exploding the Items column with df.explode('Items') returns one row per list item, so each original row becomes multiple rows:
result:
| | ID | Items |
|---|---|---|
| 0 | 1 | A |
| 0 | 1 | B |
| 1 | 2 | C |
| 1 | 2 | D |
| 2 | 3 | E |
| 2 | 3 | F |
| 2 | 3 | G |
Notice that the index label of each original row is repeated for every item that was in its list.
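To see the repeated labels directly, here is a minimal sketch that prints the index of the exploded frame and the boolean mask returned by `Index.duplicated()`. It also shows `explode` with `ignore_index=True`, which renumbers the rows so the index no longer points back to the original row.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Items': [['A', 'B'], ['C', 'D'], ['E', 'F', 'G']]})

d = df.explode('Items')

# Each exploded row keeps the index label of the row it came from
labels = d.index.tolist()
print(labels)    # [0, 0, 1, 1, 2, 2, 2]

# True marks every repeat of a label after its first occurrence
repeats = d.index.duplicated().tolist()
print(repeats)   # [False, True, False, True, False, True, True]

# With ignore_index=True the labels become 0..n-1,
# so the link to the original row is lost
fresh_labels = df.explode('Items', ignore_index=True).index.tolist()
print(fresh_labels)   # [0, 1, 2, 3, 4, 5, 6]
```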
Explode List Column Keep First Item
To explode the Items column and keep only the first item for each original row, we can drop the rows whose index label is a duplicate:
d = df.explode('Items')
d[~d.index.duplicated()]
result:
| | ID | Items |
|---|---|---|
| 0 | 1 | A |
| 1 | 2 | C |
| 2 | 3 | E |
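The two steps above can be wrapped in a small function. The sketch below uses a hypothetical helper named `explode_keep` (not part of Pandas) and assumes the original DataFrame has a unique index, since the index is what ties exploded rows back to their source row. The `keep` argument is forwarded to `Index.duplicated`, so `'last'` keeps the last item instead of the first.

```python
import pandas as pd

def explode_keep(frame, column, keep='first'):
    """Explode `column`, then keep one exploded row per original row.

    Hypothetical helper for illustration. Assumes `frame` has a unique index.
    `keep` is passed to Index.duplicated and may be 'first' or 'last'.
    """
    exploded = frame.explode(column)
    return exploded[~exploded.index.duplicated(keep=keep)]

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Items': [['A', 'B'], ['C', 'D'], ['E', 'F', 'G']]})

first = explode_keep(df, 'Items')
last = explode_keep(df, 'Items', keep='last')

print(first['Items'].tolist())   # ['A', 'C', 'E']
print(last['Items'].tolist())    # ['B', 'D', 'G']
```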
An alternative, simpler solution for list columns is the following:
df['Items'].apply(lambda x: x[0] if isinstance(x, list) else x)
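The first items are the same (A, C, E). Note that this returns only the Items Series, so assign it back to df['Items'] if you want the same DataFrame as above.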
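One caveat with indexing `x[0]` directly is that an empty list raises an `IndexError`, whereas `explode` turns an empty list into NaN. A minimal sketch that mirrors the `explode` behaviour, using a hypothetical helper `first_item` and made-up data with one empty list:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Items': [['A', 'B'], [], ['E', 'F', 'G']]})

def first_item(value):
    # Hypothetical helper: first element of a list, NaN for an empty list,
    # and any non-list value returned unchanged
    if isinstance(value, list):
        return value[0] if value else np.nan
    return value

# assign() returns a new DataFrame and leaves df untouched
result = df.assign(Items=df['Items'].apply(first_item))
print(result)
```

Here the row with the empty list ends up with NaN in Items, which matches what the explode-based approach produces for that row.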
Explode JSON column - keep first item
For JSON columns we can use the following code. Let's say that we work with a column 'tags'
which contains JSON data like: [{'id': '62b5ac97cdb6600403c69f0f', 'name': 'Cheat Sheet', 'slug': '108-cheat-sheet', 'created_at': '2022-06-24T12:22:47.000Z'...}]
This column can be exploded, and selected fields of the first item assigned back to the original DataFrame, by:
dd = df.explode('tags')['tags']
dd = pd.json_normalize(dd[~dd.index.duplicated()])
df[['tag', 'tag_published_at']] = dd[['slug', 'created_at']]
This flattens the nested structure and keeps only specific columns and values.
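A self-contained sketch of the same idea follows. The sample rows, the `title` column, and the tag values are made up for illustration; only the field names `slug` and `created_at` come from the example above. It assumes every row has at least one tag, because a missing value would not be a dict that `pd.json_normalize` can flatten. The flattened frame is given the original index labels explicitly so that the column assignment aligns even when the DataFrame does not use a default 0..n-1 index.

```python
import pandas as pd

# Hypothetical sample data with a list of tag dicts per row
df = pd.DataFrame({
    'title': ['Post 1', 'Post 2'],
    'tags': [
        [{'name': 'Cheat Sheet', 'slug': 'cheat-sheet',
          'created_at': '2022-06-24T12:22:47.000Z'},
         {'name': 'Pandas', 'slug': 'pandas',
          'created_at': '2022-01-01T00:00:00.000Z'}],
        [{'name': 'Python', 'slug': 'python',
          'created_at': '2021-05-05T10:00:00.000Z'}],
    ],
}, index=[10, 20])

# One dict per exploded row, then keep the first dict of each original row
first_tags = df.explode('tags')['tags']
first_tags = first_tags[~first_tags.index.duplicated()]

# Flatten the dicts into columns and restore the original index labels
flat = pd.json_normalize(first_tags.tolist())
flat.index = first_tags.index

df['tag'] = flat['slug']
df['tag_published_at'] = flat['created_at']
print(df[['title', 'tag', 'tag_published_at']])
```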
If you need to explode a given column and set a default value on the repeated rows, you can use mask:
import pandas as pd
data = {'ID': [1, 2, 3],
'Items': [['A', 'B'], ['C', 'D'], ['E', 'F', 'G']],
'val': [5, 10, 7]}
df = pd.DataFrame(data)
data:
| | ID | Items | val |
|---|---|---|---|
| 0 | 1 | [A, B] | 5 |
| 1 | 2 | [C, D] | 10 |
| 2 | 3 | [E, F, G] | 7 |
Explode and Mask
To explode the column and set a value in another column for the repeated rows, we can do:
d = df.explode('Items')
d['val'] = d['val'].mask(d['ID'].duplicated(), 0)
d
results in:
| | ID | Items | val |
|---|---|---|---|
| 0 | 1 | A | 5 |
| 0 | 1 | B | 0 |
| 1 | 2 | C | 10 |
| 1 | 2 | D | 0 |
| 2 | 3 | E | 7 |
| 2 | 3 | F | 0 |
| 2 | 3 | G | 0 |
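Masking on `d['ID'].duplicated()` works here because every ID is unique in the original DataFrame. If the same ID can appear in more than one original row, a variant that masks on the duplicated index is safer. The sketch below uses made-up data in which ID 1 appears twice:

```python
import pandas as pd

# Hypothetical data: ID 1 appears in two different original rows
df = pd.DataFrame({'ID': [1, 1, 3],
                   'Items': [['A', 'B'], ['C', 'D'], ['E', 'F', 'G']],
                   'val': [5, 10, 7]})

d = df.explode('Items')

# The index label, not the ID, identifies the original row
repeated = d.index.duplicated()
d['val'] = d['val'].mask(repeated, 0)

print(d['val'].tolist())   # [5, 0, 10, 0, 7, 0, 0]
```

Masking on the ID column with this data would also zero out the 10 that belongs to the second original row.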
Now we can keep only the first item of each row by filtering out the masked rows:
d[d['val'] != 0]
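Filtering on `val != 0` also drops any row whose original value was already 0. If that can happen in your data, one option is to keep an explicit flag column instead of reusing `val`. A minimal sketch, where the column name `is_first` is an arbitrary choice and the data is made up to include a genuine 0:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Items': [['A', 'B'], ['C', 'D'], ['E', 'F', 'G']],
                   'val': [5, 0, 7]})   # ID 2 has a genuine 0

d = df.explode('Items')

# Flag the first exploded row of every original row
d['is_first'] = ~d.index.duplicated()

# Select on the flag, then drop it again
first_rows = d[d['is_first']].drop(columns='is_first')

print(first_rows['Items'].tolist())   # ['A', 'C', 'E']
print(first_rows['val'].tolist())     # [5, 0, 7]
```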
Conclusion
By using a lambda function or Pandas functions, you can preserve the first value of the specified column even after exploding it.
This ensures that your data keeps its original row structure while still benefiting from the exploded format.