How to Normalize JSON or Dict to New Columns in Pandas
In this article, we will see how to convert JSON or string representation of dictionaries in Pandas.
JSON(JavaScript Object Notation) data and dictionaries can be stored and imported in different ways. This might result in unexpected results or need to convert them to new columns.
In order to convert JSON, dicts and lists to tabular form we can use several different options. Let's cover the most popular of them in next steps.
Setup
In the post, we'll use the following DataFrame, which has columns:
col_json
- JSON column stored as JSONcol_str_dict
- JSON column stored as a stringcol_str_dict_list
- JSON column with nested list
import pandas as pd
data = {'col_json': {0: {'x': 1, 'y': 0, 'xy':1},
1: {'x': 0, 'y': 1, 'xy':1},
2: {'x': 1, 'y': 1, 'xy':1}},
'col_str_dict': {0: "{'x': 1, 'y': 0, 'xy':1}",
1: "{'x': 0, 'y': 1, 'xy':1}",
2: "{'x': 1, 'y': 1, 'xy':1}"}
,
'col_str_dict_list': {0: '{"x": [1,1]}',
1: '{"x": [0,1]}',
2: '{"x": [1,0]}'}
}
df = pd.DataFrame(data)
df
It's not possible to distinguish how JSON is stored from the dtypes:
df.dtypes
results into:
col_json object
col_str_dict object
col_str_dict_list object
dtype: object
DataFrame data looks like:
col_json | col_str_dict | col_str_dict_list | |
---|---|---|---|
0 | {'x': 1, 'y': 0, 'xy': 1} | {'x': 1, 'y': 0, 'xy':1} | {"x": [1,1]} |
1 | {'x': 0, 'y': 1, 'xy': 1} | {'x': 0, 'y': 1, 'xy':1} | {"x": [0,1]} |
2 | {'x': 1, 'y': 1, 'xy': 1} | {'x': 1, 'y': 1, 'xy':1} | {"x": [1,0]} |
1: Normalize JSON - json_normalize
Since Pandas version 1.2.4 there is new method to normalize JSON data: pd.json_normalize()
It can be used to convert a JSON column to multiple columns:
pd.json_normalize(df['col_json'])
this will result into new DataFrame with values stored in the JSON:
x | y | xy | |
---|---|---|---|
0 | 1 | 0 | 1 |
1 | 0 | 1 | 1 |
2 | 1 | 1 | 1 |
The method pd.json_normalize has several parameters like:
record_path
- Path in each object to list of records. If not passed, data will be assumed to be an array of records.meta
- Fields to use as metadata for each record in the resulting table.errors
- Configures error handling -ignore
orraise
max_level
- Max number of levels(depth of dict) to normalize. if None, normalizes all levels.
How to use them in more complex examples is covered in Step 5.
2: Flattening JSON - .apply(pd.Series)
As an alternative solution we can use .apply(pd.Series)
. The result is the same:
df['col_json'].apply(pd.Series)
this will result into new DataFrame with values stored in the JSON:
x | y | xy | |
---|---|---|---|
0 | 1 | 0 | 1 |
1 | 0 | 1 | 1 |
2 | 1 | 1 | 1 |
Method pd.json_normalize is faster than
.apply(pd.Series)- tested by `%%timeit`
Compare performance of json_normalize
and .apply(pd.Series)
:
json_normalize
- 361 µs ± 2.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each).apply(pd.Series)
- 1.29 ms ± 21.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3: Parse JSON - json.loads + ast.literal_eval
Methods ast.literal_eval
and json.loads
help us to parse JSON data.
They can help us when we need to read and parse JSON stored as string.
Let's see them how they work and what is the key difference:
json.loads('{"x": 1}')
result:
{'x': 1}
ast.literal_eval('{"x": 1}')
we get the same result:
{'x': 1}
Method json.loads can work only with valid JSONs while
ast.literal_evalcan load wider variety of JSON like syntax
Which means that:
json.loads("{'x': 1}")
will error with:
JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
But we can convert non standard JSON data by ast.literal_eval
:
ast.literal_eval("{'x': 1}")
4: Flattening JSON stored as string
What if we like to normalize JSON which is stored as string in Pandas column.
Using previous steps will not help. It will result in a single column named 0
.
In order to convert dict or JSON stored as a string to multiple columns we can use combination of:
pd.json_normalize
ast.literal_eval
orjson.loads
(for strict JSON format)
import ast
df['col_str_dict'].apply(ast.literal_eval)
This will result into:
x | y | xy | |
---|---|---|---|
0 | 1 | 0 | 1 |
1 | 0 | 1 | 1 |
2 | 1 | 1 | 1 |
5: Flattening nested JSON - pd.json_normalize
Suppose we have data as follow:
data = [
{
'pos': 'east',
'population': 55,
'continent': 'Africa',
'info': {
'people': {
'environmentalist': 'Wangari Maathai',
'journalist': 'Ngugi wa Thiong''o'
}
},
'city': [
{
'name': 'Nairobi',
'capital': True,
'population': { 'total': 4.4, 'rank': 11 }
},
{
'name': 'tigerfish',
'capital': False,
'population': { 'total': 1.2, 'rank': 12 }
},
]
},
{
'pos': 'central',
'population': 218,
'continent': 'Africa',
'people': {
'famous': {
'writer': 'Ken Saro-Wiwa',
'artist': 'Tiwa Savage'
}
},
'city': [
{ 'name': 'Lagos', 'capital': True },
{ 'name': 'Kano', 'capital': False },
]
}
]
The table represents the data stored as a DataFrame:
pos | population | continent | info | city | people | |
---|---|---|---|---|---|---|
0 | east | 55 | Africa | {'people': {'environmentalist': 'Wangari Maathai', 'journalist': 'Ngugi wa Thiongo'}} | [{'name': 'Nairobi', 'capital': True, 'population': {'total': 4.4, 'rank': 11}}, {'name': 'tigerfish', 'capital': False, 'population': {'total': 1.2, 'rank': 12}}] | NaN |
1 | central | 218 | Africa | NaN | [{'name': 'Lagos', 'capital': True}, {'name': 'Kano', 'capital': False}] | {'famous': {'writer': 'Ken Saro-Wiwa', 'artist': 'Tiwa Savage'}} |
If we need to specify the path and the metadata we can do:
pd.json_normalize(
data,
record_path =['city'],
meta=['pos', 'continent', 'population'],
errors='ignore'
)
which will result into normalized form:
name | capital | population.total | population.rank | pos | continent | population | |
---|---|---|---|---|---|---|---|
0 | Nairobi | True | 4.4 | 11.0 | east | Africa | 55 |
1 | tigerfish | False | 1.2 | 12.0 | east | Africa | 55 |
2 | Lagos | True | NaN | NaN | central | Africa | 218 |
3 | Kano | False | NaN | NaN | central | Africa | 218 |
6: Flattening a JSON with multiple levels
There is a Python library accessible at: flatten-json.
It can be installed by:
pip install flatten-json
The library is described as:
Flattens JSON objects in Python. flatten_json flattens the hierarchy in your object which can be useful if you want to force your objects into a table.
For this step we are going to create additional DataFrame:
data = {'col_json': {0: {"x": 1.0, "y": 0.0},
1: {"x": 0.0, "y": 1.0},
2: {"x": 1.0, "y": 1.0}}
}
df2 = pd.DataFrame(data)
To demonstrate how we can flatten JSON objects in Python and Pandas:
from flatten_json import flatten
df2['col_json'].apply(flatten).apply(pd.Series)
the result:
x | y | |
---|---|---|
0 | 1.0 | 0.0 |
1 | 0.0 | 1.0 |
2 | 1.0 | 1.0 |
7: Parse local/remote JSON file
If JSON data is stored as a file - locally or remotely we can normalize it with few additional lines:
Normalize local JSON file in Pandas
The code below will load and normalize local file in the same folder as the script:
import json
with open('data.json','r') as f:
data = json.loads(f.read())
pd.json_normalize(data)
Normalize remote JSON file in Python
In order to load and normalize JSON data from a remote file we can use the following code:
import requests
URL = 'https://raw.githubusercontent.com/softhints/Pandas-Tutorials/master/data/sample.json'
data = json.loads(requests.get(URL).text)
pd.json_normalize(data)
8: Concatenate expanded data to DataFrame
If you need to concatenate the normalized or flattened data to the original DataFrame we can use method concat
:
import json
pd.concat([df, df['col_str_dict_2'].apply(json.loads).apply(pd.Series)], axis=1)
and apply the operations on columns:
col_json | col_str_dict | col_str_dict_list | x | |
---|---|---|---|---|
0 | {'x': 1, 'y': 0, 'xy': 1} | {'x': 1, 'y': 0, 'xy':1} | {"x": [1,1]} | [1, 1] |
1 | {'x': 0, 'y': 1, 'xy': 1} | {'x': 0, 'y': 1, 'xy':1} | {"x": [0,1]} | [0, 1] |
2 | {'x': 1, 'y': 1, 'xy': 1} | {'x': 1, 'y': 1, 'xy':1} | {"x": [1,0]} | [1, 0] |
9. Multiple levels, custom parsing
Sometimes we might have multiple levels and errors. We can bypass the errors from json conversion by using custom method:
import ast
import pandas as pd
import json
def parse_eval(value):
try:
return ast.literal_eval(value)
except (ValueError, SyntaxError):
return value
pd.DataFrame(json.loads("[" + df1['items'].str.replace("\'", "").replace(r"\\\"", "", regex=False).apply(parse_eval).apply(pd.Series)[0].iloc[0] + "]")).title
In the code above:
- we load content of column items
- then fix the extra characters like
'
and"
- parse the json
- expand nested values
- get the content
- load the string as json (adding parathensis)
- create dataframe
- finally extra field title only
Often you need to apply custom steps depending on your case.
More examples
How to flatten nested JSON in Pandas:
import ast
df['col_str_dict_2'].apply(ast.literal_eval).apply(pd.Series)
Normalize semi-structured JSON data into a flat table:
import ast
df['col_str_dict_2'].apply(pd.json_normalize)
Fix multiple errors with the input JSON data and format:
- replace quotes
- replace unicode characters
- load as JSON
- convert to columns
df_spm['result'].str.replace('\\"', '', regex=False).replace("\\\'", '', regex=True)\
.replace(u'\u200e', '', regex=True).str.encode('ascii','ignore')\
.apply(json.loads).apply(pd.Series)
TypeError: the JSON object must be str, bytes or bytearray, not 'float'
Sometimes we might get errors like:
TypeError: the JSON object must be str, bytes or bytearray, not 'float'
TypeError: the JSON object must be str, bytes or bytearray, not dict
Different reasons can cause similar errors. This error will be raised if we try to apply json.loads
to a JSON data:
df2['col_json'].apply(json.loads)
To avoid such errors we might convert the column to string or parse it by library flatten_json
:
from flatten_json import flatten
df2['col_json'].apply(flatten).apply(pd.Series)
TypeError: string indices must be integers
This error may be the result of misuse of the method: pd.json_normalize()
. Be sure to pass JSON data.
If we are passing DataFrame then we need to convert it to proper JSON by:
data.to_dict()
ValueError: malformed node or string on line 1
If you face Pandas error:
ValueError: malformed node or string on line 1: <ast.Name object at 0x7fc439d3b3a0>
while using .apply(ast.literal_eval)
.
Then you can try:
df_spm['result'].str.replace('\\"', '', regex=False).replace("\\\'", '', regex=True)\
.replace(u'\u200e', '', regex=True).str.encode('ascii','ignore')\
.apply(json.loads).apply(pd.Series)
Conclusion
In this article we covered multiple ways to convert JSON data or columns containing JSON data to multiple columns.
We discussed different problems and solutions of most typical problems. It mentioned performance benefits and working with multiple level JSON data.