How to Normalize JSON or Dict to New Columns in Pandas

In this article, we will see how to convert JSON or string representation of dictionaries in Pandas.

JSON(JavaScript Object Notation) data and dictionaries can be stored and imported in different ways. This might result in unexpected results or need to convert them to new columns.

In order to convert JSON, dicts and lists to tabular form we can use several different options. Let's cover the most popular of them in next steps.

Setup

In the post, we'll use the following DataFrame, which has columns:

  • col_json - JSON column stored as JSON
  • col_str_dict - JSON column stored as a string
  • col_str_dict_list - JSON column with nested list
import pandas as pd
data = {'col_json': {0: {'x': 1, 'y': 0, 'xy':1},
                     1: {'x': 0, 'y': 1, 'xy':1},
                     2: {'x': 1, 'y': 1, 'xy':1}},
        'col_str_dict': {0: "{'x': 1, 'y': 0, 'xy':1}",
                         1: "{'x': 0, 'y': 1, 'xy':1}",
                         2: "{'x': 1, 'y': 1, 'xy':1}"}
        ,
        'col_str_dict_list': {0: '{"x": [1,1]}',
                         1: '{"x": [0,1]}',
                         2: '{"x": [1,0]}'}
       }
df = pd.DataFrame(data)
df

It's not possible to distinguish how JSON is stored from the dtypes:

df.dtypes

results into:

col_json             object
col_str_dict         object
col_str_dict_list    object
dtype: object

DataFrame data looks like:

col_json col_str_dict col_str_dict_list
0 {'x': 1, 'y': 0, 'xy': 1} {'x': 1, 'y': 0, 'xy':1} {"x": [1,1]}
1 {'x': 0, 'y': 1, 'xy': 1} {'x': 0, 'y': 1, 'xy':1} {"x": [0,1]}
2 {'x': 1, 'y': 1, 'xy': 1} {'x': 1, 'y': 1, 'xy':1} {"x": [1,0]}

1: Normalize JSON - json_normalize

Since Pandas version 1.2.4 there is new method to normalize JSON data: pd.json_normalize()

It can be used to convert a JSON column to multiple columns:

pd.json_normalize(df['col_json'])

this will result into new DataFrame with values stored in the JSON:

x y xy
0 1 0 1
1 0 1 1
2 1 1 1

The method pd.json_normalize has several parameters like:

  • record_path - Path in each object to list of records. If not passed, data will be assumed to be an array of records.
  • meta - Fields to use as metadata for each record in the resulting table.
  • errors - Configures error handling - ignore or raise
  • max_level - Max number of levels(depth of dict) to normalize. if None, normalizes all levels.

How to use them in more complex examples is covered in Step 5.

2: Flattening JSON - .apply(pd.Series)

As an alternative solution we can use .apply(pd.Series). The result is the same:

df['col_json'].apply(pd.Series)

this will result into new DataFrame with values stored in the JSON:

x y xy
0 1 0 1
1 0 1 1
2 1 1 1
× Pro Tip
Method pd.json_normalize is faster than
.apply(pd.Series)
- tested by `%%timeit`

Compare performance of json_normalize and .apply(pd.Series) :

  • json_normalize - 361 µs ± 2.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • .apply(pd.Series) - 1.29 ms ± 21.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

3: Parse JSON - json.loads + ast.literal_eval

Methods ast.literal_eval and json.loads help us to parse JSON data.

They can help us when we need to read and parse JSON stored as string.

Let's see them how they work and what is the key difference:

json.loads('{"x": 1}')

result:

{'x': 1}
ast.literal_eval('{"x": 1}')

we get the same result:

{'x': 1}
× Pro Tip
Method json.loads can work only with valid JSONs while
ast.literal_eval
can load wider variety of JSON like syntax

Which means that:

json.loads("{'x': 1}")

will error with:

JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

But we can convert non standard JSON data by ast.literal_eval:

ast.literal_eval("{'x': 1}")

4: Flattening JSON stored as string

What if we like to normalize JSON which is stored as string in Pandas column.

Using previous steps will not help. It will result in a single column named 0.

In order to convert dict or JSON stored as a string to multiple columns we can use combination of:

  • pd.json_normalize
  • ast.literal_eval or json.loads (for strict JSON format)
import ast
df['col_str_dict'].apply(ast.literal_eval)

This will result into:

x y xy
0 1 0 1
1 0 1 1
2 1 1 1

5: Flattening nested JSON - pd.json_normalize

Suppose we have data as follow:

data = [
    {
        'pos': 'east',
        'population': 55,
        'continent': 'Africa',
        'info': {
            'people': {
                'environmentalist': 'Wangari Maathai',
                'journalist': 'Ngugi wa Thiong''o'
            }
        },
        'city': [
            {
                'name': 'Nairobi',
                'capital': True,
                'population': { 'total': 4.4, 'rank': 11 }
            },
            {
                'name': 'tigerfish',
                'capital': False,
                'population': { 'total': 1.2, 'rank': 12 }
            },
        ]
    },
    {
        'pos': 'central',
        'population': 218,
        'continent': 'Africa',
        'people': {
            'famous': {
                'writer': 'Ken Saro-Wiwa',
                'artist': 'Tiwa Savage'
            }
        },
        'city': [
            { 'name': 'Lagos', 'capital': True },
            { 'name': 'Kano', 'capital': False },
        ]
    }
]

The table represents the data stored as a DataFrame:

pos population continent info city people
0 east 55 Africa {'people': {'environmentalist': 'Wangari Maathai', 'journalist': 'Ngugi wa Thiongo'}} [{'name': 'Nairobi', 'capital': True, 'population': {'total': 4.4, 'rank': 11}}, {'name': 'tigerfish', 'capital': False, 'population': {'total': 1.2, 'rank': 12}}] NaN
1 central 218 Africa NaN [{'name': 'Lagos', 'capital': True}, {'name': 'Kano', 'capital': False}] {'famous': {'writer': 'Ken Saro-Wiwa', 'artist': 'Tiwa Savage'}}

If we need to specify the path and the metadata we can do:

pd.json_normalize(
    data,
    record_path =['city'],
    meta=['pos', 'continent', 'population'],
    errors='ignore'
)

which will result into normalized form:

name capital population.total population.rank pos continent population
0 Nairobi True 4.4 11.0 east Africa 55
1 tigerfish False 1.2 12.0 east Africa 55
2 Lagos True NaN NaN central Africa 218
3 Kano False NaN NaN central Africa 218

6: Flattening a JSON with multiple levels

There is a Python library accessible at: flatten-json.

It can be installed by:

pip install flatten-json

The library is described as:

Flattens JSON objects in Python. flatten_json flattens the hierarchy in your object which can be useful if you want to force your objects into a table.

For this step we are going to create additional DataFrame:

data = {'col_json': {0: {"x": 1.0, "y": 0.0},
                     1: {"x": 0.0, "y": 1.0},
                     2: {"x": 1.0, "y": 1.0}}
       }
df2 = pd.DataFrame(data)

To demonstrate how we can flatten JSON objects in Python and Pandas:

from flatten_json import flatten
df2['col_json'].apply(flatten).apply(pd.Series)

the result:

x y
0 1.0 0.0
1 0.0 1.0
2 1.0 1.0

7: Parse local/remote JSON file

If JSON data is stored as a file - locally or remotely we can normalize it with few additional lines:

Normalize local JSON file in Pandas

The code below will load and normalize local file in the same folder as the script:

import json

with open('data.json','r') as f:
    data = json.loads(f.read())
    
pd.json_normalize(data)

Normalize remote JSON file in Python

In order to load and normalize JSON data from a remote file we can use the following code:

import requests
URL = 'https://raw.githubusercontent.com/softhints/Pandas-Tutorials/master/data/sample.json'
data = json.loads(requests.get(URL).text)

pd.json_normalize(data)

8: Concatenate expanded data to DataFrame

If you need to concatenate the normalized or flattened data to the original DataFrame we can use method concat:

import json
pd.concat([df, df['col_str_dict_2'].apply(json.loads).apply(pd.Series)], axis=1)

and apply the operations on columns:

col_json col_str_dict col_str_dict_list x
0 {'x': 1, 'y': 0, 'xy': 1} {'x': 1, 'y': 0, 'xy':1} {"x": [1,1]} [1, 1]
1 {'x': 0, 'y': 1, 'xy': 1} {'x': 0, 'y': 1, 'xy':1} {"x": [0,1]} [0, 1]
2 {'x': 1, 'y': 1, 'xy': 1} {'x': 1, 'y': 1, 'xy':1} {"x": [1,0]} [1, 0]

9. Multiple levels, custom parsing

Sometimes we might have multiple levels and errors. We can bypass the errors from json conversion by using custom method:

import ast
import pandas as pd
import json

def parse_eval(value):
	try:
		return ast.literal_eval(value)
	except (ValueError, SyntaxError):
		return value


pd.DataFrame(json.loads("[" + df1['items'].str.replace("\'", "").replace(r"\\\"", "", regex=False).apply(parse_eval).apply(pd.Series)[0].iloc[0] + "]")).title

In the code above:

  • we load content of column items
  • then fix the extra characters like ' and "
  • parse the json
  • expand nested values
  • get the content
  • load the string as json (adding parathensis)
  • create dataframe
  • finally extra field title only

Often you need to apply custom steps depending on your case.

More examples

How to flatten nested JSON in Pandas:

import ast
df['col_str_dict_2'].apply(ast.literal_eval).apply(pd.Series)

Normalize semi-structured JSON data into a flat table:

import ast
df['col_str_dict_2'].apply(pd.json_normalize)

Fix multiple errors with the input JSON data and format:

  • replace quotes
  • replace unicode characters
  • load as JSON
  • convert to columns
df_spm['result'].str.replace('\\"', '', regex=False).replace("\\\'", '', regex=True)\
    .replace(u'‎\u200e', '', regex=True).str.encode('ascii','ignore')\
    .apply(json.loads).apply(pd.Series)

TypeError: the JSON object must be str, bytes or bytearray, not 'float'

Sometimes we might get errors like:

  • TypeError: the JSON object must be str, bytes or bytearray, not 'float'
  • TypeError: the JSON object must be str, bytes or bytearray, not dict

Different reasons can cause similar errors. This error will be raised if we try to apply json.loads to a JSON data:

df2['col_json'].apply(json.loads)

To avoid such errors we might convert the column to string or parse it by library flatten_json:

from flatten_json import flatten
df2['col_json'].apply(flatten).apply(pd.Series)

TypeError: string indices must be integers

This error may be the result of misuse of the method: pd.json_normalize(). Be sure to pass JSON data.

If we are passing DataFrame then we need to convert it to proper JSON by:

data.to_dict()

ValueError: malformed node or string on line 1

If you face Pandas error:

ValueError: malformed node or string on line 1: <ast.Name object at 0x7fc439d3b3a0>

while using .apply(ast.literal_eval).

Then you can try:

df_spm['result'].str.replace('\\"', '', regex=False).replace("\\\'", '', regex=True)\
    .replace(u'‎\u200e', '', regex=True).str.encode('ascii','ignore')\
    .apply(json.loads).apply(pd.Series)

Conclusion

In this article we covered multiple ways to convert JSON data or columns containing JSON data to multiple columns.

We discussed different problems and solutions of most typical problems. It mentioned performance benefits and working with multiple level JSON data.