In this article, we will see how to convert JSON or string representation of dictionaries in Pandas.

JSON(JavaScript Object Notation) data and dictionaries can be stored and imported in different ways. This might result in unexpected results or need to convert them to new columns.

In order to convert JSON, dicts and lists to tabular form we can use several different options. Let's cover the most popular of them in next steps.

Setup

In the post, we'll use the following DataFrame, which has columns:

  • col_json - JSON column stored as JSON
  • col_str_dict - JSON column stored as a string
  • col_str_dict_list - JSON column with nested list
import pandas as pd
data = {'col_json': {0: {'x': 1, 'y': 0, 'xy':1},
                     1: {'x': 0, 'y': 1, 'xy':1},
                     2: {'x': 1, 'y': 1, 'xy':1}},
        'col_str_dict': {0: "{'x': 1, 'y': 0, 'xy':1}",
                         1: "{'x': 0, 'y': 1, 'xy':1}",
                         2: "{'x': 1, 'y': 1, 'xy':1}"}
        ,
        'col_str_dict_list': {0: '{"x": [1,1]}',
                         1: '{"x": [0,1]}',
                         2: '{"x": [1,0]}'}
       }
df = pd.DataFrame(data)
df

It's not possible to distinguish how JSON is stored from the dtypes:

df.dtypes

results into:

col_json             object
col_str_dict         object
col_str_dict_list    object
dtype: object

DataFrame data looks like:

col_json col_str_dict col_str_dict_list
0 {'x': 1, 'y': 0, 'xy': 1} {'x': 1, 'y': 0, 'xy':1} {"x": [1,1]}
1 {'x': 0, 'y': 1, 'xy': 1} {'x': 0, 'y': 1, 'xy':1} {"x": [0,1]}
2 {'x': 1, 'y': 1, 'xy': 1} {'x': 1, 'y': 1, 'xy':1} {"x": [1,0]}

Step 1: Normalize simple JSON with pd.json_normalize()

Since Pandas version 1.2.4 there is new method to normalize JSON data: pd.json_normalize()

It can be used to convert a JSON column to multiple columns:

pd.json_normalize(df['col_json'])

this will result into new DataFrame with values stored in the JSON:

x y xy
0 1 0 1
1 0 1 1
2 1 1 1

The method pd.json_normalize has several parameters like:

  • record_path - Path in each object to list of records. If not passed, data will be assumed to be an array of records.
  • meta - Fields to use as metadata for each record in the resulting table.
  • errors - Configures error handling - ignore or raise
  • max_level - Max number of levels(depth of dict) to normalize. if None, normalizes all levels.

How to use them in more complex examples is covered in Step 5.

Step 2: Flattening JSON with .apply(pd.Series)

As an alternative solution we can use .apply(pd.Series). The result is the same:

df['col_json'].apply(pd.Series)

this will result into new DataFrame with values stored in the JSON:

x y xy
0 1 0 1
1 0 1 1
2 1 1 1
× Pro Tip
Method pd.json_normalize is faster than
.apply(pd.Series)
- tested by `%%timeit`

Compare performance of json_normalize and .apply(pd.Series) :

  • json_normalize - 361 µs ± 2.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • .apply(pd.Series) - 1.29 ms ± 21.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Step 3: Parse JSON data with json.loads and ast.literal_eval

Methods ast.literal_eval and json.loads help us to parse JSON data.

They can help us when we need to read and parse JSON stored as string.

Let's see them how they work and what is the key difference:

json.loads('{"x": 1}')

result:

{'x': 1}
ast.literal_eval('{"x": 1}')

we get the same result:

{'x': 1}
× Pro Tip
Method json.loads can work only with valid JSONs while
ast.literal_eval
can load wider variety of JSON like syntax

Which means that:

json.loads("{'x': 1}")

will error with:

JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

But we can convert non standard JSON data by ast.literal_eval:

ast.literal_eval("{'x': 1}")

Step 4: Flattening JSON stored as string

What if we like to normalize JSON which is stored as string in Pandas column.

Using previous steps will not help. It will result in a single column named 0.

In order to convert dict or JSON stored as a string to multiple columns we can use combination of:

  • pd.json_normalize
  • ast.literal_eval or json.loads (for strict JSON format)
import ast
df['col_str_dict'].apply(ast.literal_eval)

This will result into:

x y xy
0 1 0 1
1 0 1 1
2 1 1 1

Step 5: Flattening complex JSON data with pd.json_normalize

Suppose we have data as follow:

data = [
    {
        'pos': 'east',
        'population': 55,
        'continent': 'Africa',
        'info': {
            'people': {
                'environmentalist': 'Wangari Maathai',
                'journalist': 'Ngugi wa Thiong''o'
            }
        },
        'city': [
            {
                'name': 'Nairobi',
                'capital': True,
                'population': { 'total': 4.4, 'rank': 11 }
            },
            {
                'name': 'tigerfish',
                'capital': False,
                'population': { 'total': 1.2, 'rank': 12 }
            },
        ]
    },
    {
        'pos': 'central',
        'population': 218,
        'continent': 'Africa',
        'people': {
            'famous': {
                'writer': 'Ken Saro-Wiwa',
                'artist': 'Tiwa Savage'
            }
        },
        'city': [
            { 'name': 'Lagos', 'capital': True },
            { 'name': 'Kano', 'capital': False },
        ]
    }
]

The table represents the data stored as a DataFrame:

pos population continent info city people
0 east 55 Africa {'people': {'environmentalist': 'Wangari Maathai', 'journalist': 'Ngugi wa Thiongo'}} [{'name': 'Nairobi', 'capital': True, 'population': {'total': 4.4, 'rank': 11}}, {'name': 'tigerfish', 'capital': False, 'population': {'total': 1.2, 'rank': 12}}] NaN
1 central 218 Africa NaN [{'name': 'Lagos', 'capital': True}, {'name': 'Kano', 'capital': False}] {'famous': {'writer': 'Ken Saro-Wiwa', 'artist': 'Tiwa Savage'}}

If we need to specify the path and the metadata we can do:

pd.json_normalize(
    data,
    record_path =['city'],
    meta=['pos', 'continent', 'population'],
    errors='ignore'
)

which will result into normalized form:

name capital population.total population.rank pos continent population
0 Nairobi True 4.4 11.0 east Africa 55
1 tigerfish False 1.2 12.0 east Africa 55
2 Lagos True NaN NaN central Africa 218
3 Kano False NaN NaN central Africa 218

Step 6: Flattening a JSON with multiple levels or nested lists

There is a Python library accessible at: flatten-json.

It can be installed by:

pip install flatten-json

The library is described as:

Flattens JSON objects in Python. flatten_json flattens the hierarchy in your object which can be useful if you want to force your objects into a table.

For this step we are going to create additional DataFrame:

data = {'col_json': {0: {"x": 1.0, "y": 0.0},
                     1: {"x": 0.0, "y": 1.0},
                     2: {"x": 1.0, "y": 1.0}}
       }
df2 = pd.DataFrame(data)

To demonstrate how we can flatten JSON objects in Python and Pandas:

from flatten_json import flatten
df2['col_json'].apply(flatten).apply(pd.Series)

the result:

x y
0 1.0 0.0
1 0.0 1.0
2 1.0 1.0

Step 7: Parse local or remote JSON file

If JSON data is stored as a file - locally or remotely we can normalize it with few additional lines:

Normalize local JSON file in Pandas

The code below will load and normalize local file in the same folder as the script:

import json

with open('data.json','r') as f:
    data = json.loads(f.read())
    
pd.json_normalize(data)

Normalize remote JSON file in Python

In order to load and normalize JSON data from a remote file we can use the following code:

import requests
URL = 'https://raw.githubusercontent.com/softhints/Pandas-Tutorials/master/data/sample.json'
data = json.loads(requests.get(URL).text)

pd.json_normalize(data)

Step 8: Concatenate normalized JSON data to the original DataFrame

If you need to concatenate the normalized or flattened data to the original DataFrame we can use method concat:

import json
pd.concat([df, df['col_str_dict_2'].apply(json.loads).apply(pd.Series)], axis=1)

and apply the operations on columns:

col_json col_str_dict col_str_dict_list x
0 {'x': 1, 'y': 0, 'xy': 1} {'x': 1, 'y': 0, 'xy':1} {"x": [1,1]} [1, 1]
1 {'x': 0, 'y': 1, 'xy': 1} {'x': 0, 'y': 1, 'xy':1} {"x": [0,1]} [0, 1]
2 {'x': 1, 'y': 1, 'xy': 1} {'x': 1, 'y': 1, 'xy':1} {"x": [1,0]} [1, 0]

TypeError: the JSON object must be str, bytes or bytearray, not 'float'

Sometimes we might get errors like:

  • TypeError: the JSON object must be str, bytes or bytearray, not 'float'
  • TypeError: the JSON object must be str, bytes or bytearray, not dict

Different reasons can cause similar errors. This error will be raised if we try to apply json.loads to a JSON data:

df2['col_json'].apply(json.loads)

To avoid such errors we might convert the column to string or parse it by library flatten_json:

from flatten_json import flatten
df2['col_json'].apply(flatten).apply(pd.Series)

TypeError: string indices must be integers

This error may be the result of misuse of the method: pd.json_normalize(). Be sure to pass JSON data.

If we are passing DataFrame then we need to convert it to proper JSON by:

data.to_dict()

Conclusion

In this article we covered multiple ways to convert JSON data or columns containing JSON data to multiple columns.

We discussed different problems and solutions of most typical problems. It mentioned performance benefits and working with multiple level JSON data.