How to Read a Compressed CSV or JSON File in Pandas

To read compressed CSV and JSON files in Pandas directly, without decompressing them manually, use:

(1) Read a Compressed CSV

pd.read_csv('data.csv.gz', compression='gzip')

(2) Read a Compressed JSON

pd.read_json('data.json.gz', compression='gzip')

This is useful for handling large datasets: compressed files take less storage space, and text formats such as CSV and JSON often shrink several times over, in some cases by as much as 10 times.
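How much space is saved depends on the data. As a rough sketch, the following writes the same DataFrame as plain CSV and as gzip-compressed CSV and compares the file sizes. The sample data and file names are made up for illustration, and repetitive data like this compresses better than most real datasets.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data: repetitive values compress well.
df = pd.DataFrame({
    "id": range(10_000),
    "city": ["Sofia", "Paris", "Tokyo", "Lima"] * 2_500,
})

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "data.csv")
    gzip_path = os.path.join(tmp, "data.csv.gz")

    # Same content, written once uncompressed and once with gzip.
    df.to_csv(plain_path, index=False)
    df.to_csv(gzip_path, index=False, compression="gzip")

    plain_size = os.path.getsize(plain_path)
    gzip_size = os.path.getsize(gzip_path)

print(f"plain: {plain_size} bytes, gzip: {gzip_size} bytes")
print(f"ratio: {plain_size / gzip_size:.1f}x")
```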

Reading a Compressed CSV File with gzip

Use the compression parameter in pd.read_csv() to read CSV files in different compressed formats.

Example: Reading a Gzip-Compressed CSV File

import pandas as pd  

df = pd.read_csv('data.csv.gz', compression='gzip')  

df.head()
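If you do not have a data.csv.gz file at hand, the example can be tried end to end with a small file created on the fly. This is a minimal sketch: the column names and values are invented, and the file is written with Python's standard gzip module only to show that Pandas reads an ordinary gzip file.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical CSV content used only for this demonstration.
csv_text = "name,score\nAnn,90\nBob,85\nCid,77\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv.gz")

    # Write a gzip-compressed CSV with the standard library.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(csv_text)

    # Read it back without decompressing it manually.
    df = pd.read_csv(path, compression="gzip")

print(df.head())
```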

Other Supported Compression Formats

You can specify different compression types as needed:

bz2 - Bzip2 compression
zip - Zip compression
xz - XZ compression
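The formats listed above can be tried with a short round trip. The sketch below writes a small, made-up DataFrame in each format with to_csv() and reads it back with the matching compression value. The file names are hypothetical.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Hypothetical file names; compression is passed explicitly below.
formats = {"bz2": "data.csv.bz2", "zip": "data.csv.zip", "xz": "data.csv.xz"}

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for compression, name in formats.items():
        path = os.path.join(tmp, name)
        # Write and read with the same compression type.
        df.to_csv(path, index=False, compression=compression)
        results[compression] = pd.read_csv(path, compression=compression)

for compression, loaded in results.items():
    print(compression, loaded.shape)
```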


If the file extension matches the compression format (for example .gz, .bz2, .zip or .xz), Pandas detects it automatically, because compression='infer' is the default:

```python
df = pd.read_csv('data.csv.gz')  # No need to specify compression
```

Reading a Compressed JSON File

Pandas also supports reading compressed JSON files using pd.read_json():

df_json = pd.read_json('data.json.gz', compression='gzip')
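A round trip makes it easy to check that the compressed JSON is read correctly. This is a sketch with invented data. It assumes the file stores a list of row objects (orient="records"), so a different JSON layout would need a different orient value.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "score": [90, 85]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.json.gz")

    # orient="records" writes a list of row objects: [{"name": ...}, ...]
    df.to_json(path, orient="records", compression="gzip")

    # Read the gzip-compressed JSON back into a DataFrame.
    df_json = pd.read_json(path, orient="records", compression="gzip")

print(df_json)
```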

Multiple files found in ZIP file or nested folder

If you get an error like:

ValueError: Multiple files found in ZIP file. Only one file per ZIP: ['test/', 'test/data.csv']

You need to specify which file inside the archive should be read:

import zipfile
import pandas as pd

# Open the archive, then open one member by its path inside the ZIP
with zipfile.ZipFile("/mnt/x/test.zip") as z:
    with z.open("test/data.csv") as f:
        df = pd.read_csv(f, header=0, delimiter="\t")
df

With this approach you can read a specific file from a ZIP archive that contains multiple files or nested folders.
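When the member names are not known in advance, they can be listed with namelist() and filtered. The sketch below builds a small archive in memory so that it runs anywhere. The member names and their contents are hypothetical.

```python
import io
import zipfile

import pandas as pd

# Build a small in-memory archive with hypothetical member names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("test/a.csv", "id,value\n1,10\n2,20\n")
    z.writestr("test/b.csv", "id,value\n3,30\n")
    z.writestr("test/readme.txt", "not a table")

# Read every CSV member into its own DataFrame, keyed by member name.
frames = {}
with zipfile.ZipFile(buffer) as z:
    for name in z.namelist():
        if name.endswith(".csv"):
            with z.open(name) as f:
                frames[name] = pd.read_csv(f)

# Optionally stack all members into one DataFrame.
combined = pd.concat(list(frames.values()), ignore_index=True)
print(list(frames))
print(combined)
```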

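Read a Remote Compressed CSV File from a URL

Pandas can read a remote compressed CSV file by downloading its bytes and wrapping them in io.BytesIO(r.read()). An in-memory buffer has no file name from which the format could be inferred, so the compression is passed explicitly:

import io
from urllib.request import urlopen
import pandas as pd

r = urlopen("https://github.com/softhints/python/raw/refs/heads/master/notebooks/csv/data.csv.zip")
# compression='zip' is required here: Pandas cannot infer it from a BytesIO object
df = pd.read_csv(io.BytesIO(r.read()), compression='zip', sep=',', nrows=3)
df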
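The same pattern can be tested without network access. In the sketch below, a locally built ZIP payload stands in for the bytes returned by r.read(). The member name and data are invented, and compression="zip" is passed explicitly because a BytesIO object carries no file extension.

```python
import io
import zipfile

import pandas as pd

# Stand-in for r.read(): build zipped CSV bytes locally instead of downloading.
raw = io.BytesIO()
with zipfile.ZipFile(raw, "w", compression=zipfile.ZIP_DEFLATED) as z:
    z.writestr("data.csv", "id,value\n1,10\n2,20\n3,30\n4,40\n")
payload = raw.getvalue()

# A BytesIO object has no file name, so the format is given explicitly.
df = pd.read_csv(io.BytesIO(payload), compression="zip", nrows=3)
print(df)
```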

Detect file encoding

To avoid errors such as UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128), we can detect the file encoding first:

import pandas as pd
import chardet
with open('/mnt/x/Datasets/data.csv', 'rb') as f:
    # result = chardet.detect(f.read())  # or readline if the file is not too large
    result = chardet.detect(f.readline())

result:

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
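The byte 0x8b at position 1 in the error above is also the second byte of the gzip signature (0x1f 0x8b), so one possible cause is a gzip file being read as plain text. The sketch below checks for that signature before reading. The helper name looks_like_gzip and the sample file are hypothetical, and this check complements encoding detection rather than replacing it.

```python
import gzip
import os
import tempfile

import pandas as pd

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(path):
    """Return True if the file starts with the gzip signature."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical case: a gzip file saved without the .gz extension.
    path = os.path.join(tmp, "data.csv")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("id,value\n1,10\n2,20\n")

    # Pick the compression from the file content, not from its name.
    is_gzip = looks_like_gzip(path)
    df = pd.read_csv(path, compression="gzip" if is_gzip else None)

print(is_gzip)
print(df)
```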
