Pandas can read compressed CSV and JSON files directly, without decompressing them manually first:

(1) Read a Compressed CSV

pd.read_csv('data.csv.gz', compression='gzip') 

(2) Read a Compressed JSON

pd.read_json('data.json.gz', compression='gzip') 

This is useful for large datasets: the files stay compressed on disk, which saves storage space. Depending on the data, a compressed text file can be up to 10 times smaller than the original.

Reading a Compressed CSV File with gzip

Use the compression parameter in pd.read_csv() to read CSV files in different compressed formats.

Example: Reading a Gzip-Compressed CSV File

import pandas as pd  

df = pd.read_csv('data.csv.gz', compression='gzip')  

df.head()

Other Supported Compression Formats

You can specify different compression types as needed:

  • bz2 - Bzip2 compression
  • zip - Zip compression
  • xz - XZ compression
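As a quick check of the formats listed above, the following minimal sketch writes a small DataFrame with each compression type and reads it back. The sample data and the temporary file names are made up for this illustration, and the `bz2` and `xz` cases assume your Python build includes the matching standard-library modules.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data, used only for this illustration.
df = pd.DataFrame({"id": [1, 2, 3], "city": ["Oslo", "Lima", "Pune"]})

# Map each compression name to the file extension conventionally used for it.
extensions = {"gzip": ".gz", "bz2": ".bz2", "zip": ".zip", "xz": ".xz"}

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, ext in extensions.items():
        path = os.path.join(tmp, "data.csv" + ext)
        # Write the frame with the chosen compression ...
        df.to_csv(path, index=False, compression=name)
        # ... and read it back, naming the same compression explicitly.
        results[name] = pd.read_csv(path, compression=name)

# Every format should give back the original data.
all_equal = all(df.equals(r) for r in results.values())
print(all_equal)
```

The same `compression` argument works for writing with `to_csv()`, so a round trip like this is a simple way to confirm that a format is available in your environment.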

The default is compression='infer', so if the file name ends with a matching extension (for example .gz, .bz2, .zip or .xz), Pandas detects the compression format automatically:

```python
df = pd.read_csv('data.csv.gz')  # No need to specify compression
```

Reading a Compressed JSON File

Pandas also supports reading compressed JSON files using pd.read_json():

df_json = pd.read_json('data.json.gz', compression='gzip') 
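A minimal round-trip sketch for compressed JSON is shown below. The sample data and file name are hypothetical, and the JSON Lines layout (`orient="records", lines=True`) is just one possible choice; whatever layout is used for writing has to be repeated when reading.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data, used only for this illustration.
df = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.75, 1.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.json.gz")
    # One JSON object per line (JSON Lines), compressed with gzip.
    df.to_json(path, orient="records", lines=True, compression="gzip")
    # lines=True must match how the file was written.
    df_json = pd.read_json(path, lines=True, compression="gzip")

print(df_json.shape)
```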

Multiple files found in ZIP file or nested folder

If you get an error like:

ValueError: Multiple files found in ZIP file. Only one file per ZIP: ['test/', 'test/data.csv']

you need to open the archive yourself and specify which file inside it should be read:

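import zipfile
import pandas as pd

# Open the archive and read one specific member by its path inside the ZIP
with zipfile.ZipFile("/mnt/x/test.zip") as z:
    with z.open("test/data.csv") as f:
        # header=0: the first row holds the column names; the file is tab-separated
        df = pd.read_csv(f, header=0, delimiter="\t")
df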

This approach works for ZIP archives that contain nested folders or multiple files.
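When an archive holds several CSV files, one option is to loop over its members and load each one. The sketch below builds a small ZIP archive in memory so it can run anywhere; the member names and contents are invented for the illustration, and it assumes all members share the same columns.

```python
import io
import zipfile

import pandas as pd

# Build a small ZIP archive in memory with two CSV members.
# The member names and contents are made up for this illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("test/a.csv", "id,value\n1,10\n2,20\n")
    z.writestr("test/b.csv", "id,value\n3,30\n")

buffer.seek(0)
frames = {}
with zipfile.ZipFile(buffer) as z:
    for name in z.namelist():
        # Skip directory entries such as 'test/' and non-CSV members.
        if name.endswith("/") or not name.endswith(".csv"):
            continue
        with z.open(name) as f:
            frames[name] = pd.read_csv(f)

# Stack the per-file frames into a single DataFrame.
combined = pd.concat(list(frames.values()), ignore_index=True)
print(combined)
```

Keeping the frames in a dictionary keyed by member name makes it easy to trace each row back to its source file before concatenating.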

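Pandas can also read a remote compressed CSV file: download the bytes, wrap them in io.BytesIO(r.read()) and pass the buffer to pd.read_csv(). A buffer has no file name from which the format could be inferred, so specify the compression explicitly:

import io
from urllib.request import urlopen
import pandas as pd

r = urlopen("https://github.com/softhints/python/raw/refs/heads/master/notebooks/csv/data.csv.zip")
# compression='zip' is required here: inference only works for paths and URLs, not buffers
df = pd.read_csv(io.BytesIO(r.read()), sep=',', nrows=3, compression='zip')
df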
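To try the buffer-based pattern without network access, the downloaded bytes can be replaced by data compressed in memory. This is only an offline sketch: the `raw` variable stands in for the result of `r.read()`, and gzip is used instead of ZIP purely because the standard library can produce it in one call.

```python
import gzip
import io

import pandas as pd

# Stand-in for the bytes returned by r.read(); no network access is needed.
raw = gzip.compress(b"id,name\n1,a\n2,b\n3,c\n4,d\n")

# A buffer has no file name, so the compression is stated explicitly.
df = pd.read_csv(io.BytesIO(raw), compression="gzip", nrows=3)
print(df.shape)
```

With `nrows=3` only the first three data rows are parsed, which is handy for previewing a large remote file.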

Detect file encoding

To avoid errors such as UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128), detect the file's encoding before reading it, for example with the chardet package:

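import pandas as pd
import chardet

path = '/mnt/x/Datasets/data.csv'
with open(path, 'rb') as f:
    # Detect from the first line only; for small files, chardet.detect(f.read()) checks the whole file
    result = chardet.detect(f.readline())

# Pass the detected encoding to Pandas
df = pd.read_csv(path, encoding=result['encoding'])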

result:

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
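A byte 0x8b at position 1 is also what a gzip file looks like when it is opened as text, because gzip data starts with the two bytes 0x1f 0x8b. So this error often means the file is compressed rather than stored in an unusual encoding. The sketch below checks for that signature; `looks_like_gzip` is a hypothetical helper written for this example, not a Pandas or chardet function.

```python
import gzip
import os
import tempfile


# Hypothetical helper, not part of pandas or chardet.
def looks_like_gzip(path):
    """Return True if the file starts with the gzip magic number 0x1f 0x8b."""
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"


with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.csv")
    gz_path = os.path.join(tmp, "packed.csv")  # gzip data without a .gz extension

    with open(plain_path, "w", encoding="utf-8") as f:
        f.write("id,name\n1,a\n")
    with gzip.open(gz_path, "wt", encoding="utf-8") as f:
        f.write("id,name\n1,a\n")

    plain_is_gzip = looks_like_gzip(plain_path)
    packed_is_gzip = looks_like_gzip(gz_path)

print(plain_is_gzip, packed_is_gzip)
```

If the check returns True for a file without a .gz extension, passing compression='gzip' to pd.read_csv() is usually the fix, rather than changing the encoding.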