To read compressed CSV and JSON files in Pandas directly, without decompressing them manually, use:
(1) Read a Compressed CSV
pd.read_csv('data.csv.gz', compression='gzip')
(2) Read a Compressed JSON
pd.read_json('data.json.gz', compression='gzip')
This is useful for handling large datasets: compressed files take less storage space, in some cases up to 10 times less, depending on the data and the compression format.
Reading a Compressed CSV File with gzip
Use the compression parameter in pd.read_csv() to read CSV files in different compressed formats.
Example: Reading a Gzip-Compressed CSV File
import pandas as pd
df = pd.read_csv('data.csv.gz', compression='gzip')
df.head()
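To try this without an existing file, the sketch below first writes a small gzip-compressed CSV and then reads it back. The sample data and the temporary file location are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data, used only for illustration
sample = pd.DataFrame({"id": [1, 2, 3], "city": ["Sofia", "Paris", "Rome"]})

# Write the frame as a gzip-compressed CSV into a temporary folder
path = os.path.join(tempfile.mkdtemp(), "data.csv.gz")
sample.to_csv(path, index=False, compression="gzip")

# Read it back; Pandas decompresses the file on the fly
df = pd.read_csv(path, compression="gzip")
print(df.head())
```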
Other Supported Compression Formats
You can specify different compression types as needed:
bz2 - Bzip2 compression
zip - Zip compression
xz - XZ compression
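As a rough sketch of how these values are used, the loop below writes the same small DataFrame in each format and reads it back with the matching compression argument. The sample data, file names and temporary folder are assumptions made for this example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({"x": [1, 2, 3], "y": [0.5, 1.5, 2.5]})
tmp_dir = tempfile.mkdtemp()

# Map each file extension to the value passed as `compression`
formats = {"gz": "gzip", "bz2": "bz2", "zip": "zip", "xz": "xz"}

results = {}
for ext, method in formats.items():
    path = os.path.join(tmp_dir, f"data.csv.{ext}")
    # Write with the given compression, then read back the same way
    sample.to_csv(path, index=False, compression=method)
    results[method] = pd.read_csv(path, compression=method)

for method, frame in results.items():
    print(method, frame.shape)
```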
If the file extension matches the compression format, Pandas can automatically detect it:
```python
df = pd.read_csv('data.csv.gz')  # No need to specify compression
```
Reading a Compressed JSON File
Pandas also supports reading compressed JSON files using pd.read_json():
df_json = pd.read_json('data.json.gz', compression='gzip')
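A minimal round trip for compressed JSON might look like the sketch below. The sample data is hypothetical, and the JSON Lines layout (orient="records", lines=True) is just one possible choice for this example; whichever layout was used for writing has to be used for reading as well.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({"name": ["a", "b"], "score": [10, 20]})
path = os.path.join(tempfile.mkdtemp(), "data.json.gz")

# orient="records" with lines=True writes one JSON object per line
sample.to_json(path, orient="records", lines=True, compression="gzip")

# The same orient/lines options are needed to read the file back
df_json = pd.read_json(path, orient="records", lines=True, compression="gzip")
print(df_json)
```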
Multiple files found in ZIP file or nested folder
If you get an error like:
ValueError: Multiple files found in ZIP file. Only one file per ZIP: ['test/', 'test/data.csv']
you need to specify which file inside the archive should be read:
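import zipfile
import pandas as pd
# Open the archive and read one specific member by its path inside the ZIP
with zipfile.ZipFile("/mnt/x/test.zip") as z:
    with z.open("test/data.csv") as f:
        # delimiter="\t" because this file is tab-separated
        df = pd.read_csv(f, header=0, delimiter="\t")
df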
This approach works for ZIP archives that contain nested folders or multiple files.
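When the archive holds several CSV files, one possible approach is to list its members with namelist() and read each of them in turn. The sketch below builds a small in-memory archive so it can run anywhere; the member names and contents are invented for the example.

```python
import io
import zipfile

import pandas as pd

# Build an in-memory ZIP with two CSV files inside a folder (illustrative names)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("test/a.csv", "id,value\n1,10\n2,20\n")
    z.writestr("test/b.csv", "id,value\n3,30\n")

frames = {}
with zipfile.ZipFile(buffer) as z:
    # namelist() can also contain folder entries such as "test/",
    # so keep only the members that end in .csv
    for name in z.namelist():
        if name.endswith(".csv"):
            with z.open(name) as f:
                frames[name] = pd.read_csv(f)

# Stack all files into a single DataFrame
combined = pd.concat(list(frames.values()), ignore_index=True)
print(combined)
```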
Read a Remote Compressed CSV File from a URL
Pandas can read a remote compressed CSV file if you download its bytes and wrap them in a buffer with io.BytesIO(r.read()):
import io
from urllib.request import urlopen
import pandas as pd
r = urlopen("https://github.com/softhints/python/raw/refs/heads/master/notebooks/csv/data.csv.zip")
# A buffer has no file name to infer from, so pass the compression explicitly
df = pd.read_csv(io.BytesIO(r.read()), compression='zip', sep=',', nrows=3)
df
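The parsing step can be tried without network access by simulating the downloaded bytes. In the sketch below, read_zipped_csv is a hypothetical helper (not part of Pandas), and the in-memory archive stands in for the result of r.read().

```python
import io
import zipfile

import pandas as pd


def read_zipped_csv(raw_bytes, **kwargs):
    """Parse the raw bytes of a ZIP archive that holds a single CSV file."""
    # A buffer has no file name, so the compression is stated explicitly
    return pd.read_csv(io.BytesIO(raw_bytes), compression="zip", **kwargs)


# Stand-in for r.read(): build the bytes of a small ZIP archive in memory
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("data.csv", "a,b\n1,2\n3,4\n5,6\n7,8\n")
raw = archive.getvalue()

# nrows=3 limits the result to the first three data rows
df = read_zipped_csv(raw, sep=",", nrows=3)
print(df)
```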
Detect file encoding
To avoid errors such as UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128)
we can detect the file's encoding before reading it:
import pandas as pd
import chardet
with open('/mnt/x/Datasets/data.csv', 'rb') as f:
    # result = chardet.detect(f.read())  # whole file: more reliable, fine if the file is not too large
    result = chardet.detect(f.readline())  # first line only: faster for large files
result:
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
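A byte 0x8b at position 1, as in the error above, often means the file is still gzip-compressed, because every gzip stream starts with the bytes 0x1f 0x8b. The sketch below checks for that signature before reading; is_gzipped is a hypothetical helper and the sample file is created only for the example.

```python
import gzip
import os
import tempfile

import pandas as pd

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


# Illustrative case: a gzip file saved without the .gz extension
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("a,b\n1,2\n")

# Choose the compression from the file content instead of its name
compression = "gzip" if is_gzipped(path) else None
df = pd.read_csv(path, compression=compression)
print(df)
```

If the check returns False, the problem is more likely a real encoding mismatch, and the detected encoding can be passed to pd.read_csv() through its encoding parameter.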