To save a DataFrame as a compressed CSV or JSON file with Pandas, pass the parameter compression='gzip'
as follows:
CSV
df.to_csv('data.csv.gz', index=False, compression='gzip')
JSON
df.to_json('data.json.gz', compression='gzip')
Saving a DataFrame as a Compressed CSV
You can save a Pandas DataFrame as a compressed CSV or JSON file using the compression parameter of the to_csv() or to_json() method. Pandas supports multiple compression formats:
- gzip
- bz2
- zip
- xz
- zstd
- tar
You can read more in the documentation for pandas.DataFrame.to_csv.
Example: Saving a DataFrame with gzip Compression
Below is a basic example of on-the-fly compression of the output data:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
df.to_csv('data.csv.gz', index=False, compression='gzip')
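The same approach should work for JSON. The sketch below is only an illustration: it assumes a line-delimited layout (orient='records' with lines=True), which is one of several possible choices, and it writes to a temporary directory so that no files are left behind. The variable names are made up for the example.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris'],
})

# Temporary location (an assumption for this sketch, not part of the original example).
tmp_dir = tempfile.mkdtemp()
json_path = os.path.join(tmp_dir, 'data.json.gz')

# One JSON object per row, gzip-compressed while writing.
df.to_json(json_path, orient='records', lines=True, compression='gzip')

# Read the compressed JSON Lines file back into a DataFrame.
restored = pd.read_json(json_path, lines=True, compression='gzip')
print(restored)
```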
Other Compression Formats
You can use a different compression format by changing the value of the compression parameter:
df.to_csv('data.csv.bz2', index=False, compression='bz2')
df.to_csv('data.zip', index=False, compression='zip')
df.to_csv('data.csv.xz', index=False, compression='xz')
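The compression parameter also accepts a dict, which is useful for zip because it lets you name the file stored inside the archive. A minimal sketch, assuming a reasonably recent Pandas version; the temporary directory and the member name data.csv are choices made for this example:

```python
import os
import tempfile
import zipfile

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Temporary location so the sketch leaves no files in the working directory.
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, 'data.zip')

# 'method' selects the format; 'archive_name' sets the member name inside the zip.
df.to_csv(zip_path, index=False,
          compression={'method': 'zip', 'archive_name': 'data.csv'})

# Inspect the archive to see what was stored.
with zipfile.ZipFile(zip_path) as archive:
    member_names = archive.namelist()

print(member_names)
```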
Reading the Compressed CSV File
To read a compressed CSV file back into a Pandas DataFrame, use pd.read_csv() with the compression parameter:
df = pd.read_csv('data.csv.gz', compression='gzip')
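The compression argument can usually be omitted: both to_csv() and read_csv() default to compression='infer', which picks the format from the file extension. A small round-trip sketch, using a temporary path chosen for this example:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

# Temporary location so the sketch leaves no files in the working directory.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, 'data.csv.gz')

# No compression argument: the '.gz' extension makes Pandas use gzip.
df.to_csv(gz_path, index=False)

# Reading infers gzip from the extension in the same way.
restored = pd.read_csv(gz_path)
print(restored)
```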
Compression Results
Compression can significantly reduce file size. In my tests I worked with a file that contains two columns, a URL and a short string:
https://www.example.com/south,Q6RnAzwGYA
https://www.example.com/mawson,zwGYAZc
https://www.example.com/sea,ZciVimr4
https://www.example.com/moo,4o6PwPjg
https://www.example.com/paul,Vimr4kvJw
You can find the results below:
- original file is 4.1 GB
- Pandas compression - 689 MB - 1.5 min
Ubuntu's default compression took a similar amount of time and produced a file of the same size, 689 MB.
The advantage of Pandas is that you can exclude columns you do not need before writing, which gives a smaller file after compression.
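One way to exclude columns is the columns parameter of to_csv(), which writes only the columns you list. The sketch below uses a tiny made-up DataFrame and a temporary path, so it illustrates the mechanism only and says nothing about the sizes reported above:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris'],
})

# Temporary location so the sketch leaves no files in the working directory.
tmp_dir = tempfile.mkdtemp()
slim_path = os.path.join(tmp_dir, 'slim.csv.gz')

# Only 'Name' and 'Age' are written; 'City' is left out of the compressed file.
df.to_csv(slim_path, columns=['Name', 'Age'], index=False, compression='gzip')

slim = pd.read_csv(slim_path, compression='gzip')
print(list(slim.columns))
```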