When working with large datasets, it's important to estimate how much memory a Pandas DataFrame will consume. This helps optimize performance and prevent memory errors.
(1) Memory usage per column:
df.memory_usage()
(2) Include object dtypes by introspecting each value for system-level memory consumption:
df.memory_usage(deep=True)
(3) Total memory usage:
print(df.memory_usage(deep=True).sum(), "bytes")
1. Use df.memory_usage()
Pandas provides the memory_usage() method to calculate memory usage per column:
import pandas as pd
import numpy as np
# Sample DataFrame
df = pd.DataFrame({
    "col1": np.random.randint(0, 100, 1000),
    "col2": np.random.random(1000),
    "col3": [str(i) for i in range(1000)]
})
# Memory usage per column
print(df.memory_usage(deep=True)) # 'deep=True' includes object types
result (taken from a larger real-world DataFrame with columns count, date, hour, url, netloc; the small sample above would list col1, col2, col3 instead):
Index 146309736
count 146309736
date 146309736
hour 146309736
url 146309736
...
dtype: int64
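To see the two modes side by side on the small sample DataFrame, here is a minimal sketch (continuing with the df defined above):

# Shallow vs. deep measurement; only the object column "col3" differs,
# because deep=True inspects each Python string instead of just the dtype
shallow = df.memory_usage()          # bytes, estimated from dtypes alone
deep = df.memory_usage(deep=True)    # bytes, inspecting every object

print(pd.DataFrame({"shallow": shallow, "deep": deep}))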
2. Estimate Total Memory Usage
To get the total DataFrame size in bytes:
print(df.memory_usage(deep=True).sum(), "bytes")
result:
14610252261 bytes
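Raw byte counts are hard to read at a glance. A small sketch that scales the total to megabytes, plus pandas' built-in df.info, which can report the same deep total in its footer:

# Scale the deep total to MB for readability
total_bytes = df.memory_usage(deep=True).sum()
print(f"{total_bytes / 1024 ** 2:.2f} MB")

# df.info(memory_usage='deep') lists per-column dtypes and ends with
# a line like "memory usage: ... MB" using the same deep measurement
df.info(memory_usage='deep')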
3. How memory_usage(deep=True) works
Compare the default shallow measurement (deep=False) with deep introspection:
df.memory_usage(deep=False)
result:
Index 146309736
count 146309736
date 146309736
hour 146309736
url 146309736
netloc 146309736
...
dtype: int64
vs
df.memory_usage(deep=True)
result:
Index 146309736
count 146309736
date 146309736
hour 146309736
url 4123904171
netloc 1005879435
...
dtype: int64
Notice that the difference affects only the object columns, which you can confirm with df.dtypes:
count int64
date datetime64[ns]
hour int64
url object
netloc object
...
dtype: object
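To quantify exactly how much deep introspection adds per column, subtract the shallow numbers; a short sketch:

# Per-column object overhead: deep minus shallow. Numeric and datetime
# columns come out as 0, so only the object columns survive the filter
delta = df.memory_usage(deep=True) - df.memory_usage(deep=False)
print(delta[delta > 0])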
4. Optimizing Memory Usage
To reduce memory usage:
- Convert integers to smaller types (int8, int16).
- Use categorical types for repetitive strings. A good indicator is checking unique-value counts with df.describe(include='all') (see the sketch after this list).
- Convert floating-point columns to lower precision (float16, float32).
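As a sketch of that categorical-candidate check (using the sample DataFrame's column names): compare each column's unique-value count against the row count; the lower the ratio, the more a categorical dtype will save:

# The 'unique' row of describe(include='all') is filled in for object
# columns and NaN for numeric ones
print(df.describe(include='all').loc['unique'])

# Or compute the ratio directly: values near 0.0 benefit most from
# 'category'; values near 1.0 (every row distinct) benefit least
print(df['col3'].nunique() / len(df))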
df["col1"] = df["col1"].astype("int16")
df["col3"] = df["col3"].astype("category")
print(df.memory_usage(deep=True).sum(), "bytes after optimization")
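The snippet above covers the integer and string bullets; for the floating-point bullet, pd.to_numeric can downcast as well. A sketch using the float column col2:

# Downcast float64 to the smallest float dtype that can hold the values
df["col2"] = pd.to_numeric(df["col2"], downcast="float")
print(df["col2"].dtype)  # float32 when the values fit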
After optimizing the larger dataset, we can compare per-column usage before vs. after (shown in human-readable form, as produced by the humansize helper from section 5):
Index 139.53 MB
count 139.53 MB
date 139.53 MB
hour 139.53 MB
url 3.84 GB
netloc 959.28 MB
dtype: object
vs
Index 139.53 MB
count 69.77 MB
date 139.53 MB
hour 17.44 MB
url 3.84 GB
netloc 17.44 MB
dtype: object
This is the result of applying:
df['hour'] = pd.to_numeric(df['hour'], downcast='integer')
df['count'] = pd.to_numeric(df['count'], downcast='integer')
df['netloc'] = df['netloc'].astype('category')
5. Human-readable memory info
This answer is inspired by a Stack Overflow answer shared in the resources section:
suffixes = ['B', 'KB', 'MB', 'GB', 'TB', 'PB']

def humansize(nbytes):
    i = 0
    while nbytes >= 1024 and i < len(suffixes) - 1:
        nbytes /= 1024.
        i += 1
    f = ('%.2f' % nbytes).rstrip('0').rstrip('.')
    return '%s %s' % (f, suffixes[i])
df.memory_usage(index=True, deep=True).apply(humansize)
output:
Index 139.53 MB
count 139.53 MB
date 139.53 MB
hour 139.53 MB
url 3.84 GB
netloc 959.28 MB
dtype: object
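The same helper works on the scalar total from section 2 as well; for the 14610252261-byte total above it yields '13.61 GB':

print(humansize(df.memory_usage(index=True, deep=True).sum()))
# 13.61 GB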