When working with large datasets, it's important to estimate how much memory a Pandas DataFrame will consume. This helps optimize performance and prevent memory errors.

(1) Calculate memory usage per column

df.memory_usage()

(2) Include object dtypes (actual system-level memory consumption)

df.memory_usage(deep=True)

(3) Total Memory Usage

print(df.memory_usage(deep=True).sum(), "bytes")

1. Use df.memory_usage()

Pandas provides the memory_usage() method to calculate memory usage per column:

import pandas as pd
import numpy as np

# Sample DataFrame
df = pd.DataFrame({
    "col1": np.random.randint(0, 100, 1000),
    "col2": np.random.random(1000),
    "col3": [str(i) for i in range(1000)]
})

# Memory usage per column
print(df.memory_usage(deep=True))  # 'deep=True' includes object types

result (the listings in the rest of this article come from a larger real-world DataFrame, not from the 1000-row sample above):

Index                 146309736
count                 146309736
date                  146309736
hour                  146309736
url                   146309736
...
dtype: int64
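
For any fixed-width column (int64, float64, datetime64[ns]) the shallow value is simply the number of rows multiplied by the item size. A minimal sanity check, reusing the sample DataFrame from above:

# For fixed-width dtypes the reported value is rows * bytes-per-item
n_rows = len(df)
for col in ["col1", "col2"]:  # the int and float columns of the sample
    expected = n_rows * df[col].dtype.itemsize
    print(col, df.memory_usage()[col], expected)  # the two numbers match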

2. Estimate Total Memory Usage

To get the total DataFrame size in bytes:

print(df.memory_usage(deep=True).sum(), "bytes")

result:

14610252261 bytes
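
If you only need a quick overview, df.info() with memory_usage='deep' prints a per-column summary and a human-friendly total on its last line:

# Quick alternative: the last line of the summary reports total memory usage
df.info(memory_usage="deep")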

3. How memory_usage(deep=True) Works

By default (deep=False), Pandas counts only the fixed-size buffers backing each column, so object columns are measured as arrays of pointers rather than by the strings they hold:

df.memory_usage(deep=False)

result:

Index                 146309736
count                 146309736
date                  146309736
hour                  146309736
url                   146309736
netloc                146309736
...
dtype: int64

vs

df.memory_usage(deep=True)

result:

Index                  146309736
count                  146309736
date                   146309736
hour                   146309736
url                   4123904171
netloc                1005879435
...
dtype: int64

Notice that the difference affects only the object columns, as the dtypes show:

count                          int64
date                  datetime64[ns]
hour                           int64
url                           object
netloc                        object
...
dtype: object
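
Under the hood, deep=True adds the size of each Python object on top of the pointer array that the shallow measurement reports, roughly what sys.getsizeof returns per element. A rough sketch of that relationship on the sample col3 column (exact numbers vary by Python and pandas version):

import sys

shallow = df.memory_usage()["col3"]        # pointer array only (8 bytes per row on 64-bit)
deep = df.memory_usage(deep=True)["col3"]  # pointers plus the strings themselves
per_element = sum(sys.getsizeof(s) for s in df["col3"])

print(shallow, deep, shallow + per_element)  # deep is approximately shallow + per_element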

4. Optimizing Memory Usage

To reduce memory usage:

  • Convert integers to smaller types (int8, int16).
  • Use categorical types for repetitive strings. A good indicator is the number of unique values reported by df.describe(include='all') (see the sketch after the code below).
  • Convert floating points to lower precision (float16, float32).

df["col1"] = df["col1"].astype("int16")
df["col3"] = df["col3"].astype("category")
print(df.memory_usage(deep=True).sum(), "bytes after optimization")
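
To decide which string columns are worth converting, compare the number of unique values to the number of rows; a low ratio means the column is repetitive. A quick sketch on the sample data (it works on object columns as well as already-converted category columns):

# A low unique/row ratio makes a column a good candidate for 'category'
for col in df.select_dtypes(include=["object", "category"]).columns:
    ratio = df[col].nunique() / len(df)
    print(f"{col}: {df[col].nunique()} unique values ({ratio:.1%} of rows)")

# df.describe(include='all') exposes the same information in its 'unique' row
print(df.describe(include="all").loc["unique"])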

After optimizing the larger dataset we can see the difference per column (values formatted with the humansize helper from section 5):

Index                  139.53 MB
count                  139.53 MB
date                   139.53 MB
hour                   139.53 MB
url                      3.84 GB
netloc                 959.28 MB
dtype: object

vs

Index                  139.53 MB
count                   69.77 MB
date                   139.53 MB
hour                    17.44 MB
url                      3.84 GB
netloc                  17.44 MB
dtype: object

This is the result of applying:

df.hour = pd.to_numeric(df.hour, downcast='integer')
df['count'] = pd.to_numeric(df['count'], downcast='integer')
df['netloc'] = df['netloc'].astype('category')
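
For reference, downcast='integer' picks the smallest integer subtype that can hold the values, which is why hour and count shrink so much. A minimal illustration:

import pandas as pd

s = pd.Series([0, 5, 23])
print(s.dtype)                                     # int64 on most 64-bit platforms
print(pd.to_numeric(s, downcast="integer").dtype)  # int8, since all values fit in 8 bits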

5. Human-Readable Memory Info

This answer is inspired by a StackOverflow post shared in the Resources section:

suffixes = ['B', 'KB', 'MB', 'GB', 'TB', 'PB']

# Convert a raw byte count into a human-readable string, e.g. 1536 -> '1.5 KB'
def humansize(nbytes):
    i = 0
    while nbytes >= 1024 and i < len(suffixes)-1:
        nbytes /= 1024.
        i += 1
    f = ('%.2f' % nbytes).rstrip('0').rstrip('.')
    return '%s %s' % (f, suffixes[i])

df.memory_usage(index=True, deep=True).apply(humansize)

output:

Index                  139.53 MB
count                  139.53 MB
date                   139.53 MB
hour                   139.53 MB
url                      3.84 GB
netloc                 959.28 MB
dtype: object
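
The same helper works for the grand total from step 2 as well:

total = df.memory_usage(index=True, deep=True).sum()
print(humansize(total))  # roughly '13.61 GB' for the total shown in step 2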

Resources