When working with large datasets, it's important to estimate how much memory a Pandas DataFrame will consume. This helps optimize performance and prevent memory errors.
(1) Memory usage per column:
df.memory_usage()
(2) Include object dtypes by introspecting each value for system-level memory consumption:
df.memory_usage(deep=True)
(3) Total memory usage:
print(df.memory_usage(deep=True).sum(), "bytes")
1. Use df.memory_usage()
Pandas provides the memory_usage() method to calculate memory usage per column:
import pandas as pd
import numpy as np
# Sample DataFrame
df = pd.DataFrame({
    "col1": np.random.randint(0, 100, 1000),
    "col2": np.random.random(1000),
    "col3": [str(i) for i in range(1000)]
})
# Memory usage per column
print(df.memory_usage(deep=True)) # 'deep=True' includes object types
result (taken from a larger real-world DataFrame with columns count, date, hour, url, netloc; the small sample above would list col1, col2, col3 instead):
Index 146309736
count 146309736
date 146309736
hour 146309736
url 146309736
...
dtype: int64
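To see the two modes side by side on the small sample DataFrame, here is a minimal sketch (continuing with the df defined above):

# Shallow vs. deep measurement; only the object column "col3" differs,
# because deep=True inspects each Python string instead of just the dtype
shallow = df.memory_usage()          # bytes, estimated from dtypes alone
deep = df.memory_usage(deep=True)    # bytes, inspecting every object

print(pd.DataFrame({"shallow": shallow, "deep": deep}))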
2. Estimate Total Memory Usage
To get the total DataFrame size in bytes:
print(df.memory_usage(deep=True).sum(), "bytes")
result:
14610252261 bytes
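Raw byte counts are hard to read at a glance. A small sketch that scales the total to megabytes, plus pandas' built-in df.info, which can report the same deep total in its footer:

# Scale the deep total to MB for readability
total_bytes = df.memory_usage(deep=True).sum()
print(f"{total_bytes / 1024 ** 2:.2f} MB")

# df.info(memory_usage='deep') lists per-column dtypes and ends with
# a line like "memory usage: ... MB" using the same deep measurement
df.info(memory_usage='deep')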
3. How memory_usage(deep=True) works
Compare the default shallow measurement (deep=False) with deep introspection:
df.memory_usage(deep=False)
result:
Index 146309736
count 146309736
date 146309736
hour 146309736
url 146309736
netloc 146309736
...
dtype: int64
vs
df.memory_usage(deep=True)
result:
Index 146309736
count 146309736
date 146309736
hour 146309736
url 4123904171
netloc 1005879435
...
dtype: int64
Notice that the difference affects only the object columns, which you can confirm with df.dtypes:
count int64
date datetime64[ns]
hour int64
url object
netloc object
...
dtype: object
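To quantify exactly how much deep introspection adds per column, subtract the shallow numbers; a short sketch:

# Per-column object overhead: deep minus shallow. Numeric and datetime
# columns come out as 0, so only the object columns survive the filter
delta = df.memory_usage(deep=True) - df.memory_usage(deep=False)
print(delta[delta > 0])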
4. Optimizing Memory Usage
To reduce memory usage:
- Convert integers to smaller types (int8, int16).
- Use categorical types for repetitive strings. A good indicator is checking unique-value counts with df.describe(include='all') (see the sketch after this list).
- Convert floating-point columns to lower precision (float16, float32).
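As a sketch of that categorical-candidate check (using the sample DataFrame's column names): compare each column's unique-value count against the row count; the lower the ratio, the more a categorical dtype will save:

# The 'unique' row of describe(include='all') is filled in for object
# columns and NaN for numeric ones
print(df.describe(include='all').loc['unique'])

# Or compute the ratio directly: values near 0.0 benefit most from
# 'category'; values near 1.0 (every row distinct) benefit least
print(df['col3'].nunique() / len(df))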
df["col1"] = df["col1"].astype("int16")
df["col3"] = df["col3"].astype("category")
print(df.memory_usage(deep=True).sum(), "bytes after optimization")
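The snippet above covers the integer and string bullets; for the floating-point bullet, pd.to_numeric can downcast as well. A sketch using the float column col2:

# Downcast float64 to the smallest float dtype that can hold the values
df["col2"] = pd.to_numeric(df["col2"], downcast="float")
print(df["col2"].dtype)  # float32 when the values fit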
After optimizing the larger dataset, we can compare per-column usage before vs. after (shown in human-readable form, as produced by the humansize helper from section 5):
Index 139.53 MB
count 139.53 MB
date 139.53 MB
hour 139.53 MB
url 3.84 GB
netloc 959.28 MB
dtype: object
vs
Index 139.53 MB
count 69.77 MB
date 139.53 MB
hour 17.44 MB
url 3.84 GB
netloc 17.44 MB
dtype: object
This is the result of applying:
df['hour'] = pd.to_numeric(df['hour'], downcast='integer')
df['count'] = pd.to_numeric(df['count'], downcast='integer')
df['netloc'] = df['netloc'].astype('category')
5. Human-readable memory info
This answer is inspired by a Stack Overflow answer shared in the resources section:
suffixes = ['B', 'KB', 'MB', 'GB', 'TB', 'PB']

def humansize(nbytes):
    i = 0
    while nbytes >= 1024 and i < len(suffixes) - 1:
        nbytes /= 1024.
        i += 1
    f = ('%.2f' % nbytes).rstrip('0').rstrip('.')
    return '%s %s' % (f, suffixes[i])
df.memory_usage(index=True, deep=True).apply(humansize)
output:
Index 139.53 MB
count 139.53 MB
date 139.53 MB
hour 139.53 MB
url 3.84 GB
netloc 959.28 MB
dtype: object
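The same helper works on the scalar total from section 2 as well; for the 14610252261-byte total above it yields '13.61 GB':

print(humansize(df.memory_usage(index=True, deep=True).sum()))
# 13.61 GB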