In this article, we'll explore how to convert columns to categorical in a Pandas DataFrame with practical examples. In data analysis, efficient memory usage and improved performance are crucial considerations.

Converting a column to categorical is as simple as:

df['col'].astype('category')

Let's dive into more details.

Why Use Categorical Data Type?

The categorical data type is beneficial when a column contains a limited and fixed set of unique values. It not only reduces memory usage but can also speed up certain operations, such as groupby and value_counts.

Note:

Categorical columns can often cut memory usage by 50 - 80%, and by more when a column has very few unique values relative to its length.
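
The saving comes from how a categorical column is stored: each distinct value is kept once, and every row holds only a small integer code that points at it. A minimal sketch of that layout, reusing the car types from the examples in this article:

```python
import pandas as pd

s = pd.Series(['Sedan', 'SUV', 'Truck', 'Sedan', 'Truck'], dtype='category')

# Each distinct value is stored once in the categories index ...
categories = list(s.cat.categories)
# ... and every row stores only an integer code referring to a category.
codes = s.cat.codes.tolist()

print(categories)  # ['SUV', 'Sedan', 'Truck']
print(codes)       # [1, 0, 2, 1, 2]
```

The fewer distinct values a column has relative to its length, the more this layout pays off.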

Example 1: Basic Conversion to Categorical

Consider a DataFrame with a column representing different car types. We can convert this column to a categorical type using the astype method:

import pandas as pd

data = {'CarType': ['Sedan', 'SUV', 'Truck', 'Sedan', 'Truck']}
df = pd.DataFrame(data)

df['CarType'] = df['CarType'].astype('category')
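
When several columns need converting, astype also accepts a column-to-dtype mapping. The sketch below assumes two extra columns, Color and Price, which are made up for illustration and are not part of the original example:

```python
import pandas as pd

df = pd.DataFrame({
    'CarType': ['Sedan', 'SUV', 'Truck', 'Sedan', 'Truck'],
    'Color': ['Red', 'Blue', 'Red', 'Black', 'Blue'],  # hypothetical column
    'Price': [20000, 35000, 40000, 22000, 41000],      # hypothetical column
})

# Only the columns named in the mapping are converted; Price stays numeric.
df = df.astype({'CarType': 'category', 'Color': 'category'})

print(df.dtypes)
```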

Example 2: Specifying Categories and Order

You can specify custom categories and their order using the pd.Categorical constructor. Let's consider a DataFrame with a 'Size' column:

data = {'Size': ['Small', 'Medium', 'Large', 'Small']}
df = pd.DataFrame(data)

df['Size'] = pd.Categorical(df['Size'], categories=['Small', 'Medium', 'Large'], ordered=True)

print(df['Size'].cat.categories)
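
With ordered=True, sorting and comparisons follow the declared category order rather than alphabetical order. A short sketch of what that enables, using the same Size column (the variable names are illustrative):

```python
import pandas as pd

size_order = ['Small', 'Medium', 'Large']
df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Small']})
df['Size'] = pd.Categorical(df['Size'], categories=size_order, ordered=True)

# Sorting uses the category order: Small < Medium < Large
sorted_sizes = df['Size'].sort_values().tolist()

# Comparisons against a category value are allowed for ordered categoricals
at_least_medium = (df['Size'] >= 'Medium').tolist()

# min() and max() also respect the declared order
largest = df['Size'].max()

print(sorted_sizes)     # ['Small', 'Small', 'Medium', 'Large']
print(at_least_medium)  # [False, True, True, False]
print(largest)          # Large
```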

Check if column is Categorical

To check if a column is categorical we can use:

df.dtypes

You can find the difference before and after conversion:

  • before
    • CarType object
  • converted to categorical
    • CarType category
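
For a programmatic check, one option is to test a column's dtype against pd.CategoricalDtype, or to list all categorical columns with select_dtypes. A minimal sketch, assuming a hypothetical numeric Price column next to CarType:

```python
import pandas as pd

df = pd.DataFrame({
    'CarType': ['Sedan', 'SUV', 'Truck'],
    'Price': [20000, 35000, 40000],  # hypothetical column
})
df['CarType'] = df['CarType'].astype('category')

# Check a single column
is_cat = isinstance(df['CarType'].dtype, pd.CategoricalDtype)

# List every categorical column in the DataFrame
categorical_columns = df.select_dtypes(include='category').columns.tolist()

print(is_cat)               # True
print(categorical_columns)  # ['CarType']
```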

Check which columns are good for Categorical

There are different ways to find out whether a column is suitable for the categorical type. Such columns contain a limited and fixed set of unique values.

We can use methods like the following to find potential columns for conversion:

df['col'].value_counts()
df.describe(include='all')
pd.get_dummies(df['col'])
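
Another rough heuristic is the share of unique values per column: the lower the ratio, the better the candidate. The sketch below is an illustration only. The 0.5 threshold is an arbitrary assumption, and the Vin column is a hypothetical ID column added for contrast:

```python
import pandas as pd

df = pd.DataFrame({
    'CarType': ['Sedan', 'SUV', 'Truck', 'Sedan', 'Truck'] * 4,
    'Vin': [f'VIN{i:04d}' for i in range(20)],  # hypothetical ID column
})

# Fraction of distinct values per column (1.0 means every row is unique)
ratio = df.nunique() / len(df)

# Assumed cut-off: keep columns where fewer than half the values are distinct
candidates = ratio[ratio < 0.5].index.tolist()

print(ratio)
print(candidates)  # ['CarType']
```

Columns with a ratio close to 1.0, such as IDs, gain little from conversion because nearly every value would become its own category.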

Test memory usage gain

To measure the benefit of using categorical columns in Pandas, we start with a larger DataFrame:

import pandas as pd

data = {'CarType': ['Sedan', 'SUV', 'Truck', 'Sedan', 'Truck'] * 10000}
df = pd.DataFrame(data)

We have two options for finding memory usage in Pandas:

df.memory_usage(deep=True)
df.info(memory_usage='deep')

Sample output of df.info after converting the column to categorical:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   CarType  50000 non-null  category
dtypes: category(1)
memory usage: 49.2 KB

Below you can compare the results before and after conversion:

  • before - memory usage: 2.9 MB
  • after - memory usage: 49.2 KB
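
The comparison can be reproduced with a short script like the one below. The exact figures depend on the Pandas version and on how strings are stored, so treat the printed numbers as indicative rather than fixed:

```python
import pandas as pd

data = {'CarType': ['Sedan', 'SUV', 'Truck', 'Sedan', 'Truck'] * 10000}
df = pd.DataFrame(data)

# deep=True also counts the memory held by the string values themselves
before = df['CarType'].memory_usage(deep=True)
after = df['CarType'].astype('category').memory_usage(deep=True)

print(f'before: {before / 1024:.1f} KB')
print(f'after:  {after / 1024:.1f} KB')
print(f'saved:  {1 - after / before:.1%}')
```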

This simple example shows the great benefit of using categorical columns in Pandas.

Summary

Converting column types to categorical in Pandas is a powerful technique for optimizing memory usage and enhancing data analysis performance. Whether dealing with nominal or ordinal categorical data, Pandas provides versatile tools for customization and conversion.

By incorporating these examples into your data analysis workflow, you can leverage the benefits of categorical data types and efficiently handle large datasets.

Note:

Nominal data involves categorization without any ranking, while ordinal data involves both categorization and ranking.

Resources