In this article, we'll explore how to convert columns to categorical in a Pandas DataFrame with practical examples. In data analysis, efficient memory usage and improved performance are crucial considerations.
Converting a column to categorical is as simple as:
df['col'] = df['col'].astype('category')
Let's dive into more details.
Why Use Categorical Data Type?
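The categorical data type is beneficial when dealing with columns containing a limited and fixed set of unique values. This not only optimizes memory usage but also enhances the performance of certain operations, such as groupby and value_counts.
The savings depend on how many unique values the column has relative to its length. For columns with few unique values, memory reductions of 50-80% are common, and they can be larger for long columns of repeated strings.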
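To see where the savings come from, it can help to look at how a categorical column is stored. The following is a minimal sketch using the standard `.cat.categories` and `.cat.codes` accessors; the variable name `s` is only illustrative.

```python
import pandas as pd

s = pd.Series(['Sedan', 'SUV', 'Truck', 'Sedan', 'Truck'], dtype='category')

# Each distinct value is stored only once, in the categories index
# (inferred categories are sorted when the values are sortable).
print(list(s.cat.categories))  # ['SUV', 'Sedan', 'Truck']

# Every row then holds a small integer code pointing at its category.
print(list(s.cat.codes))       # [1, 0, 2, 1, 2]
```

Because the rows hold small integers instead of repeated strings, the column shrinks as long as the number of distinct values stays small compared to the number of rows.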
Example 1: Basic Conversion to Categorical
Consider a DataFrame with a column representing different car types. We can convert this column to a categorical type using the astype method:
import pandas as pd
data = {'CarType': ['Sedan', 'SUV', 'Truck', 'Sedan', 'Truck']}
df = pd.DataFrame(data)
df['CarType'] = df['CarType'].astype('category')
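The same astype method also accepts a dictionary, which may be convenient when several columns should be converted at once. The sketch below assumes a hypothetical DataFrame `df_multi` with extra 'Color' and 'Price' columns that are not part of the example above.

```python
import pandas as pd

# Hypothetical data: 'Color' and 'Price' are illustrative columns.
df_multi = pd.DataFrame({
    'CarType': ['Sedan', 'SUV', 'Truck'],
    'Color': ['Red', 'Blue', 'Red'],
    'Price': [20000, 35000, 40000],
})

# Map each column to its target dtype; columns not listed keep their dtype.
df_multi = df_multi.astype({'CarType': 'category', 'Color': 'category'})
print(df_multi.dtypes)
```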
Example 2: Specifying Categories and Order
You can specify custom categories and their order using the pd.Categorical constructor. Let's consider a DataFrame with a 'Size' column:
data = {'Size': ['Small', 'Medium', 'Large', 'Small']}
df = pd.DataFrame(data)
df['Size'] = pd.Categorical(df['Size'], categories=['Small', 'Medium', 'Large'], ordered=True)
print(df['Size'].cat.categories)
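Once a categorical is ordered, sorting, comparisons and min/max follow the declared category order rather than alphabetical order. A small sketch, using an illustrative Series named `sizes` with the same categories as above:

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(
    ['Small', 'Medium', 'Large', 'Small'],
    categories=['Small', 'Medium', 'Large'],
    ordered=True,
))

# Sorting uses the category order Small < Medium < Large.
print(sizes.sort_values().tolist())  # ['Small', 'Small', 'Medium', 'Large']

# Comparisons against a category value are allowed for ordered categoricals.
print((sizes > 'Small').tolist())    # [False, True, True, False]

# min and max also respect the declared order.
print(sizes.min(), sizes.max())      # Small Large
```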
Check if column is Categorical
To check if a column is categorical, we can inspect the column dtypes:
df.dtypes
You can see the difference before and after conversion:
- before: CarType object
- after conversion to categorical: CarType category
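The check can also be done programmatically instead of reading the dtypes output. The sketch below uses `pd.CategoricalDtype` and `select_dtypes`; the DataFrame `df_check` and its 'Doors' column are illustrative assumptions.

```python
import pandas as pd

df_check = pd.DataFrame({'CarType': ['Sedan', 'SUV', 'Truck'], 'Doors': [4, 4, 2]})
df_check['CarType'] = df_check['CarType'].astype('category')

# Test a single column.
is_cat = isinstance(df_check['CarType'].dtype, pd.CategoricalDtype)
print(is_cat)    # True

# List every categorical column in the DataFrame.
cat_cols = df_check.select_dtypes(include='category').columns.tolist()
print(cat_cols)  # ['CarType']
```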
Check which columns are good for Categorical
There are different ways to find out whether a column is suitable for the categorical type. Such a column contains a limited and fixed set of unique values.
To find potential columns for conversion, we can use methods like:
df['col'].value_counts()
df.describe(include='all')
pd.get_dummies(df['col'])
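One possible way to automate this search is to compare the number of unique values in each text column with the column length. The helper below, `categorical_candidates`, and its `max_ratio` threshold are hypothetical names and an arbitrary cutoff chosen for illustration, not a Pandas API.

```python
import pandas as pd

def categorical_candidates(df, max_ratio=0.5):
    """Return {column: unique/total ratio} for text columns with few unique values."""
    candidates = {}
    for col in df.columns:
        s = df[col]
        # Only consider text-like columns; numeric columns are skipped.
        if not (pd.api.types.is_object_dtype(s) or pd.api.types.is_string_dtype(s)):
            continue
        ratio = s.nunique() / max(len(s), 1)
        if ratio <= max_ratio:
            candidates[col] = ratio
    return candidates

# Illustrative data: 'Vin' is unique per row, so it is a poor candidate.
df_demo = pd.DataFrame({
    'CarType': ['Sedan', 'SUV', 'Truck', 'Sedan', 'Truck'] * 20,
    'Vin': [f'VIN{i:05d}' for i in range(100)],
    'Doors': [4, 4, 2, 4, 2] * 20,
})
print(categorical_candidates(df_demo))  # {'CarType': 0.03}
```

A low ratio suggests a good candidate. A column where almost every value is unique gains little from conversion and may even use more memory.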
Test memory usage gain
To measure the benefit of using categorical columns in Pandas, we will build a larger DataFrame:
import pandas as pd
data = {'CarType': ['Sedan', 'SUV', 'Truck', 'Sedan', 'Truck'] * 10000}
df = pd.DataFrame(data)
We have two options to check memory usage in Pandas:
df.memory_usage(deep=True)
df.info(memory_usage='deep')
Sample output of df.info(memory_usage='deep') after converting CarType to category:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   CarType  50000 non-null  category
dtypes: category(1)
memory usage: 49.2 KB
Below you can compare the results before and after conversion:
- before - memory usage: 2.9 MB
- after - memory usage: 49.2 KB
This simple example shows the great benefit of using categorical columns in Pandas.
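The comparison can also be computed directly rather than read from the info output. This is a minimal sketch; the names `df_mem`, `before` and `after` are illustrative, and the exact byte counts depend on the Pandas version and string storage, so no specific numbers are shown.

```python
import pandas as pd

df_mem = pd.DataFrame({'CarType': ['Sedan', 'SUV', 'Truck', 'Sedan', 'Truck'] * 10000})

# deep=True counts the actual string contents, not just the pointers.
before = df_mem['CarType'].memory_usage(deep=True)
after = df_mem['CarType'].astype('category').memory_usage(deep=True)

print(f'before: {before / 1024:.1f} KB, after: {after / 1024:.1f} KB')
print(f'saved: {1 - after / before:.1%}')
```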
Summary
Converting column types to categorical in Pandas is a powerful technique for optimizing memory usage and enhancing data analysis performance. Whether dealing with nominal data (categorization without any ranking) or ordinal data (categorization with ranking), Pandas provides versatile tools for customization and conversion.
By incorporating these examples into your data analysis workflow, you can leverage the benefits of categorical data types and efficiently handle large datasets.