In this quick tutorial, we'll cover how to apply a function to a single column in Pandas.
Here are two ways to apply a function to a column in a DataFrame:
(1) Apply a user-defined function to a column
df['col'].map(my_function)
(2) Apply a lambda to a column
df['col'].apply(lambda x: x + 1)
For multiple columns check: How to apply function to multiple columns in Pandas.
In the next section we will cover several different use cases and important details on the topic.
Let's say that we have the following DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,5,size=(5, 4)), columns=list('ABCD'))
df
DataFrame:
A | B | C | D |
---|---|---|---|
4 | 2 | 2 | 3 |
1 | 0 | 0 | 3 |
0 | 2 | 1 | 0 |
3 | 2 | 4 | 2 |
1 | 1 | 4 | 1 |
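Because the values come from np.random.randint, each run produces a different table. A minimal sketch, assuming you want to reproduce the exact numbers shown above, builds the same DataFrame from literal values (the name df_fixed is illustrative and not part of the original example):

```python
import pandas as pd

# Literal copy of the table shown above, so results are reproducible.
df_fixed = pd.DataFrame(
    {
        "A": [4, 1, 0, 3, 1],
        "B": [2, 0, 2, 2, 1],
        "C": [2, 0, 1, 4, 4],
        "D": [3, 3, 0, 2, 1],
    }
)
print(df_fixed)
```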
For large DataFrames use Dask or swifter - check Option 4!
Option 1: Pandas apply function to column
The first example will show how to define a function and then apply it on a column from a Pandas DataFrame.
First we define a function, and then we call it on column A with the Series apply method:
def my_function(x):
    return x ** 2
df['A'].apply(my_function)
The result is squared values for each cell:
0 16
1 1
2 0
3 9
4 1
Name: A, dtype: int64
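If the function needs extra parameters, apply on a Series can forward them through args or as keyword arguments. A small sketch, assuming a hypothetical helper named power and literal values copied from column A of the table above:

```python
import pandas as pd

# Column A from the example table, written out literally.
col_a = pd.Series([4, 1, 0, 3, 1], name="A")


def power(x, exponent):
    """Raise a single value to the given exponent (hypothetical helper)."""
    return x ** exponent


# Extra positional arguments go in `args`; keyword arguments are forwarded as-is.
cubed_positional = col_a.apply(power, args=(3,))
cubed_keyword = col_a.apply(power, exponent=3)

print(cubed_positional.tolist())
```

Both calls pass 3 as the exponent, so they should return the same Series.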
Option 2: Pandas apply function to column by map
**A better way to apply a function to a single column is the Pandas map method.**
Why is it better? Because map is defined only on a Pandas Series and always works element by element, while apply also exists on DataFrames, where it passes whole columns or rows to the function. A single column selected from a DataFrame is a Pandas Series, a one-dimensional labeled array.
The map method can also be slightly faster than apply for large DataFrames.
Applying the function with map looks like this:
def my_function(x):
    return x ** 2
df['A'].map(my_function)
The result is the same as Option 1.
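Besides functions, map also accepts a dict (or another Series) as a lookup table. A brief sketch with a hypothetical labels mapping and values copied from column A; keys missing from the dict become NaN:

```python
import pandas as pd

col_a = pd.Series([4, 1, 0, 3, 1], name="A")

# Hypothetical lookup table; values without a key (3 and 4 here) map to NaN.
labels = {0: "zero", 1: "one"}
mapped = col_a.map(labels)

print(mapped.tolist())
```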
Option 3: Pandas apply anonymous function / lambda to column
Sometimes a lambda or anonymous function is what you would like to apply to a column.
The syntax is very simple:
df['A'].map(lambda x: x ** 2)
or:
df['A'].apply(lambda x: x ** 2)
The difference is the same as before: apply is also available at the DataFrame level, while map works only at the Series level.
So you can do:
df[['A', 'B']].apply(lambda x: x ** 2)
in order to apply the lambda to multiple columns.
It's recommended to make your intention explicit: 1) use map for a single column, or 2) use apply if other columns may be added in the future.
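The Series-level versus DataFrame-level distinction becomes visible when the function inspects its argument. In this sketch (values copied from the example table; the variable names are illustrative), map hands the lambda one scalar at a time, whereas apply on a DataFrame hands it a whole column:

```python
import pandas as pd

df_small = pd.DataFrame({"A": [4, 1, 0, 3, 1], "B": [2, 0, 2, 2, 1]})

# Series.map: the function is called once per cell with a scalar.
squared_a = df_small["A"].map(lambda x: x ** 2)

# DataFrame.apply: the function is called once per column with a Series,
# so Series methods such as max() and min() are available.
value_range = df_small[["A", "B"]].apply(lambda col: col.max() - col.min())

print(squared_a.tolist())
print(value_range.to_dict())
```

An arithmetic lambda such as x ** 2 works in both settings because the operation is defined for scalars and for whole Series alike.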
Option 4: Speed up Pandas apply function to column
Finally, let's cover how to speed up applying a function to a single column in Pandas.
To optimize execution there are several options like:
- Dask
- Swifter
You can find more info in the Resource section.
To test all options we will create a DataFrame with 10000 rows × 4 columns. In the snippets below, fnc is the function being applied.
Dask - apply function to column - the fastest tested way:
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=2)
ddf["A"].apply(fnc, meta=('A', 'int64'))
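The result is shown below. Keep in mind that Dask evaluates lazily, so without a call to .compute() this line builds the task graph rather than running fnc on the data: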
612 µs ± 3.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
swifter - fast apply function to column - in the tested example it's much faster than the apply method and a bit slower than Dask:
df['A'].swifter.apply(fnc)
The result is:
752 µs ± 3.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Pandas apply:
3.63 ms ± 28.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Pandas map:
3.57 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
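For a simple arithmetic function like squaring, a vectorized expression on the Series avoids the per-element Python call altogether and is worth including in any comparison. The sketch below assumes a 10000 × 4 frame built the same way as the small example and a hypothetical fnc that squares its input. It uses timeit from the standard library; timings depend on the machine and library versions, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed test data: same construction as the small example, with 10000 rows.
df_big = pd.DataFrame(np.random.randint(0, 5, size=(10000, 4)), columns=list("ABCD"))


def fnc(x):
    """Hypothetical stand-in for the benchmarked function."""
    return x ** 2


candidates = {
    "apply": lambda: df_big["A"].apply(fnc),
    "map": lambda: df_big["A"].map(fnc),
    "vectorized": lambda: df_big["A"] ** 2,
}

# All three approaches should produce the same values.
results = {name: run() for name, run in candidates.items()}
assert results["apply"].tolist() == results["map"].tolist()
assert results["map"].tolist() == results["vectorized"].tolist()

# Average wall-clock time per call over a small number of repetitions.
for name, run in candidates.items():
    seconds = timeit.timeit(run, number=20) / 20
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

The vectorized form only applies when the function can be expressed with Series operations; for arbitrary Python logic, map or apply remain the general tools.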