In this tutorial, we'll learn how to normalize single columns or a whole DataFrame in Pandas. We will cover several approaches:
(1) Min Max normalization
for whole DataFrame
(df-df.min())/(df.max()-df.min())
for column:
(df['col']-df['col'].min())/(df['col'].max()-df['col'].min())
(2) Mean normalization
(df-df.mean())/df.std()
(3) biased normalization
StandardScaler().fit_transform(df.to_numpy())
Let's cover all examples in more detail.
Setup
For this post we create an example DataFrame with 3 numeric columns:
import pandas as pd
data = {'day': [1, 2, 3, 4, 5, 6, 7, 8],
'temp': [9, 8, 6, 13, 10, 15, 9, 10],
'humidity': [0.89, 0.86, 0.54, 0.73, 0.45, 0.63, 0.95, 0.67]}
df = pd.DataFrame(data=data)
The first rows of the data look like:
| | day | temp | humidity |
---|---|---|---|
0 | 1 | 9 | 0.89 |
1 | 2 | 8 | 0.86 |
2 | 3 | 6 | 0.54 |
3 | 4 | 13 | 0.73 |
4 | 5 | 10 | 0.45 |
1: Min Max normalization in Pandas
Let's start with min max normalization (also called min max scaling) in Pandas and Python. It rescales the values to the range from 0 to 1.
Single column
To do min max scaling for a single column we can do:
(df['humidity']-df['humidity'].min())/(df['humidity'].max()-df['humidity'].min())
The result is a normalized Series:
0 0.88
1 0.82
2 0.18
3 0.56
4 0.00
5 0.36
6 1.00
7 0.44
Name: humidity, dtype: float64
Comparing the normalized values with the original column:
| | humidity_norm | humidity |
---|---|---|
0 | 0.88 | 0.89 |
1 | 0.82 | 0.86 |
2 | 0.18 | 0.54 |
3 | 0.56 | 0.73 |
4 | 0.00 | 0.45 |
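The comparison above can be reproduced by storing the scaled values in a new column. This is a minimal sketch: the helper name `min_max` and the column name `humidity_norm` are chosen here for illustration and are not part of Pandas.

```python
import pandas as pd

df = pd.DataFrame({'humidity': [0.89, 0.86, 0.54, 0.73, 0.45, 0.63, 0.95, 0.67]})

def min_max(series):
    """Rescale a Series to the range from 0 to 1 (hypothetical helper)."""
    return (series - series.min()) / (series.max() - series.min())

# Keep the original column and add the scaled one next to it
df['humidity_norm'] = min_max(df['humidity'])
print(df[['humidity_norm', 'humidity']].head())
```

Wrapping the formula in a function keeps it reusable for other columns.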
All columns
To normalize all columns of a DataFrame we can use:
(df-df.min())/(df.max()-df.min())
This results in:
| | day | temp | humidity |
---|---|---|---|
0 | 0.000000 | 0.333333 | 0.88 |
1 | 0.142857 | 0.222222 | 0.82 |
2 | 0.285714 | 0.000000 | 0.18 |
3 | 0.428571 | 0.777778 | 0.56 |
4 | 0.571429 | 0.444444 | 0.00 |
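One edge case of this formula is a column whose values are all equal: the range is zero and the division produces NaN. The sketch below shows one possible guard, assuming we are happy to map such a column to all zeros. The `flag` column is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'day': [1, 2, 3, 4],
                   'temp': [9, 8, 6, 13],
                   'flag': [1, 1, 1, 1]})  # constant column, added for illustration

col_range = df.max() - df.min()
# A zero range would give 0/0 = NaN, so divide by 1 instead for constant columns
safe_range = col_range.replace(0, 1)
df_norm = (df - df.min()) / safe_range
print(df_norm)
```

Whether a constant column should become 0, stay NaN, or be dropped depends on the use case, so treat the replacement value as an assumption.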
2: Mean normalization in Pandas
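Next, let's see how to do mean normalization in Pandas and Python. The formulas below subtract the mean and divide by the standard deviation, which is also known as standardization or z-score normalization.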
Single column
For a single column we can apply mean normalization by:
(df['humidity'] - df['humidity'].mean())/df['humidity'].std()
The result and the original values:
| | humidity_norm | humidity |
---|---|---|
0 | 0.993475 | 0.89 |
1 | 0.823165 | 0.86 |
2 | -0.993475 | 0.54 |
3 | 0.085155 | 0.73 |
4 | -1.504406 | 0.45 |
All columns
To normalize the whole DataFrame with mean normalization we can do:
(df-df.mean())/df.std()
The result is:
| | day | temp | humidity |
---|---|---|---|
0 | -1.428869 | -0.353553 | 0.993475 |
1 | -1.020621 | -0.707107 | 0.823165 |
2 | -0.612372 | -1.414214 | -0.993475 |
3 | -0.204124 | 1.060660 | 0.085155 |
4 | 0.204124 | 0.000000 | -1.504406 |
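A quick sanity check for this kind of normalization is that every column ends up with mean 0 and standard deviation 1. The sketch below rebuilds the example DataFrame and verifies this; the variable name `df_std` is chosen here for illustration.

```python
import pandas as pd

data = {'day': [1, 2, 3, 4, 5, 6, 7, 8],
        'temp': [9, 8, 6, 13, 10, 15, 9, 10],
        'humidity': [0.89, 0.86, 0.54, 0.73, 0.45, 0.63, 0.95, 0.67]}
df = pd.DataFrame(data)

df_std = (df - df.mean()) / df.std()

# After normalization each column should have mean ~0 and std ~1
print(df_std.mean().round(6))
print(df_std.std().round(6))
```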
3: Biased normalization in Pandas
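To perform biased normalization in Pandas we can use the library sklearn. The results differ from the Pandas normalization above because StandardScaler divides by the population standard deviation (ddof=0), while df.std() uses the sample standard deviation (ddof=1) by default.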
import pandas as pd
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit_transform(df.to_numpy())
The results are:
| | 0 | 1 | 2 |
---|---|---|---|
0 | -1.527525 | -0.377964 | 1.062070 |
1 | -1.091089 | -0.755929 | 0.880001 |
2 | -0.654654 | -1.511858 | -1.062070 |
3 | -0.218218 | 1.133893 | 0.091035 |
4 | 0.218218 | 0.000000 | -1.608277 |
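The same biased values can be obtained without sklearn by passing `ddof=0` to `std`. The sketch below compares both variants and cross-checks the biased one against NumPy, whose `std` also defaults to `ddof=0`. The variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

data = {'day': [1, 2, 3, 4, 5, 6, 7, 8],
        'temp': [9, 8, 6, 13, 10, 15, 9, 10],
        'humidity': [0.89, 0.86, 0.54, 0.73, 0.45, 0.63, 0.95, 0.67]}
df = pd.DataFrame(data)

unbiased = (df - df.mean()) / df.std()      # sample std, ddof=1 (Pandas default)
biased = (df - df.mean()) / df.std(ddof=0)  # population std, ddof=0

# NumPy's std defaults to ddof=0, so it should agree with the biased version
values = df.to_numpy()
numpy_biased = (values - values.mean(axis=0)) / values.std(axis=0)
print(np.allclose(biased.to_numpy(), numpy_biased))
print(biased.head())
```

Unlike the array returned by the scaler, this version keeps the column names.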
4: Normalize rows in Pandas
There are multiple ways to normalize rows:
- by row sum
- by row mean
- by row min max
Normalize rows by their sum
To normalize each row by its sum in Pandas we can do:
df.div(df.sum(axis=1), axis=0)
which will give us:
| | day | temp | humidity |
---|---|---|---|
0 | 0.091827 | 0.826446 | 0.081726 |
1 | 0.184162 | 0.736648 | 0.079190 |
2 | 0.314465 | 0.628931 | 0.056604 |
3 | 0.225606 | 0.733221 | 0.041173 |
4 | 0.323625 | 0.647249 | 0.029126 |
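After dividing by the row sum, the values in every row should add up to 1. The sketch below rebuilds the example DataFrame and checks this; the variable name `row_norm` is chosen here for illustration.

```python
import pandas as pd

data = {'day': [1, 2, 3, 4, 5, 6, 7, 8],
        'temp': [9, 8, 6, 13, 10, 15, 9, 10],
        'humidity': [0.89, 0.86, 0.54, 0.73, 0.45, 0.63, 0.95, 0.67]}
df = pd.DataFrame(data)

# axis=1 sums across columns (one total per row);
# axis=0 in div aligns those totals with the row index
row_norm = df.div(df.sum(axis=1), axis=0)

# Each row of the result should sum to 1
print(row_norm.sum(axis=1))
```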
Transpose
To normalize row-wise in Pandas we can combine:
- .T to transpose rows to columns
- df.values to get the values as a numpy array
Let's see an example:
import pandas as pd
from sklearn import preprocessing
data = df.T.values
scaler = preprocessing.MinMaxScaler()
pd.DataFrame(scaler.fit_transform(data)).T
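The code above scales each row to the range from 0 to 1. The resulting values are:
array([[0.0135635 , 1. , 0. ],
[0.15966387, 1. , 0. ],
[0.45054945, 1. , 0. ],
[0.26650367, 1. , 0. ],
[0.47643979, 1. , 0. ],
[0.3736952 , 1. , 0. ],
[0.7515528 , 1. , 0. ],
[0.78563773, 1. , 0. ]])
For comparison, scaling df.values without the transpose normalizes each column instead and matches the min max result from section 1: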
array([[0. , 0.33333333, 0.88 ],
[0.14285714, 0.22222222, 0.82 ],
[0.28571429, 0. , 0.18 ],
[0.42857143, 0.77777778, 0.56 ],
[0.57142857, 0.44444444, 0. ],
[0.71428571, 1. , 0.36 ],
[0.85714286, 0.33333333, 1. ],
[1. , 0.44444444, 0.44 ]])
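Row-wise min max scaling can also be written in plain Pandas, without sklearn or a transpose. This is a minimal sketch using `sub` and `div` with `axis=0`; the variable names are chosen here for illustration.

```python
import pandas as pd

data = {'day': [1, 2, 3, 4, 5, 6, 7, 8],
        'temp': [9, 8, 6, 13, 10, 15, 9, 10],
        'humidity': [0.89, 0.86, 0.54, 0.73, 0.45, 0.63, 0.95, 0.67]}
df = pd.DataFrame(data)

row_min = df.min(axis=1)                # smallest value in each row
row_range = df.max(axis=1) - row_min    # max - min for each row

# Subtract the row minimum and divide by the row range, aligned on the index
row_minmax = df.sub(row_min, axis=0).div(row_range, axis=0)
print(row_minmax)
```

This version keeps the original column names and index, which the sklearn route drops when it converts the data to a numpy array.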
Conclusion
In this article we learned how to normalize columns and whole DataFrames in Pandas. We covered several kinds of normalization: min max, mean (unbiased), biased, and normalization by row sum.
We also saw how to normalize the rows of a DataFrame. Normalizing data is very useful in machine learning and data visualization.