In this short post, I'll show you how to get most frequent values in Pandas DataFrame.
You can find also: How to Get Top 10 Highest or Lowest Values in Pandas
To start, here are the ways to get most frequent N values in your DataFrame:
df['Magnitude'].value_counts()
df['Magnitude'].mode()
In the next steps we will cover more details in simple examples
Step 1: Create Sample DataFrame
To start, let's create DataFrame with data from Kaggle:Significant Earthquakes, 1965-2016.
To learn more about Pandas and Kaggle: How to Search and Download Kaggle Dataset to Pandas DataFrame
import pandas as pd
cols = ['Date', 'Time', 'Latitude', 'Longitude', 'Depth', 'Magnitude Type', 'Type', 'ID', 'Magnitude']
df = pd.read_csv(f'../data/earthquakes_1965_2016_database.csv.zip')[cols]
data:
Date | Latitude | Longitude | Depth | Magnitude Type | Magnitude |
---|---|---|---|---|---|
01/02/1965 | 19.246 | 145.616 | 131.6 | MW | 6.0 |
01/04/1965 | 1.863 | 127.352 | 80.0 | MW | 5.8 |
01/05/1965 | -20.579 | -173.972 | 20.0 | MW | 6.2 |
01/08/1965 | -59.076 | -23.557 | 15.0 | MW | 5.8 |
01/09/1965 | 11.938 | 126.427 | 15.0 | MW | 5.8 |
Step 2: Get Most Frequent value of Column in Pandas
To get the most frequent value of a column we can use the method mode
. It will return the value that appears most often. It can be multiple values.
So to get the most frequent value in a single column - 'Magnitude' we can use:
df['Magnitude'].mode()
result:
0 5.5
dtype: float64
It has few parameters like:
numeric_only
dropna
axis
If the most frequent values are multiple they will be returned:
df['Time'].mode()
result:
0 02:56:58
1 14:09:03
dtype: object
Step 3: Get Most Frequent value for all columns in Pandas
In order to get the most frequent value for all columns we can iterate through all columns of the DataFrame:
dfs = []
for col in df.columns:
top_values = []
top_values = df[col].mode()
dfs.append(pd.DataFrame({col: top_values}).reset_index(drop=True))
pd.concat(dfs, axis=1)
the most frequent values for all columns:
Date | Time | Depth | Magnitude Type | Type | Magnitude |
---|---|---|---|---|---|
03/11/2011 | 02:56:58 | 10.0 | MW | Earthquake | 5.5 |
NaN | 14:09:03 | NaN | NaN | NaN | NaN |
Or get the most frequent values per dtype-s - for example only numeric columns:
from pandas.api.types import is_numeric_dtype
dfs = []
for col in df.columns:
top_values = []
if is_numeric_dtype(df[col]):
top_values = df[col].mode()
dfs.append(pd.DataFrame({col: top_values}).reset_index(drop=True))
pd.concat(dfs, axis=1)
The most frequent values for numeric columns
Depth | Magnitude |
---|---|
10.0 | 5.5 |
Step 4: Get N most frequent values in a column
To get 5, 10 or N most frequent values in a single column we can use method value_counts
:
df['Magnitude'].value_counts()
it will return the frequency for all values in the column:
5.50 4685
5.60 3967
5.70 3079
5.80 2346
5.90 1947
6.00 1580
....
Name: Magnitude, dtype: int64
To get the N most frequent values only without the count we can use:
n = 5
df['Magnitude'].value_counts().index.tolist()[:n]
result:
[5.5, 5.6, 5.7, 5.8, 5.9]
To get only the count:
n = 5
df['Magnitude'].value_counts().values.tolist()[:n]
result:
[4685, 3967, 3079, 2346, 1947]
To get the single most frequent value from value_counts
we can combine it with idmax
:
df['Magnitude'].value_counts().idxmax()
output:
5.5
To get the count of the most frequent value:
df['Magnitude'].value_counts().max()
output:
4685
Step 5: Get Top 10 most frequent values for all columns
Finally let's cover the case - how to get the most frequent values for all columns in DataFrame.
For this purpose we are going to use loop:
from pandas.api.types import is_categorical_dtype
for col in df.columns:
print(col, end=' - \n')
print('_' * 50)
if col in ['Magnitude'] or is_categorical_dtype(col):
display(pd.DataFrame(df[col].astype('str').value_counts().sort_values(ascending=False).head(3)))
else:
display(pd.DataFrame(df[col].value_counts().sort_values(ascending=False).head(5)))
In the example above we are applying different behaviour for categorical columns. Because value_counts can't be applied directly on them.
result:
Date -
Date
03/11/2011 128
12/26/2004 51
02/27/2010 39
Time -
Time
02:56:58 5
14:09:03 5
16:25:34 4