List of Aggregation Functions(aggfunc) for GroupBy in Pandas

In this article, you can find the list of the available aggregation functions for groupby in Pandas:

count / nunique – non-null values / count number of unique values
min / max – minimum/maximum
first / last - return first or last value per group
unique - all unique values from the group
std – standard deviation
sum – sum of values
mean / median / mode – mean/median/mode
var - unbiased variance
mad - mean absolute deviation
skew - unbiased skew
sem - standard error of the mean
quantile

Those functions can be used with groupby in order to return statistical information about the groups.

In the next section we will cover all aggregation functions with simple examples.

Step 1: Create DataFrame for aggfunc

Let us use the earthquake dataset. We are going to create new column year_month and groupby by it:

import pandas as pd

df = pd.read_csv(f'../data/earthquakes_1965_2016_database.csv.zip')
cols = ['Date', 'Time', 'Latitude', 'Longitude', 'Depth', 'Magnitude Type', 'Type', 'ID']
df = df[cols]

result:

Date	Latitude	Longitude	Depth	Type
01/02/1965	19.246	145.616	131.6	Earthquake
01/04/1965	1.863	127.352	80.0	Earthquake
01/05/1965	-20.579	-173.972	20.0	Earthquake
01/08/1965	-59.076	-23.557	15.0	Earthquake
01/09/1965	11.938	126.427	15.0	Earthquake

Next we are going to create new column with information - combination of the year and the month:

df['Date'] = pd.to_datetime(df['Date'], utc=True)
df['year_month'] = df['Date'].dt.to_period('M')

Step 2: Pandas describe DataFrame

In each step we will see examples of using each of the aggregating functions associated with Pandas groupby function.

We will start with the method describe. This method returns basic information about the column:

df.groupby('year_month')['Depth'].describe()

describe returns multiple aggfunc-s like: count, mean, std, min, max:

count	mean	std	min	25%	50%	75%	max
13.0	101.115385	152.237697	15.0	20.000	35.0	95.000	565.0
54.0	47.712963	80.122216	10.0	20.075	25.1	34.375	482.9
38.0	62.055263	97.890288	10.0	25.000	31.3	47.850	560.8
33.0	112.163636	183.812083	10.0	25.000	30.7	64.700	635.0
22.0	80.972727	114.894839	10.0	25.875	35.0	105.000	553.8

Step 3: Pandas all aggfunc for DataFrame

In this step you can find examples for all aggfunc-s applied on a DataFrame. The list of the functions is below.

Note that by default method groupby will exclude all NaN values. In order to change this behavior you can use parameter - dropna=False

aggfuncs = [ 'count', 'sum', 'sem', 'skew', 'mean', 'min', 'max', 
'std', 'quantile', 'nunique', 'mad', 'size', pd.Series.mode, 'var', 'unique']
df.groupby('year_month', dropna=False)['Depth'].agg(aggfuncs)

result:

year_month	1965-01	1965-02
count	13	54
sum	1314.5	2576.5
sem	42.22314	10.903253
skew	2.744523	4.256526
mean	101.115385	47.712963
min	15.0	10.0
max	565.0	482.9
std	152.237697	80.122216
quantile	35.0	25.1
nunique	9	30
mad	95.56213	39.163306
size	13	54
mode	20.0	25.0
var	23176.31641	6419.569451
unique	[131.6, 80.0, 20.0, 15.0, 35.0, 95.0, 565.0, 227.9, 55.0]	[482.9, 15.0, 10.0, 30.3, 30.0, 25.0, 20.0, 24.0, 31.8, 39.5, 30.4, 17.8, 27.7, 30.1, 37.4, 17.5, 22.5, 25.2, 17.7, 32.5, 200.0, 100.0, 53.5, 340.0, 55.0, 35.0, 40.4, 40.0, 20.3, 160.0]

Step 4: Pandas aggfunc - Count, Nunique, Size, Unique

In this step we will cover 4 aggregation functions:

count - compute count of group, excluding missing values
size - compute group sizes
unique - return unique values
nunique - return number of unique elements in the group.

Example of using the functions and the result:

aggfuncs = [ 'count', 'size', 'nunique', 'unique']
df.groupby('year_month')['Depth'].agg(aggfuncs)

output:

	count	size	nunique	unique
year_month
1965-01	13	13	9	[131.6, 80.0, 20.0, 15.0, 35.0, 95.0, 565.0, 227.9, 55.0]
1965-02	54	54	30	[482.9, 15.0, 10.0, 30.3, 30.0, 25.0, 20.0, 24.0, 31.8, 39.5, 30.4, 17.8, 27.7, 30.1, 37.4, 17.5, 22.5, 25.2, 17.7, 32.5, 200.0, 100.0, 53.5, 340.0, 55.0, 35.0, 40.4, 40.0, 20.3, 160.0]
1965-03	38	38	24	[30.0, 40.0, 33.6, 105.2, 10.0, 15.0, 14.8, 25.7, 200.0, 35.0, 560.8, 45.0, 50.0, 20.0, 207.8, 32.1, 25.0, 224.9, 28.9, 30.5, 48.8, 55.0, 70.0, 75.0]
1965-04	33	33	22	[60.0, 10.0, 30.7, 25.0, 20.0, 39.2, 50.0, 65.0, 17.5, 543.7, 635.0, 570.0, 421.7, 165.0, 15.0, 35.0, 70.0, 30.0, 36.8, 37.1, 64.7, 480.0]
1965-05	22	22	16	[15.0, 22.5, 90.0, 110.0, 10.0, 50.0, 153.1, 25.0, 35.0, 68.4, 120.0, 553.8, 28.5, 30.1, 125.0, 150.0]

Step 5: Pandas aggfunc - First and Last

There are two functions which can return the first or the last value of the group. They are:

first - compute first of group values
last - compute first of group values

Example of their usage:

aggfuncs = [ 'first', 'last']
df.groupby('year_month')['Depth'].agg(aggfuncs)

result:

	first	last
year_month
1965-01	131.6	55.0
1965-02	482.9	10.0
1965-03	30.0	75.0
1965-04	60.0	480.0
1965-05	15.0	150.0

Step 6: Pandas aggfunc - Sum, Min, Max

For numeric or datetime columns we can get the minimum, maximum or the sum by those aggfunc-s:

sum - compute sum of group values
min - compute min of group values
max - compute max of group values

How to get the sum, maximum and the minimum per group:

aggfuncs = [ 'sum', 'min', 'max']
df.groupby('year_month')['Depth'].agg(aggfuncs)

the result is:

	sum	min	max
year_month
1965-01	1314.5	15.0	565.0
1965-02	2576.5	10.0	482.9
1965-03	2358.1	10.0	560.8
1965-04	3701.4	10.0	635.0
1965-05	1781.4	10.0	553.8

Step 7: Pandas aggfunc - Mean, Median, Mode

There are several very important statistics which are:

The mean is the average of a group values
The mode is the most common number in a group
The median is the middle of the group values

They are implemented in Pandas as functions:

mean - compute mean of groups, excluding missing values
pd.Series.mode - return the mode(s) of the Series.
median - compute median of groups, excluding missing values.

They can be compute on Pandas groupby object by next syntax:

aggfuncs = [ 'mean', 'median', pd.Series.mode]
df.groupby('year_month')['Depth'].agg(aggfuncs)

result:

	mean	median	mode
year_month
1965-01	101.115385	35.0	20.0
1965-02	47.712963	25.1	25.0
1965-03	62.055263	31.3	30.0
1965-04	112.163636	30.7	25.0
1965-05	80.972727	35.0	35.0

Step 7: Pandas aggfunc - STD, MAD, Var

Another important methods in statistics are:

std - compute standard deviation of groups, excluding missing value
var - compute variance of groups, excluding missing values
mad - return the mean absolute deviation of the values over the requested axis

How to calculate the standard deviation, variance and mean absolute deviation of groups:

aggfuncs = ['mad', 'std', 'var']
df.groupby('year_month')['Depth'].agg(aggfuncs)

output:

	mad	std	var
year_month
1965-01	95.562130	152.237697	23176.316410
1965-02	39.163306	80.122216	6419.569451
1965-03	53.121745	97.890288	9582.508485
1965-04	129.843526	183.812083	33786.881761
1965-05	66.826446	114.894839	13200.823983

Step 7: Pandas aggfunc - Skew, Sem, quantile

Let's check few other functions which are not very popular like:

skew - return unbiased skew over requested axis
sem - compute standard error of the mean of groups, excluding missing values
quantile - return group values at the given quantile, a la numpy.percentile

aggfuncs = [ 'skew', 'sem', 'quantile']
df.groupby('year_month')['Depth'].agg(aggfuncs)

result:

	skew	sem	quantile
year_month
1965-01	2.744523	42.223140	35.0
1965-02	4.256526	10.903253	25.1
1965-03	4.036620	15.879902	31.3
1965-04	2.054374	31.997576	30.7
1965-05	3.604639	24.495662	35.0

Step 8: User defined aggfunc for groupby

It's possible in Pandas to define your own aggfunc and use it with a groupby method.

In the next example we will define a function which will compute the NaN values in each group:

def countna(x):
    return (x.isna()).sum()

df.groupby('year_month')['Depth'].agg([countna])

result:

	countna
year_month
1965-01	0
1965-02	0
1965-03	0
1965-04	0
1965-05	0

Step 9: Pandas aggfuncs from scipy or numpy

Finally let's check how to use aggregation functions with groupby from scipy or numpy

Below you can find a scipy example applied on Pandas groupby object:

from scipy import stats

df.groupby('year_month')['Depth'].agg(lambda x: stats.mode(x)[0])

result:

year_month
1965-01    20.0
1965-02    25.0
1965-03    30.0
1965-04    25.0

Example for numpy.count_nonzero method used with Pandas groupby method:

import numpy as np

df.groupby('year_month')['Depth'].agg(np.count_nonzero)

Output:

year_month
1965-01    13
1965-02    54
1965-03    38
1965-04    33

Step 1: Create DataFrame for aggfunc

Step 2: Pandas describe DataFrame

Step 3: Pandas all aggfunc for DataFrame

Step 4: Pandas aggfunc - Count, Nunique, Size, Unique

Step 5: Pandas aggfunc - First and Last

Step 6: Pandas aggfunc - Sum, Min, Max

Step 7: Pandas aggfunc - Mean, Median, Mode

Step 7: Pandas aggfunc - STD, MAD, Var

Step 7: Pandas aggfunc - Skew, Sem, quantile

Step 8: User defined aggfunc for groupby

Step 9: Pandas aggfuncs from scipy or numpy

Resources