To group by multiple columns in Pandas DataFrame we can use method groupby()?

We will cover:

  • group by multiple columns
  • group by several statistical functions
  • named group by
  • Series vs DataFrame group by

To group by multiple columns and using several statistical functions we are going to use next functions:

  • groupby()
  • agg()
  • 'mean', 'count', 'sum'
df.groupby(['publication', 'date_m']).agg(['mean', 'count', 'sum'])

Let's see all the steps in order to find the statistics for each group.

Step 1: Create sample DataFrame

Let's use the following Kaggle Dataset:

title date publication url
A Round of Applause for Algorithms 2019-02-18 Towards Data Science https://towardsdatascience.com/a-round-of-applause-for-algorithms-3322f6aa1f8e
The Power of Thank-You Notes: A Simple Way to Make Your Team Happier (And More Productive) 2019-10-21 The Startup https://medium.com/swlh/the-power-of-thank-you-notes-a-simple-way-to-make-your-team-happier-and-more-productive-fc6f2a575de2
The Struggle of Modern Day Intrusion Detection Systems 2019-09-17 Towards Data Science https://towardsdatascience.com/the-struggle-of-modern-day-intrusion-detection-systems-50481a6b53c6
A Meditation on Stringing Words Together: The National’s “Roman Holiday” 2019-05-22 The Startup https://medium.com/swlh/a-meditation-on-stringing-words-together-the-nationals-roman-holiday-7acbfef7cc02
A Sense of Purpose Enables Better Human-Robot Collaboration 2019-10-14 Towards Data Science https://towardsdatascience.com/a-sense-of-purpose-enables-better-human-robot-collaboration-fbe64d0ae913

You can find the sample data from the repository of the notebook or use the link below to download it.

To learn more about reading Kaggle data with Python and Pandas: How to Search and Download Kaggle Dataset to Pandas DataFrame

Step 2: Group by multiple columns

First lets see how to group by a single column in a Pandas DataFrame you can use the next syntax:

df.groupby(['publication'])

In order to group by multiple columns you need to use the next syntax:

df.groupby(['publication', 'date_m'])

The columns should be provided as a list to the groupby method.

Step 3: GroupBy SeriesGroupBy vs DataFrameGroupBy

The object returned after the groupby of multiple columns depends on the usage of the groups. Let's check it by examples:

df.groupby('publication')

returns:

pandas.core.groupby.generic.DataFrameGroupBy
df.groupby(['publication'])['url']

returns:

pandas.core.groupby.generic.SeriesGroupBy
df.groupby(['publication'])[['url', 'date']]

returns:

pandas.core.groupby.generic.DataFrameGroupBy

If you use a single column after the groupby you will get SeriesGroupBy otherwise you will have DataFrameGroupBy.

Step 4: Apply multiple statistical functions

Applying multiple aggregation functions to a groupby is done by method: agg.

Note: that another function aggregate exists which and agg is an alias for it.

The functions can be passed as a list.

The available aggregation functions for group by in Pandas are:

  • count – non-null values
  • min / `max – minimum/maximum
  • std – standard deviation
  • sum – sum of values
  • mean / median – mean/median
  • mode
  • var

Below you can find the syntax in order to apply multiple statistical functions on multiple pandas columns while grouping by multiple columns:

df.groupby(['publication', 'date_m'])[['claps', 'reading_time']].agg(['mean', 'count', 'sum'])

result:

claps reading_time
mean count sum mean count sum
publication date_m
Better Humans 2019-03 1710.800000 5 8554 12.000000 5 60
2019-04 3942.250000 4 15769 7.500000 4 30
2019-05 1187.000000 4 4748 25.000000 4 100
2019-06 9700.000000 1 9700 5.000000 1 5
2019-07 1183.333333 3 3550 22.333333 3 67

If you don't provide columns after the grouper than all numeric/datetime columns will be returned:

df.groupby(['publication', 'date_m']).agg(['mean'])

all columns on which agg functions can be applied are returned:

id claps reading_time date
mean mean mean mean
publication date_m
Better Humans 2019-03 5138.200000 1710.800000 12.000000 2019-03-17 04:48:00
2019-04 2308.250000 3942.250000 7.500000 2019-04-24 18:00:00
2019-05 3382.000000 1187.000000 25.000000 2019-05-14 06:00:00
2019-06 6326.000000 9700.000000 5.000000 2019-06-27 00:00:00
2019-07 1189.333333 1183.333333 22.333333 2019-07-24 00:00:00

Note that the above results have MultiIndex. In order to work with MultiIndex you can check those articles:

Step 5: Pandas groupby and named aggregations

What if you like to group by multiple columns with several aggregation functions and would like to have - named aggregations.

In other words you need to get the next information:

  • id - mean of this column
  • claps - mean, count and range

You can also name the columns to meet your needs. Finally you can get them without MultiIndex. This is possible by next syntax:

df.groupby('publication').agg(
             id_mean = ('id', 'mean'),
             claps_mean = ('claps', 'mean'),
             claps_count = ('claps', 'count'),
             claps_range = ('claps', lambda x: x.max() - x.min()))

this will return:

id_mean claps_mean claps_count claps_range
publication
Better Humans 3004.285714 1827.785714 28 9660
Better Marketing 2954.578512 829.347107 242 22971
Data Driven Investor 3491.115681 95.034704 778 4100
The Startup 3238.427162 303.403815 3041 38000
The Writing Cooperative 3539.476427 372.744417 403 12600

Resources