In this short guide, I'll show you how to group by several columns and count in Python and Pandas.

Several examples will show how to group data and apply statistical functions like sum, count, and mean.

Pro Tip 1
It's recommended to use the method df.value_counts for counting the size of groups in Pandas. It's a bit faster and supports the `dropna` parameter since Pandas 1.3.
Pro Tip 2
Sorting the results of groupby/count or value_counts will slow the process down by roughly 20%.
Warning
Be careful when counting NaN values. They can change the expected results and counts. For value_counts, use the parameter dropna=False if you want NaN values included in the counts.
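To illustrate the dropna behavior, here is a minimal sketch on a small synthetic DataFrame (not the Kaggle dataset used below):

```python
import pandas as pd

# Synthetic example with one missing publication
df = pd.DataFrame({'publication': ['A', 'A', None, 'B']})

# Default dropna=True: the row with NaN is excluded from the counts
print(df.value_counts(['publication']))

# dropna=False (Pandas >= 1.3): NaN forms its own group
print(df.value_counts(['publication'], dropna=False))
```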

To start, here is the syntax that you can apply in order to group by and count in a Pandas DataFrame:

df.groupby(['publication', 'date_m'])['url'].count()

The DataFrame used in this article is available from Kaggle.

If you like to learn more about how to read Kaggle as a Pandas DataFrame check this article: How to Search and Download Kaggle Dataset to Pandas DataFrame

url date publication
https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92 2019-05-30 Towards Data Science
https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8 2019-05-30 Towards Data Science
https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129 2019-05-30 Towards Data Science
https://towardsdatascience.com/databricks-how-to-save-files-in-csv-on-your-local-computer-3d0c70e6a9ab 2019-05-30 Towards Data Science
https://towardsdatascience.com/a-step-by-step-implementation-of-gradient-descent-and-backpropagation-d58bda486110 2019-05-30 Towards Data Science

Option 1: GroupBy and Count in Pandas

Let's say that you would like to use groupby and then get some statistics for the groups.

In this first step we will count the number of publications per month from the DataFrame above.

First we need to convert the date to month format - YYYY-MM (learn more about it in the article: Extract Month and Year from DateTime column in Pandas):

df['date'] = pd.to_datetime(df['date'])
df['date_m'] = df['date'].dt.to_period('M')

and then we can group by 'publication', 'date_m' and count the URLs per each group:

df.groupby(['publication', 'date_m'])['url'].count()

This will result in:

publication    date_m
Better Humans  2019-03     5
               2019-04     4
               2019-05     4
               2019-06     1
               2019-07     3
                          ..
UX Collective  2019-08    24
               2019-09    33
               2019-10    86
               2019-11    28
               2019-12    46

An important note: count computes the number of values in each group excluding missing values - NaN entries in the counted column are not included.
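Putting the two steps together on a small synthetic DataFrame (made-up publications, dates, and URLs; one url is missing to show the exclusion):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Kaggle dataset
df = pd.DataFrame({
    'publication': ['Better Humans', 'Better Humans', 'UX Collective'],
    'date': ['2019-03-05', '2019-03-20', '2019-08-01'],
    'url': ['https://example.com/a', np.nan, 'https://example.com/b'],
})
df['date'] = pd.to_datetime(df['date'])
df['date_m'] = df['date'].dt.to_period('M')

# The missing url in March is excluded by count()
counts = df.groupby(['publication', 'date_m'])['url'].count()
print(counts)
```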

Option 2: GroupBy and Aggregate functions in Pandas

In Pandas the groupby method returns an object like <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f26bd45da20> - you can check this by running df.groupby(['publication', 'date_m']).

This kind of object has an agg function which can take a list of aggregation methods.

This is very useful if you need several statistics per group. So if we want to group by publication and date_m and then compute the aggregation functions mean, count, and sum, we can use:

df.groupby(['publication', 'date_m']).agg(['mean', 'count', 'sum'])

we will get a DataFrame with a MultiIndex like:

                          reading_time
                                  mean  count  sum
publication      date_m
Better Humans    2019-03     12.000000      5   60
                 2019-04      7.500000      4   30
                 2019-05     25.000000      4  100
                 2019-06      5.000000      1    5
                 2019-07     22.333333      3   67
                 2019-09     12.000000      3   36
                 2019-10      9.750000      4   39
                 2019-11      9.000000      2   18
                 2019-12      9.500000      2   19
Better Marketing 2019-03      9.600000      5   48
                 2019-04      5.500000      4   22
                 2019-05      6.529412     34  222
                 2019-06      7.000000     15  105
                 2019-07      6.106383     47  287
                 2019-08      5.880000     25  147
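If you prefer flat column names instead of a MultiIndex, agg also supports named aggregation (Pandas >= 0.25). A sketch with synthetic data - the column names mean_rt, count_rt, and sum_rt are made up for the example:

```python
import pandas as pd

# Synthetic reading_time values
df = pd.DataFrame({'publication': ['A', 'A', 'B'],
                   'date_m': ['2019-03', '2019-03', '2019-04'],
                   'reading_time': [10, 14, 5]})

# Named aggregation: one flat output column per statistic
out = df.groupby(['publication', 'date_m']).agg(
    mean_rt=('reading_time', 'mean'),
    count_rt=('reading_time', 'count'),
    sum_rt=('reading_time', 'sum'),
)
print(out)
```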

Option 3: GroupBy, Count and value_counts in Pandas

In the latest versions of pandas (>= 1.1) you can use value_counts in order to achieve behaviour similar to groupby and count.

df.value_counts(['publication', 'date_m'])

will give us:

publication    date_m
The Startup    2019-05    656
               2019-10    477
               2019-07    401
               2019-12    310
               2019-06    286
                         ...
Better Humans  2019-07      3
UX Collective  2019-01      2
Better Humans  2019-12      2
               2019-11      2
               2019-06      1
Length: 79, dtype: int64

Note the code above is equivalent to:

df.groupby(['publication', 'date_m'])['url'].count().sort_values(ascending=False)
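You can check the equivalence on a small synthetic DataFrame - the two approaches produce the same groups and counts (only the Series names differ):

```python
import pandas as pd

df = pd.DataFrame({'publication': ['A', 'A', 'B'],
                   'date_m': ['2019-03', '2019-03', '2019-04'],
                   'url': ['u1', 'u2', 'u3']})

a = df.groupby(['publication', 'date_m'])['url'].count().sort_values(ascending=False)
b = df.value_counts(['publication', 'date_m'])

# Compare group keys and counts
print(dict(a.items()) == dict(b.items()))
```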

Option 4: GroupBy and Count + Size in Pandas

An alternative solution is to use groupby and size in order to count the elements per group in Pandas. The example below demonstrates the usage of size:

df.groupby(['publication', 'date_m']).size()

The result is a Pandas Series like:

publication    date_m
Better Humans  2019-03     5
               2019-04     4
               2019-05     4
               2019-06     1
               2019-07     3
                          ..
UX Collective  2019-08    24
               2019-09    33
               2019-10    86
               2019-11    28
               2019-12    46
Length: 79, dtype: int64
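The difference from count shows up when there are missing values: size counts rows, while count counts non-NaN values. A minimal sketch with synthetic data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'publication': ['A', 'A', 'B'],
                   'url': ['u1', np.nan, 'u2']})

print(df.groupby('publication').size())          # rows per group, NaN included
print(df.groupby('publication')['url'].count())  # non-NaN urls only
```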

Option 5: GroupBy and describe in Pandas

The final option is to use the method describe. It will return statistical information which can be extremely useful, like:

df.groupby(['publication', 'date_m'])['reading_time'].describe()

which will return stats like:

  • count
  • mean
  • std, etc.
                          count       mean        std   min    25%   50%    75%   max
publication      date_m
Better Humans    2019-03    5.0  12.000000   5.196152   3.0  13.00  13.0  15.00  16.0
                 2019-04    4.0   7.500000   5.196152   2.0   4.25   7.0  10.25  14.0
                 2019-05    4.0  25.000000   7.958224  17.0  21.50  23.5  27.00  36.0
                 2019-06    1.0   5.000000        NaN   5.0   5.00   5.0   5.00   5.0
                 2019-07    3.0  22.333333  15.307950  13.0  13.50  14.0  27.00  40.0
                 2019-09    3.0  12.000000   4.582576   8.0   9.50  11.0  14.00  17.0
                 2019-10    4.0   9.750000   5.123475   3.0   7.50  10.5  12.75  15.0
                 2019-11    2.0   9.000000   1.414214   8.0   8.50   9.0   9.50  10.0
                 2019-12    2.0   9.500000   2.121320   8.0   8.75   9.5  10.25  11.0
Better Marketing 2019-03    5.0   9.600000   8.080842   5.0   6.00   6.0   7.00  24.0
                 2019-04    4.0   5.500000   1.000000   5.0   5.00   5.0   5.50   7.0
                 2019-05   34.0   6.529412   4.039467   2.0   5.00   5.0   7.00  20.0
                 2019-06   15.0   7.000000   2.725541   3.0   5.50   6.0   8.50  13.0
                 2019-07   47.0   6.106383   2.664044   3.0   4.00   5.0   7.00  13.0
                 2019-08   25.0   5.880000   2.818392   3.0   4.00   5.0   6.00  15.0
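If you only need a few of these statistics, you can slice the describe output. A sketch on synthetic data:

```python
import pandas as pd

df = pd.DataFrame({'publication': ['A', 'A', 'A', 'B'],
                   'reading_time': [3, 13, 16, 5]})

stats = df.groupby('publication')['reading_time'].describe()
# Keep only the columns you are interested in
print(stats[['count', 'mean', 'std']])
```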

Performance comparison of groupby & count vs value_counts

Finally, let's do a quick comparison in terms of performance between:

  • groupby and count in Pandas
  • value_counts

The next two lines return equivalent results:

df.groupby(['publication', 'date_m'])['publication'].count()
df.value_counts(subset=['publication', 'date_m'], sort=False)

Below you can find the timings:

df.groupby

3.59 ms ± 24.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

df.value_counts

1.05 ms ± 4.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

df.value_counts with sort parameter

1.2 ms ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
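The timings above came from %timeit in a notebook. To reproduce a comparison like this in a plain script, here is a sketch using the timeit module on a synthetic DataFrame - the absolute numbers will differ from the article's:

```python
import timeit

import numpy as np
import pandas as pd

# Build a random DataFrame with the same two grouping columns
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    'publication': rng.choice(list('ABCDE'), n),
    'date_m': rng.choice([f'2019-{m:02d}' for m in range(1, 13)], n),
})

for stmt in ("df.groupby(['publication', 'date_m'])['publication'].count()",
             "df.value_counts(subset=['publication', 'date_m'], sort=False)"):
    ms = timeit.timeit(stmt, globals={'df': df}, number=100) / 100 * 1000
    print(f'{ms:.3f} ms per loop: {stmt}')
```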

Resources