In this short guide, we'll see how to use groupby() on several columns and count unique rows in Pandas.
Several examples will explain how to group by and apply statistical functions like sum, count, mean etc. Often there is a need to group by a column and then get sum() and count().
It's recommended to use the method df.value_counts for counting the size of groups in Pandas. It's a bit faster and supports the parameter dropna since Pandas 1.3.

Sorting the results of groupby/count or value_counts will slow the process by roughly 20%.

Be careful when counting NaN values. They can change the expected results and counts. For value_counts, use the parameter dropna=False to include NaN values in the counts.
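A minimal sketch of the dropna behavior, using a small hypothetical DataFrame with a missing value:

```python
import pandas as pd

# hypothetical mini DataFrame with a missing publication
df_demo = pd.DataFrame({
    'publication': ['A', 'A', None, 'B'],
    'url': ['u1', 'u2', 'u3', 'u4'],
})

# default dropna=True excludes the NaN group
print(df_demo.value_counts(['publication']))

# dropna=False (Pandas >= 1.3) keeps the NaN group in the counts
print(df_demo.value_counts(['publication'], dropna=False))
```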
To start, here is the syntax that we may apply in order to combine groupby and count in Pandas:
```python
df.groupby(['publication', 'date_m'])['url'].count()
```
The DataFrame used in this article is available from Kaggle. If you'd like to learn more about how to read a Kaggle dataset into a Pandas DataFrame, check this article: How to Search and Download Kaggle Dataset to Pandas DataFrame
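Assuming the dataset has been downloaded locally, loading it might look like the sketch below (the file name is hypothetical - adjust it to the actual Kaggle file):

```python
import pandas as pd

# hypothetical file name - adjust to the downloaded Kaggle file
df = pd.read_csv('medium_articles.csv')
df[['url', 'date', 'publication']].head()
```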
| url | date | publication |
|---|---|---|
| https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92 | 2019-05-30 | Towards Data Science |
| https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8 | 2019-05-30 | Towards Data Science |
| https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129 | 2019-05-30 | Towards Data Science |
| https://towardsdatascience.com/databricks-how-to-save-files-in-csv-on-your-local-computer-3d0c70e6a9ab | 2019-05-30 | Towards Data Science |
| https://towardsdatascience.com/a-step-by-step-implementation-of-gradient-descent-and-backpropagation-d58bda486110 | 2019-05-30 | Towards Data Science |
Step 1: Use groupby() and count() in Pandas
Let's say that we would like to combine groupby() and count() to get the number of rows per group. In this first step we will count the number of URLs per publication per month from the DataFrame above.

First we need to convert the date to month format - YYYY-MM (learn more about it in: Extract Month and Year from DateTime Column in Pandas):
```python
df['date'] = pd.to_datetime(df['date'])
df['date_m'] = df['date'].dt.to_period('M')
```
and then we can group by the two columns 'publication' and 'date_m' and count the URLs per group:
```python
df.groupby(['publication', 'date_m'])['url'].count()
```
This will result in:
```
publication    date_m
Better Humans  2019-03     5
               2019-04     4
               2019-05     4
               2019-06     1
               2019-07     3
                          ..
UX Collective  2019-08    24
               2019-09    33
               2019-10    86
               2019-11    28
               2019-12    46
```
An important note: count() will compute the number of rows in each group, excluding missing values. If you'd like to count distinct values in Pandas, use nunique() - check the linked article.
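For instance, a quick sketch (on the same DataFrame) that counts distinct publications per month rather than rows:

```python
# number of distinct publications per month, not row counts
df.groupby('date_m')['publication'].nunique()
```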
Step 2: groupby(), count() and sum() in Pandas
In Pandas the method groupby will return an object like: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f26bd45da20> - this can be checked with df.groupby(['publication', 'date_m']).

This kind of object has an agg function which can take a list of aggregation methods. This is very useful if we need to check multiple statistical methods - sum(), count(), mean() - per group.
So if we'd like to group by the two columns publication and date_m, and then check the aggregation functions mean, sum and count, we can use:
```python
df.groupby(['publication', 'date_m']).agg(['mean', 'count', 'sum'])
```

(Note: in recent Pandas versions non-numeric columns can no longer be aggregated with mean, so you may need to select the numeric column first: df.groupby(['publication', 'date_m'])['reading_time'].agg(['mean', 'count', 'sum']).)
We will get a MultiIndex as a result (the statistics are computed for the reading_time column):
| publication | date_m | mean | count | sum |
|---|---|---|---|---|
| Better Humans | 2019-03 | 12.000000 | 5 | 60 |
| | 2019-04 | 7.500000 | 4 | 30 |
| | 2019-05 | 25.000000 | 4 | 100 |
| | 2019-06 | 5.000000 | 1 | 5 |
| | 2019-07 | 22.333333 | 3 | 67 |
| | 2019-09 | 12.000000 | 3 | 36 |
| | 2019-10 | 9.750000 | 4 | 39 |
| | 2019-11 | 9.000000 | 2 | 18 |
| | 2019-12 | 9.500000 | 2 | 19 |
| Better Marketing | 2019-03 | 9.600000 | 5 | 48 |
| | 2019-04 | 5.500000 | 4 | 22 |
| | 2019-05 | 6.529412 | 34 | 222 |
| | 2019-06 | 7.000000 | 15 | 105 |
| | 2019-07 | 6.106383 | 47 | 287 |
| | 2019-08 | 5.880000 | 25 | 147 |
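If you prefer flat column names instead of a MultiIndex, a sketch using named aggregation (the output column names below are our own choice):

```python
# named aggregation produces flat, self-documenting column names
df.groupby(['publication', 'date_m']).agg(
    mean_reading=('reading_time', 'mean'),
    n_articles=('url', 'count'),
    total_reading=('reading_time', 'sum'),
)
```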
Step 3: groupby() + count() + sort() = value_counts()
In the latest versions of Pandas (>= 1.1) you can use value_counts in order to achieve behavior similar to groupby and count:
```python
df.value_counts(['publication', 'date_m'])
```
will give us:
```
publication    date_m
The Startup    2019-05    656
               2019-10    477
               2019-07    401
               2019-12    310
               2019-06    286
                          ...
Better Humans  2019-07      3
UX Collective  2019-01      2
Better Humans  2019-12      2
               2019-11      2
               2019-06      1
Length: 79, dtype: int64
```
Note that the code above is equivalent to:

```python
df.groupby(['publication', 'date_m'])['url'].count().sort_values(ascending=False)
```
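value_counts can also return relative frequencies instead of raw counts - a quick sketch:

```python
# share of each (publication, month) group instead of raw counts
df.value_counts(['publication', 'date_m'], normalize=True)
```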
Which solution is better depends on the data and the context. At the end of the post there is a performance comparison of both methods.
Step 4: Combine groupby() and size()
An alternative solution is to use groupby and size in order to count the elements per group in Pandas. The example below demonstrates the usage of size() + groupby():
```python
df.groupby(['publication', 'date_m']).size()
```
The result is a Pandas Series (note that, unlike count(), size() includes rows with NaN values):
```
publication    date_m
Better Humans  2019-03     5
               2019-04     4
               2019-05     4
               2019-06     1
               2019-07     3
                          ..
UX Collective  2019-08    24
               2019-09    33
               2019-10    86
               2019-11    28
               2019-12    46
Length: 79, dtype: int64
```
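To see the difference between count() and size(), a minimal sketch with a hypothetical column containing a missing value:

```python
import pandas as pd

# hypothetical data: group 'a' has one missing value
df_nan = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1.0, None, 2.0]})

print(df_nan.groupby('g')['v'].count())  # a -> 1: NaN is excluded
print(df_nan.groupby('g')['v'].size())   # a -> 2: NaN is included
```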
Step 5: Combine groupby() and describe()
The final option is to use the method describe(). It will return statistical information which can be extremely useful:

```python
df.groupby(['publication', 'date_m'])['reading_time'].describe()
```

This will return additional stats per group:

- count
- mean
- std
- min, max and the 25%, 50%, 75% percentiles
| publication | date_m | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| Better Humans | 2019-03 | 5.0 | 12.000000 | 5.196152 | 3.0 | 13.00 | 13.0 | 15.00 | 16.0 |
| | 2019-04 | 4.0 | 7.500000 | 5.196152 | 2.0 | 4.25 | 7.0 | 10.25 | 14.0 |
| | 2019-05 | 4.0 | 25.000000 | 7.958224 | 17.0 | 21.50 | 23.5 | 27.00 | 36.0 |
| | 2019-06 | 1.0 | 5.000000 | NaN | 5.0 | 5.00 | 5.0 | 5.00 | 5.0 |
| | 2019-07 | 3.0 | 22.333333 | 15.307950 | 13.0 | 13.50 | 14.0 | 27.00 | 40.0 |
| | 2019-09 | 3.0 | 12.000000 | 4.582576 | 8.0 | 9.50 | 11.0 | 14.00 | 17.0 |
| | 2019-10 | 4.0 | 9.750000 | 5.123475 | 3.0 | 7.50 | 10.5 | 12.75 | 15.0 |
| | 2019-11 | 2.0 | 9.000000 | 1.414214 | 8.0 | 8.50 | 9.0 | 9.50 | 10.0 |
| | 2019-12 | 2.0 | 9.500000 | 2.121320 | 8.0 | 8.75 | 9.5 | 10.25 | 11.0 |
| Better Marketing | 2019-03 | 5.0 | 9.600000 | 8.080842 | 5.0 | 6.00 | 6.0 | 7.00 | 24.0 |
| | 2019-04 | 4.0 | 5.500000 | 1.000000 | 5.0 | 5.00 | 5.0 | 5.50 | 7.0 |
| | 2019-05 | 34.0 | 6.529412 | 4.039467 | 2.0 | 5.00 | 5.0 | 7.00 | 20.0 |
| | 2019-06 | 15.0 | 7.000000 | 2.725541 | 3.0 | 5.50 | 6.0 | 8.50 | 13.0 |
| | 2019-07 | 47.0 | 6.106383 | 2.664044 | 3.0 | 4.00 | 5.0 | 7.00 | 13.0 |
| | 2019-08 | 25.0 | 5.880000 | 2.818392 | 3.0 | 4.00 | 5.0 | 6.00 | 15.0 |
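describe() also accepts a percentiles parameter if you need different quantiles - a quick sketch:

```python
# report the 10th and 90th percentiles instead of the default quartiles
df.groupby(['publication', 'date_m'])['reading_time'].describe(percentiles=[0.1, 0.9])
```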
Performance - groupby() & count() vs value_counts()
Finally, let's do a quick comparison of the performance between:

- groupby() and count()
- value_counts()
The next two examples return equivalent results:
```python
df.groupby(['publication', 'date_m'])['publication'].count()
df.value_counts(subset=['publication', 'date_m'], sort=False)
```
Below you can find the timings:

- df.groupby: 3.59 ms ± 24.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
- df.value_counts: 1.05 ms ± 4.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
- df.value_counts with the sort parameter: 1.2 ms ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
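A sketch of how such timings can be reproduced in IPython/Jupyter with the %timeit magic:

```python
# IPython magic; run each line in its own cell to time it
%timeit df.groupby(['publication', 'date_m'])['publication'].count()
%timeit df.value_counts(subset=['publication', 'date_m'], sort=False)
%timeit df.value_counts(subset=['publication', 'date_m'])
```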
Conclusion and Resources
In this post we covered how to use groupby() and count unique rows in Pandas, and how to sort the results of groupby() and count(). We also covered applying groupby() on multiple columns with multiple aggregation methods like sum(), count() and mean(). Finally, we saw how to use value_counts() in order to count unique values and sort the results.
The resources mentioned below will be extremely useful for further analysis: