In this short guide, we'll see how to use groupby() on several columns and count unique rows in Pandas.
Several examples will explain how to group by and apply statistical functions like sum, count, mean etc. Often there is a need to group by a column and then get sum() and count().
It's recommended to use the method df.value_counts for counting the size of groups in Pandas. It's a bit faster and supports the parameter dropna since Pandas 1.3.

Sorting the results of groupby/count or value_counts will slow the process by roughly 20%.

Be careful when counting NaN values. They can change the expected results and counts. For value_counts, use the parameter dropna=False to include NaN values in the counts.
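A minimal sketch of the dropna behavior, using a small hypothetical DataFrame with a missing value:

```python
import pandas as pd

# hypothetical mini DataFrame with a missing publication
df_demo = pd.DataFrame({
    'publication': ['A', 'A', None, 'B'],
    'url': ['u1', 'u2', 'u3', 'u4'],
})

# default dropna=True excludes the NaN group
print(df_demo.value_counts(['publication']))

# dropna=False (Pandas >= 1.3) keeps the NaN group in the counts
print(df_demo.value_counts(['publication'], dropna=False))
```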
To start, here is the syntax that we may apply in order to combine groupby and count in Pandas:
```python
df.groupby(['publication', 'date_m'])['url'].count()
```
The DataFrame used in this article is available from Kaggle. If you'd like to learn more about how to read a Kaggle dataset into a Pandas DataFrame, check this article: How to Search and Download Kaggle Dataset to Pandas DataFrame
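Assuming the dataset has been downloaded locally, loading it might look like the sketch below (the file name is hypothetical - adjust it to the actual Kaggle file):

```python
import pandas as pd

# hypothetical file name - adjust to the downloaded Kaggle file
df = pd.read_csv('medium_articles.csv')
df[['url', 'date', 'publication']].head()
```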
| url | date | publication |
|---|---|---|
| https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92 | 2019-05-30 | Towards Data Science |
| https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8 | 2019-05-30 | Towards Data Science |
| https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129 | 2019-05-30 | Towards Data Science |
| https://towardsdatascience.com/databricks-how-to-save-files-in-csv-on-your-local-computer-3d0c70e6a9ab | 2019-05-30 | Towards Data Science |
| https://towardsdatascience.com/a-step-by-step-implementation-of-gradient-descent-and-backpropagation-d58bda486110 | 2019-05-30 | Towards Data Science |
Step 1: Use groupby() and count() in Pandas
Let's say that we would like to combine groupby() and count() to get the number of rows per group. In this first step we will count the number of URLs per publication per month from the DataFrame above.

First we need to convert the date to month format - YYYY-MM (learn more about it in: Extract Month and Year from DateTime Column in Pandas):
```python
df['date'] = pd.to_datetime(df['date'])
df['date_m'] = df['date'].dt.to_period('M')
```
and then we can group by the two columns 'publication' and 'date_m' and count the URLs per group:
```python
df.groupby(['publication', 'date_m'])['url'].count()
```
This will result in:
```
publication    date_m
Better Humans  2019-03     5
               2019-04     4
               2019-05     4
               2019-06     1
               2019-07     3
                          ..
UX Collective  2019-08    24
               2019-09    33
               2019-10    86
               2019-11    28
               2019-12    46
```
An important note: count() will compute the number of rows in each group, excluding missing values. If you'd like to count distinct values in Pandas, use nunique() - check the linked article.
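For instance, a quick sketch (on the same DataFrame) that counts distinct publications per month rather than rows:

```python
# number of distinct publications per month, not row counts
df.groupby('date_m')['publication'].nunique()
```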
Step 2: groupby(), count() and sum() in Pandas
In Pandas the method groupby will return an object like: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f26bd45da20> - this can be checked with df.groupby(['publication', 'date_m']).

This kind of object has an agg function which can take a list of aggregation methods. This is very useful if we need to check multiple statistical methods - sum(), count(), mean() - per group.
So if we'd like to group by the two columns publication and date_m, and then check the aggregation functions mean, sum and count, we can use:
```python
df.groupby(['publication', 'date_m']).agg(['mean', 'count', 'sum'])
```

(Note: in recent Pandas versions non-numeric columns can no longer be aggregated with mean, so you may need to select the numeric column first: df.groupby(['publication', 'date_m'])['reading_time'].agg(['mean', 'count', 'sum']).)
We will get a MultiIndex as a result (the statistics are computed for the reading_time column):
| publication | date_m | mean | count | sum |
|---|---|---|---|---|
| Better Humans | 2019-03 | 12.000000 | 5 | 60 |
| | 2019-04 | 7.500000 | 4 | 30 |
| | 2019-05 | 25.000000 | 4 | 100 |
| | 2019-06 | 5.000000 | 1 | 5 |
| | 2019-07 | 22.333333 | 3 | 67 |
| | 2019-09 | 12.000000 | 3 | 36 |
| | 2019-10 | 9.750000 | 4 | 39 |
| | 2019-11 | 9.000000 | 2 | 18 |
| | 2019-12 | 9.500000 | 2 | 19 |
| Better Marketing | 2019-03 | 9.600000 | 5 | 48 |
| | 2019-04 | 5.500000 | 4 | 22 |
| | 2019-05 | 6.529412 | 34 | 222 |
| | 2019-06 | 7.000000 | 15 | 105 |
| | 2019-07 | 6.106383 | 47 | 287 |
| | 2019-08 | 5.880000 | 25 | 147 |
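If you prefer flat column names instead of a MultiIndex, a sketch using named aggregation (the output column names below are our own choice):

```python
# named aggregation produces flat, self-documenting column names
df.groupby(['publication', 'date_m']).agg(
    mean_reading=('reading_time', 'mean'),
    n_articles=('url', 'count'),
    total_reading=('reading_time', 'sum'),
)
```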
Step 3: groupby() + count() + sort() = value_counts()
In the latest versions of Pandas (>= 1.1) you can use value_counts in order to achieve behavior similar to groupby and count:
```python
df.value_counts(['publication', 'date_m'])
```
will give us:
```
publication    date_m
The Startup    2019-05    656
               2019-10    477
               2019-07    401
               2019-12    310
               2019-06    286
                          ...
Better Humans  2019-07      3
UX Collective  2019-01      2
Better Humans  2019-12      2
               2019-11      2
               2019-06      1
Length: 79, dtype: int64
```
Note that the code above is equivalent to:

```python
df.groupby(['publication', 'date_m'])['url'].count().sort_values(ascending=False)
```
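value_counts can also return relative frequencies instead of raw counts - a quick sketch:

```python
# share of each (publication, month) group instead of raw counts
df.value_counts(['publication', 'date_m'], normalize=True)
```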
Which solution is better depends on the data and the context. At the end of the post there is a performance comparison of both methods.
Step 4: Combine groupby() and size()
An alternative solution is to use groupby and size in order to count the elements per group in Pandas. The example below demonstrates the usage of size() + groupby():
```python
df.groupby(['publication', 'date_m']).size()
```
The result is a Pandas Series (note that, unlike count(), size() includes rows with NaN values):
```
publication    date_m
Better Humans  2019-03     5
               2019-04     4
               2019-05     4
               2019-06     1
               2019-07     3
                          ..
UX Collective  2019-08    24
               2019-09    33
               2019-10    86
               2019-11    28
               2019-12    46
Length: 79, dtype: int64
```
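To see the difference between count() and size(), a minimal sketch with a hypothetical column containing a missing value:

```python
import pandas as pd

# hypothetical data: group 'a' has one missing value
df_nan = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1.0, None, 2.0]})

print(df_nan.groupby('g')['v'].count())  # a -> 1: NaN is excluded
print(df_nan.groupby('g')['v'].size())   # a -> 2: NaN is included
```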
Step 5: Combine groupby() and describe()
The final option is to use the method describe(). It will return statistical information which can be extremely useful:

```python
df.groupby(['publication', 'date_m'])['reading_time'].describe()
```

This will return additional stats per group:

- count
- mean
- std
- min, max and the 25%, 50%, 75% percentiles
| publication | date_m | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| Better Humans | 2019-03 | 5.0 | 12.000000 | 5.196152 | 3.0 | 13.00 | 13.0 | 15.00 | 16.0 |
| | 2019-04 | 4.0 | 7.500000 | 5.196152 | 2.0 | 4.25 | 7.0 | 10.25 | 14.0 |
| | 2019-05 | 4.0 | 25.000000 | 7.958224 | 17.0 | 21.50 | 23.5 | 27.00 | 36.0 |
| | 2019-06 | 1.0 | 5.000000 | NaN | 5.0 | 5.00 | 5.0 | 5.00 | 5.0 |
| | 2019-07 | 3.0 | 22.333333 | 15.307950 | 13.0 | 13.50 | 14.0 | 27.00 | 40.0 |
| | 2019-09 | 3.0 | 12.000000 | 4.582576 | 8.0 | 9.50 | 11.0 | 14.00 | 17.0 |
| | 2019-10 | 4.0 | 9.750000 | 5.123475 | 3.0 | 7.50 | 10.5 | 12.75 | 15.0 |
| | 2019-11 | 2.0 | 9.000000 | 1.414214 | 8.0 | 8.50 | 9.0 | 9.50 | 10.0 |
| | 2019-12 | 2.0 | 9.500000 | 2.121320 | 8.0 | 8.75 | 9.5 | 10.25 | 11.0 |
| Better Marketing | 2019-03 | 5.0 | 9.600000 | 8.080842 | 5.0 | 6.00 | 6.0 | 7.00 | 24.0 |
| | 2019-04 | 4.0 | 5.500000 | 1.000000 | 5.0 | 5.00 | 5.0 | 5.50 | 7.0 |
| | 2019-05 | 34.0 | 6.529412 | 4.039467 | 2.0 | 5.00 | 5.0 | 7.00 | 20.0 |
| | 2019-06 | 15.0 | 7.000000 | 2.725541 | 3.0 | 5.50 | 6.0 | 8.50 | 13.0 |
| | 2019-07 | 47.0 | 6.106383 | 2.664044 | 3.0 | 4.00 | 5.0 | 7.00 | 13.0 |
| | 2019-08 | 25.0 | 5.880000 | 2.818392 | 3.0 | 4.00 | 5.0 | 6.00 | 15.0 |
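describe() also accepts a percentiles parameter if you need different quantiles - a quick sketch:

```python
# report the 10th and 90th percentiles instead of the default quartiles
df.groupby(['publication', 'date_m'])['reading_time'].describe(percentiles=[0.1, 0.9])
```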
Performance - groupby() & count() vs value_counts()
Finally, let's do a quick comparison of the performance between:

- groupby() and count()
- value_counts()
The next two examples return equivalent results:
```python
df.groupby(['publication', 'date_m'])['publication'].count()
df.value_counts(subset=['publication', 'date_m'], sort=False)
```
Below you can find the timings:

- df.groupby: 3.59 ms ± 24.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
- df.value_counts: 1.05 ms ± 4.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
- df.value_counts with the sort parameter: 1.2 ms ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
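A sketch of how such timings can be reproduced in IPython/Jupyter with the %timeit magic:

```python
# IPython magic; run each line in its own cell to time it
%timeit df.groupby(['publication', 'date_m'])['publication'].count()
%timeit df.value_counts(subset=['publication', 'date_m'], sort=False)
%timeit df.value_counts(subset=['publication', 'date_m'])
```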
Conclusion and Resources
In this post we covered how to use groupby() and count unique rows in Pandas, and how to sort the results of groupby() and count(). We also covered applying groupby() on multiple columns with multiple aggregation methods like sum(), count() and mean(). Finally, we saw how to use value_counts() in order to count unique values and sort the results.
The resources mentioned below will be extremely useful for further analysis: