In this post, we will see how to count distinct values in Pandas.
The SQL equivalent of count(distinct) in Pandas is the method nunique(). It counts distinct values per group in a Pandas DataFrame:

(1) Pandas method nunique() and groupby()

df.groupby('Magnitude Type')['Date'].nunique()

(2) Pandas count distinct - nunique() + agg()

df.groupby(['Magnitude Type'])['Date'].agg(['count', 'nunique'])

Both solutions group by the column 'Magnitude Type' and count the unique values of the column 'Date'.
If you need to learn more about Pandas and SQL you can visit this cheat sheet: Pandas vs SQL Cheat Sheet.
Let's cover the above in more detail. Suppose we have a DataFrame like:
Date | Latitude | Longitude | Depth | Magnitude Type |
---|---|---|---|---|
01/02/1965 | 19.246 | 145.616 | 131.6 | MW |
01/04/1965 | 1.863 | 127.352 | 80.0 | MW |
01/05/1965 | -20.579 | -173.972 | 20.0 | MW |
01/08/1965 | -59.076 | -23.557 | 15.0 | MW |
01/09/1965 | 11.938 | 126.427 | 15.0 | MW |
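If you want to follow along, a few of these rows can be recreated directly (a sketch - the counts later in the post come from the full earthquake dataset, so they won't match this small sample):

import pandas as pd

# small illustrative sample of the earthquake data shown above
df = pd.DataFrame({
    'Date': ['01/02/1965', '01/04/1965', '01/05/1965', '01/08/1965', '01/09/1965'],
    'Latitude': [19.246, 1.863, -20.579, -59.076, 11.938],
    'Longitude': [145.616, 127.352, -173.972, -23.557, 126.427],
    'Depth': [131.6, 80.0, 20.0, 15.0, 15.0],
    'Magnitude Type': ['MW'] * 5,
})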
Step 1: nunique() and groupby() - SQL's equivalent of count(distinct) in Pandas
If we want to group by one column and then get the unique values of another column in Pandas, we can use the method .nunique().
The syntax is simple:
df.groupby('Magnitude Type')['Date'].nunique()
The result is similar to the SQL query:
SELECT count(distinct Date) FROM earthquakes GROUP BY `Magnitude Type`;
Magnitude Type
MB 2474
MD 5
MH 4
ML 60
MS 1316
MW 4665
MWB 2036
MWC 3458
MWR 18
MWW 1256
Name: Date, dtype: int64
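If you prefer a result that looks more like the SQL result set - one column for the group and one for the distinct count - one option is to reset the index and name the count column (the name distinct_dates below is just illustrative):

# turn the grouped Series into a DataFrame similar to the SQL result set
df.groupby('Magnitude Type')['Date'].nunique().reset_index(name='distinct_dates')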
Note that the method value_counts() returns the total number of records per value, not the number of unique ones:
df['Magnitude Type'].value_counts(dropna=False)
result:
MW 7722
MWC 5669
MB 3761
MWB 2458
MWW 1983
MS 1702
ML 77
MWR 26
MD 6
MH 5
NaN 3
Name: Magnitude Type, dtype: int64
Be careful with NaN values - they can make a difference in the numbers.
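Note that nunique() has a dropna parameter (True by default), so NaN values are excluded from the distinct count unless you ask for them explicitly. A small sketch:

# count NaN dates as their own distinct value
df.groupby('Magnitude Type')['Date'].nunique(dropna=False)

# keep rows with a missing 'Magnitude Type' as a separate group (pandas >= 1.1)
df.groupby('Magnitude Type', dropna=False)['Date'].nunique()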
Step 2: nunique() and agg() in Pandas
If we want to get the number of unique values in each group plus additional information like:

- min
- max
- count
- mean

you can modify the query above a bit:
df.groupby(['Magnitude Type'])['Date'].agg(['count', 'nunique', 'min'])
This will result in:
Magnitude Type | count | nunique | min |
---|---|---|---|
MB | 3761 | 2474 | 01/01/1975 |
MD | 6 | 5 | 02/28/2001 |
MH | 5 | 4 | 02/09/1971 |
ML | 77 | 60 | 01/03/1976 |
MS | 1702 | 1316 | 01/01/1973 |
MW | 7722 | 4665 | 01/01/1967 |
MWB | 2458 | 2036 | 01/01/1995 |
MWC | 5669 | 3458 | 01/01/1997 |
MWR | 26 | 18 | 02/13/2012 |
MWW | 1983 | 1256 | 01/01/2011 |
As you can see, for Magnitude Type = MH there are 4 unique dates but 5 records:
 | Date | Latitude | Longitude | Depth | Magnitude Type |
---|---|---|---|---|---|
1848 | 02/09/1971 | 34.416000 | -118.370000 | 6.000 | MH |
1849 | 02/09/1971 | 34.416000 | -118.370000 | 6.000 | MH |
9664 | 10/18/1989 | 37.036167 | -121.879833 | 17.214 | MH |
10584 | 08/17/1991 | 41.679000 | -125.856000 | 1.303 | MH |
10869 | 04/25/1992 | 40.335333 | -124.228667 | 9.856 | MH |
This is because there are 2 records for the date 02/09/1971.
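If you want to see the duplicated dates behind such numbers, a simple filter is enough (a sketch using duplicated()):

# all records for the MH magnitude type
mh = df[df['Magnitude Type'] == 'MH']

# rows whose Date appears more than once inside the group
mh[mh.duplicated(subset=['Date'], keep=False)]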
Step 3: Pandas count distinct for multiple columns
To count the number of unique values for multiple columns in Pandas there are two options:

- Combine agg() + nunique()
- Pandas method crosstab()
Count distinct with agg() + nunique()
The first one uses the same approach as above:
df.groupby('Magnitude Type').agg({'Date': ['nunique', 'count'],
'Depth': ['nunique', 'count']})
result:
Magnitude Type | Date nunique | Date count | Depth nunique | Depth count |
---|---|---|---|---|
MB | 2474 | 3761 | 911 | 3761 |
MD | 5 | 6 | 6 | 6 |
MH | 4 | 5 | 4 | 5 |
ML | 60 | 77 | 67 | 77 |
MS | 1316 | 1702 | 206 | 1702 |
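If you prefer flat column names instead of the two-level header, a named aggregation is an alternative (the output names like date_nunique below are just illustrative):

# named aggregation produces flat, readable column names
df.groupby('Magnitude Type').agg(
    date_nunique=('Date', 'nunique'),
    date_count=('Date', 'count'),
    depth_nunique=('Depth', 'nunique'),
    depth_count=('Depth', 'count'),
)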
Count distinct in Pandas with crosstab()
Usually Pandas provides multiple ways of achieving the same result. In this example we are going to use the method crosstab(). It might be slower than the previous one, but it gives additional useful information for more insights.
So first let's do a simple demo of crosstab():
pd.crosstab(df['Magnitude Type'], df['Date'])
This will give a table showing the number of records for each Date and Magnitude Type combination:
Magnitude Type | 01/01/1967 | 01/01/1969 | 01/01/1970 | 01/01/1971 | 01/01/1972 |
---|---|---|---|---|---|
MW | 2 | 1 | 1 | 1 | 1 |
MWB | 0 | 0 | 0 | 0 | 0 |
MWC | 0 | 0 | 0 | 0 | 0 |
MWR | 0 | 0 | 0 | 0 | 0 |
MWW | 0 | 0 | 0 | 0 | 0 |
And then we can summarize the information into unique counts per group by using .ne(0).sum(1):
pd.crosstab(df['Magnitude Type'], df['Date']).ne(0).sum(1)
result:
Magnitude Type
MB 2474
MD 5
MH 4
ML 60
MS 1316
MW 4665
MWB 2036
MWC 3458
MWR 18
MWW 1256
dtype: int64
The method ne(0) converts anything different from 0 to True and 0 to False. Then we sum across the rows.
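As a quick sanity check, the crosstab-based counts should match the groupby() + nunique() result (a sketch, assuming there are no NaN dates):

by_crosstab = pd.crosstab(df['Magnitude Type'], df['Date']).ne(0).sum(axis=1)
by_nunique = df.groupby('Magnitude Type')['Date'].nunique()

# both approaches should give the same distinct counts per group
print(by_crosstab.equals(by_nunique))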
Step 4: Count distinct with condition in Pandas
Finally, we have the option of using a lambda in order to count the unique values with a condition.
This is useful if you need to apply conditions on the count - for example excluding some groups.
df.groupby('Magnitude Type')['Date'].apply(lambda x: x.unique().shape[0])
For example, excluding groups with fewer than 1000 unique dates:
df.groupby('Magnitude Type')['Date'].apply(lambda x: x.unique().shape[0] if x.unique().shape[0] > 1000 else None).dropna()
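An alternative to the lambda is to compute the distinct counts first and then filter the resulting Series with a boolean condition - a sketch:

# distinct dates per group
counts = df.groupby('Magnitude Type')['Date'].nunique()

# keep only the groups with more than 1000 distinct dates
counts[counts > 1000]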