SQL Equivalent count(distinct) in Pandas - nunique

In this post, we will see how to count distinct values in Pandas.

The SQL's Equivalent of count(distinct) is method nunique(). It will count distinct values per group in Pandas DataFrame:

(1) Pandas method nunique() and `groupby()

df.groupby('Magnitude Type')['Date'].nunique()

(2) Pandas count distinct - nunique() + agg()

df.groupby(['Magnitude Type'])['Date'].agg(['count', 'nunique'])

The above solution will group by column 'Magnitude Type' and will count all unique values for column 'Date'.

If you need to learn more about Pandas and SQL you can visit this cheat sheet: Pandas vs SQL Cheat Sheet.

Let's cover the above in more detail. Suppose we have DataFrame like:

Date Latitude Longitude Depth Magnitude Type
01/02/1965 19.246 145.616 131.6 MW
01/04/1965 1.863 127.352 80.0 MW
01/05/1965 -20.579 -173.972 20.0 MW
01/08/1965 -59.076 -23.557 15.0 MW
01/09/1965 11.938 126.427 15.0 MW

Step 1: nunique() and groupby() - SQL's equivalent to count(distinct) in Pandas

If we like to group by a column and then get unique values for another column in Pandas you can use the method: .nunique().

The syntax is simple:

df.groupby('Magnitude Type')['Date'].nunique()

the result is similar to the SQL query:

SELECT count(distinct Date) FROM earthquakes GROUP BY `Magnitude Type`;
Magnitude Type
MB     2474
MD        5
MH        4
ML       60
MS     1316
MW     4665
MWB    2036
MWC    3458
MWR      18
MWW    1256
Name: Date, dtype: int64
× Pro Tip 1
Method value_counts will return the total values and not the unique ones.
df['Magnitude Type'].value_counts(dropna=False)

result:

MW     7722
MWC    5669
MB     3761
MWB    2458
MWW    1983
MS     1702
ML       77
MWR      26
MD        6
MH        5
NaN       3
Name: Magnitude Type, dtype: int64
× Pro Tip 2
Be careful for NaN values. They can make difference in the numbers.

Step 2: nunique() and agg() in Pandas

If we like to get all unique values in each group plus information like:

  • min
  • max
  • count
  • mean

You can modify the upper query a bit:

df.groupby(['Magnitude Type'])['Date'].agg(['count', 'nunique'])

This will result into:

count nunique min
Magnitude Type
MB 3761 2474 01/01/1975
MD 6 5 02/28/2001
MH 5 4 02/09/1971
ML 77 60 01/03/1976
MS 1702 1316 01/01/1973
MW 7722 4665 01/01/1967
MWB 2458 2036 01/01/1995
MWC 5669 3458 01/01/1997
MWR 26 18 02/13/2012
MWW 1983 1256 01/01/2011

As you can see for Magnitude Type = MH there are 4 unique Dates but 5 records:

Date Latitude Longitude Depth Magnitude Type
1848 02/09/1971 34.416000 -118.370000 6.000 MH
1849 02/09/1971 34.416000 -118.370000 6.000 MH
9664 10/18/1989 37.036167 -121.879833 17.214 MH
10584 08/17/1991 41.679000 -125.856000 1.303 MH
10869 04/25/1992 40.335333 -124.228667 9.856 MH

Because there are 2 records for date: 02/09/1971.

Step 3: Pandas count distinct multiple columns

To count number of unique values for multiple columns in Pandas there are two options:

  • Combine agg() + nunique()
  • Pandas method crosstab()

Count distinct with agg() + nunique()

First one is using the same approach as above:

df.groupby('Magnitude Type').agg({'Date': ['nunique', 'count'],
                                      'Depth': ['nunique', 'count']})

result:

Date Depth
nunique count nunique count
Magnitude Type
MB 2474 3761 911 3761
MD 5 6 6 6
MH 4 5 4 5
ML 60 77 67 77
MS 1316 1702 206 1702

Count distinct in Pandas with crosstab()

Usually Pandas provide multiple ways of achieving something. In this example we are going to use the method crosstab.

It might be slower than the other but you may have additional useful information for more insights.

So first let's do a simple demo of crosstab:

pd.crosstab(df['Magnitude Type'], df['Date'])

This will give a table which is showing information for each Date and Magnitude Type:

Date 01/01/1967 01/01/1969 01/01/1970 01/01/1971 01/01/1972
Magnitude Type
MW 2 1 1 1 1
MWB 0 0 0 0 0
MWC 0 0 0 0 0
MWR 0 0 0 0 0
MWW 0 0 0 0 0

And then we can summarize the information into unique values by using .ne(0).sum(1) :

pd.crosstab(df['Magnitude Type'], df['Date']).ne(0).sum(1)

result:

Magnitude Type
MB     2474
MD        5
MH        4
ML       60
MS     1316
MW     4665
MWB    2036
MWC    3458
MWR      18
MWW    1256
dtype: int64
× How does it work?
Method ne(0) will convert anything different from 0 to True and 0 to False. Then we are going to sum across rows.

Step 4: Count distinct with condition in Pandas

Finally we have an option of using a lambda in order to count the unique values with condition.

This will be useful if you like to apply conditions on the count - for example excluding some records.

df.groupby('Magnitude Type')['Date'].apply(lambda x: x.unique().shape[0])

For example example excluding groups smaller than 1000:

df.groupby('Magnitude Type')['Date'].apply(lambda x: x.unique().shape[0] if x.unique().shape[0] > 1000 else None).dropna()

Resources