In this tutorial, we'll learn how to select non null values in Pandas. You can easily find also the number of non NaN or NaN values in column or multiple columns.
In the next section we will cover all the steps in a real world example.
To learn more about the NaN values in Pandas you can check also:
How to Get First Non-NaN Value Per Row in Pandas.
Setup
Let's start with sample data and the problem for this tutorial.
DataFrame with NaN values
We have different items on each row. The columns represent different topics. One item can have zero or multiple topics
Our goal is to get the first or last topic for each item.
import pandas as pd
details = {
'topic_1': {'item_1': 1, 'item_2': 0, 'item_3': 0, 'item_4': 0, 'item_5': 1},
'topic_2': {'item_1': 0, 'item_2': 1, 'item_3': 1, 'item_4': 0, 'item_5': 0},
'topic_3': {'item_1': 1, 'item_2': 0, 'item_3': 0, 'item_4': 0, 'item_5': 0},
'topic_4': {'item_1': 0, 'item_2': 0, 'item_3': 1, 'item_4': 0, 'item_5': 0}
}
df = pd.DataFrame(details)
result:
topic_1 | topic_2 | topic_3 | topic_4 | |
---|---|---|---|---|
item_1 | 1 | 0 | 1 | 0 |
item_2 | 0 | 1 | 0 | 0 |
item_3 | 0 | 1 | 0 | 1 |
item_4 | 0 | 0 | 0 | 0 |
item_5 | 1 | 0 | 0 | 0 |
Intro - Get Name of First Non NaN Column
Can you extract a topic for each row?
The problem is to identify the first or last topic from multiple for each item.
Expected result:
topic_1 | topic_2 | topic_3 | topic_4 | category | |
---|---|---|---|---|---|
item_1 | 1 | 0 | 1 | 0 | topic_3 |
item_2 | 0 | 1 | 0 | 0 | topic_2 |
item_3 | 0 | 1 | 0 | 1 | topic_4 |
item_4 | 0 | 0 | 0 | 0 | other |
item_5 | 1 | 0 | 0 | 0 | topic_1 |
Step 1: Get Column name for First non NaN Column
Let's start by getting the first topic per row/item from a list of columns.
First we will identify a list of columns which are going to be used.
We have two options:
- get the columns by
df.columns
- or write them explicitly -
categories = reversed(['topic_1', 'topic_2', 'topic_3', 'topic_4'])
We are going to use simple algorithm to get values from the multiple columns:
- create new column with default value of 'other'
- iterate over all rows
- iterate over all categories
- map the values to the topic name
- fill the missing values from the last values of the new column - this step is needed in order to replace non matched values which will be assigned with NaNs.
The code below is getting the first non NaN values from each column:
categories = reversed(['topic_1', 'topic_2', 'topic_3', 'topic_4'])
df['category'] = 'other'
for ix, row in df.iterrows():
for cat in categories:
d = {1: cat}
df['category'] = df[cat].map(d).fillna(df['category'])
To get the last non NaN value we can change the list order by removing reversed
.
Step 2: Get columns with NaN values Pandas - explanation
In this step we will describe how the main part of the code is working.
The main part has two important functions:
- pandas.Series.map - maps a dict to a column and returns all found values. Otherwise returns NaN. You can learn more in this article: How to Map Column with Dictionary in Pandas
- pandas.DataFrame.fillna - fills all missing values with the provided value or iterable(in this case)
So the idea is to map all values from the current column and with next columns to fill the new values while keeping the ones which are set.
Conclusion
In this short article we saw how to combine multiple Pandas functions in order to achieve complex logic - get first/last column from multiple which has non NaN value.
Using the described technique you can combine multiple columns to a single one based on their values.