How to Replace Values in Column Based On Another DataFrame in Pandas
In this quick tutorial, we'll cover how we can replace values in a column based on values from another DataFrame in Pandas.
Mapping the values from another DataFrame, depends on several factors like:
- Index matching
- Update only NaN values, add new column or replace everything
In this article, we are going to answer on all questions in a different steps.
Step 1: Create sample DataFrame
For this article we are going to use data from Kaggle: How to Search and Download Kaggle Dataset to Pandas DataFrame
Reading the initial data:
import pandas as pd
df1 = pd.read_csv(f'../data/earthquakes_1965_2016_database.csv.zip')
Next we are going to create a second DataFrame. We are going to replace the values from the first to the second one.
The second DataFrame will be created from the values of the first for one - Latitude
and Longitude
and returning the country.
import geocoder
def geo_rev(x):
g = geocoder.osm([x.Latitude, x.Longitude], method='reverse').json
if g:
return g.get('country')
else:
return 'no country'
df2 = pd.DataFrame({'country': df1[['Latitude', 'Longitude']].apply(geo_rev, axis=1)})
So far we have two DataFrames:
df1
Date | Latitude | Longitude | Depth | ID | |
---|---|---|---|---|---|
23405 | 12/27/2016 | 45.7192 | 26.5230 | 97.0 | US10007N3R |
23406 | 12/28/2016 | 38.3754 | -118.8977 | 10.8 | NN00570709 |
23407 | 12/28/2016 | 38.3917 | -118.8941 | 12.3 | NN00570710 |
23408 | 12/28/2016 | 38.3777 | -118.8957 | 8.8 | NN00570744 |
23409 | 12/28/2016 | 36.9179 | 140.4262 | 10.0 | US10007NAF |
df2
country | |
---|---|
23405 | România |
23406 | United States |
23407 | United States |
23408 | United States |
23409 | 日本 |
Step 2: Replace Values with matching indices
To add a single column - ID
we can use method loc
in order to set the values from DataFrame df1
to df2
:
df2.loc[:, ['ID']] = df1[['ID']]
This is possible because both DataFrames have identical indices and shapes. Otherwise error or unexpected results might happen.
If you need to replace values for multiple columns from another DataFrame - this is the syntax:
df2.loc[:, ['Latitude', 'Longitude']] = df1[['Latitude', 'Longitude']]
The two columns are added from df1
to df2
:
country | ID | Latitude | Longitude | |
---|---|---|---|---|
23405 | România | US10007N3R | 45.7192 | 26.5230 |
23406 | United States | NN00570709 | 38.3754 | -118.8977 |
23407 | United States | NN00570710 | 38.3917 | -118.8941 |
23408 | United States | NN00570744 | 38.3777 | -118.8957 |
23409 | 日本 | US10007NAF | 36.9179 | 140.4262 |
Step 3: Replace Values with non matching indices
What will happen if the indexes do not match? Let's create one more DataFrame df3
which is a copy of 2.
For this new DataFrame we are going to reset the index by:
df3 = df2.head(7).reset_index().copy()
If we try to use the same technique for setting values from another DataFrame we will get only NaN
values:
df3.loc[:, ['Latitude', 'Longitude']] = df1[['Latitude', 'Longitude']]
df3
because the indices doesn't match:
index | country | ID | Latitude | Longitude | |
---|---|---|---|---|---|
0 | 23405 | România | US10007N3R | NaN | NaN |
1 | 23406 | United States | NN00570709 | NaN | NaN |
2 | 23407 | United States | NN00570710 | NaN | NaN |
3 | 23408 | United States | NN00570744 | NaN | NaN |
4 | 23409 | 日本 | US10007NAF | NaN | NaN |
In order to make it work we need to modify the code. We are going to use column ID
as a reference between the two DataFrames.
Two columns 'Latitude', 'Longitude'
will be set from DataFrame df1
to df2
.
So to replace values from another DataFrame when different indices we can use:
col = 'ID'
cols_to_replace = ['Latitude', 'Longitude']
df3.loc[df3[col].isin(df1[col]), cols_to_replace] = df1.loc[df1[col].isin(df3[col]),cols_to_replace].values
Now the values are correctly set:
country | ID | Latitude | Longitude | |
---|---|---|---|---|
23405 | România | US10007N3R | 45.7192 | 26.5230 |
23406 | United States | NN00570709 | 38.3754 | -118.8977 |
23407 | United States | NN00570710 | 38.3917 | -118.8941 |
23408 | United States | NN00570744 | 38.3777 | -118.8957 |
23409 | 日本 | US10007NAF | 36.9179 | 140.4262 |
Step 4: Insert new column with values from another DataFrame by merge
You can use Pandas merge
function in order to get values and columns from another DataFrame. For this purpose you will need to have reference column between both DataFrames or use the index.
In this example we are going to use reference column ID
- we will merge df1
left join on df4
. It's important to mention two points:
ID
- should be unique value- the column which is updated should not exists in the first DataFrame -
df4
df4 = df4.merge(df1,on='ID',how="left")
the result is:
country | ID | Date | Time | Latitude | Longitude | Depth |
---|---|---|---|---|---|---|
România | US10007N3R | 12/27/2016 | 23:20:56 | 45.7192 | 26.5230 | 97.0 |
United States | NN00570709 | 12/28/2016 | 08:18:01 | 38.3754 | -118.8977 | 10.8 |
So all columns which has a match on the ID
- this column is unique per row - will get corresponding data from df1
Step 5: Update missing Values from Another DataFrame
In this step we are going to update only missing values in a column in one DataFrame from another.
So let's have next DataFrame:
import numpy as np
df5.loc[:, ['ID', 'Longitude']] = df1[['ID', 'Longitude']]
df5.iloc[[1,3,4], -1] = np.NaN
df5.head(5)
with few missing values in column Longitude
:
country | ID | Longitude | |
---|---|---|---|
23405 | România | US10007N3R | 26.5230 |
23406 | United States | NN00570709 | NaN |
23407 | United States | NN00570710 | -118.8941 |
23408 | United States | NN00570744 | NaN |
23409 | 日本 | US10007NAF | NaN |
To update the values in df5
from df1
we can merge both DataFrames on a reference column - ID
(should be unique).
Then we are going to fill all missing values with the values from the column Longitude_x
of df1
to the new column of df5
- Longitude_y
.
Next we are going to drop the column from df1
and rename the new column:
df5 = df5.merge(df1[['Longitude', 'ID']],on='ID',how="left")
df5['Longitude_y'] = df5['Longitude_y'].fillna(df5['Longitude_x'])
df5.drop(["Longitude_x"], inplace=True, axis=1)
df5.rename(columns={'Longitude_y':'Longitude'},inplace=True)
result:
country | ID | Longitude | |
---|---|---|---|
0 | România | US10007N3R | 26.5230 |
1 | United States | NN00570709 | -118.8977 |
2 | United States | NN00570710 | -118.8941 |
3 | United States | NN00570744 | -118.8957 |
4 | 日本 | US10007NAF | 140.4262 |
Step 6: Update column from another column with np.where
Finally let's cover how column can be added or updated from another column or DataFrame with np.where
.
First we will check if one column contains a value and create another column:
import numpy as np
df1['new_lon'] = np.where(df1['Longitude']>10,df1['Longitude'],np.nan)
If we have two DataFrames we can use similar syntax as follow:
df1['name'] = np.where(df2['Longitude']==1,df2['name'],df1['name'])
What is going on here is the following:
- check if DataFrame
df2
contains rows with value 1 - all the rest will be taken from
df1
- create new column in
df1
with the result from previousnp.where
Note that the two DataFrames should have the same number of rows.