How to Replace Values in Column Based On Another DataFrame in Pandas

In this quick tutorial, we'll cover how we can replace values in a column based on values from another DataFrame in Pandas.

Mapping the values from another DataFrame, depends on several factors like:

Index matching
Update only NaN values, add new column or replace everything

In this article, we are going to answer on all questions in a different steps.

Step 1: Create sample DataFrame

For this article we are going to use data from Kaggle: How to Search and Download Kaggle Dataset to Pandas DataFrame

Reading the initial data:

import pandas as pd

df1 = pd.read_csv(f'../data/earthquakes_1965_2016_database.csv.zip')

Next we are going to create a second DataFrame. We are going to replace the values from the first to the second one.

The second DataFrame will be created from the values of the first for one - Latitude and Longitude and returning the country.

import geocoder

def geo_rev(x):
    g = geocoder.osm([x.Latitude, x.Longitude], method='reverse').json
    if g:
        return g.get('country')
    else:
        return 'no country'

df2 = pd.DataFrame({'country': df1[['Latitude', 'Longitude']].apply(geo_rev, axis=1)})

So far we have two DataFrames:

df1

	Date	Latitude	Longitude	Depth	ID
23405	12/27/2016	45.7192	26.5230	97.0	US10007N3R
23406	12/28/2016	38.3754	-118.8977	10.8	NN00570709
23407	12/28/2016	38.3917	-118.8941	12.3	NN00570710
23408	12/28/2016	38.3777	-118.8957	8.8	NN00570744
23409	12/28/2016	36.9179	140.4262	10.0	US10007NAF

df2

	country
23405	România
23406	United States
23407	United States
23408	United States
23409	日本

Step 2: Replace Values with matching indices

To add a single column - ID we can use method loc in order to set the values from DataFrame df1 to df2:

df2.loc[:, ['ID']] = df1[['ID']]

This is possible because both DataFrames have identical indices and shapes. Otherwise error or unexpected results might happen.

If you need to replace values for multiple columns from another DataFrame - this is the syntax:

df2.loc[:, ['Latitude', 'Longitude']] = df1[['Latitude', 'Longitude']]

The two columns are added from df1 to df2:

	country	ID	Latitude	Longitude
23405	România	US10007N3R	45.7192	26.5230
23406	United States	NN00570709	38.3754	-118.8977
23407	United States	NN00570710	38.3917	-118.8941
23408	United States	NN00570744	38.3777	-118.8957
23409	日本	US10007NAF	36.9179	140.4262

Step 3: Replace Values with non matching indices

What will happen if the indexes do not match? Let's create one more DataFrame df3 which is a copy of 2.

For this new DataFrame we are going to reset the index by:

df3 = df2.head(7).reset_index().copy()

If we try to use the same technique for setting values from another DataFrame we will get only NaN values:

df3.loc[:, ['Latitude', 'Longitude']] = df1[['Latitude', 'Longitude']]
df3

because the indices doesn't match:

	index	country	ID	Latitude	Longitude
0	23405	România	US10007N3R	NaN	NaN
1	23406	United States	NN00570709	NaN	NaN
2	23407	United States	NN00570710	NaN	NaN
3	23408	United States	NN00570744	NaN	NaN
4	23409	日本	US10007NAF	NaN	NaN

In order to make it work we need to modify the code. We are going to use column ID as a reference between the two DataFrames.

Two columns 'Latitude', 'Longitude' will be set from DataFrame df1 to df2.

So to replace values from another DataFrame when different indices we can use:

col = 'ID'
cols_to_replace = ['Latitude', 'Longitude']
df3.loc[df3[col].isin(df1[col]), cols_to_replace] = df1.loc[df1[col].isin(df3[col]),cols_to_replace].values

Now the values are correctly set:

	country	ID	Latitude	Longitude
23405	România	US10007N3R	45.7192	26.5230
23406	United States	NN00570709	38.3754	-118.8977
23407	United States	NN00570710	38.3917	-118.8941
23408	United States	NN00570744	38.3777	-118.8957
23409	日本	US10007NAF	36.9179	140.4262

Step 4: Insert new column with values from another DataFrame by merge

You can use Pandas merge function in order to get values and columns from another DataFrame. For this purpose you will need to have reference column between both DataFrames or use the index.

In this example we are going to use reference column ID - we will merge df1 left join on df4. It's important to mention two points:

ID - should be unique value
the column which is updated should not exists in the first DataFrame - df4

df4 = df4.merge(df1,on='ID',how="left")

the result is:

country	ID	Date	Time	Latitude	Longitude	Depth
România	US10007N3R	12/27/2016	23:20:56	45.7192	26.5230	97.0
United States	NN00570709	12/28/2016	08:18:01	38.3754	-118.8977	10.8

So all columns which has a match on the ID - this column is unique per row - will get corresponding data from df1

Step 5: Update missing Values from Another DataFrame

In this step we are going to update only missing values in a column in one DataFrame from another.

So let's have next DataFrame:

import numpy as np
df5.loc[:, ['ID', 'Longitude']] = df1[['ID', 'Longitude']]


df5.iloc[[1,3,4], -1] = np.NaN
df5.head(5)

with few missing values in column Longitude:

	country	ID	Longitude
23405	România	US10007N3R	26.5230
23406	United States	NN00570709	NaN
23407	United States	NN00570710	-118.8941
23408	United States	NN00570744	NaN
23409	日本	US10007NAF	NaN

To update the values in df5 from df1 we can merge both DataFrames on a reference column - ID (should be unique).

Then we are going to fill all missing values with the values from the column Longitude_x of df1 to the new column of df5 - Longitude_y.

Next we are going to drop the column from df1 and rename the new column:

df5 = df5.merge(df1[['Longitude', 'ID']],on='ID',how="left")

df5['Longitude_y'] = df5['Longitude_y'].fillna(df5['Longitude_x'])

df5.drop(["Longitude_x"], inplace=True, axis=1)
df5.rename(columns={'Longitude_y':'Longitude'},inplace=True)

result:

	country	ID	Longitude
0	România	US10007N3R	26.5230
1	United States	NN00570709	-118.8977
2	United States	NN00570710	-118.8941
3	United States	NN00570744	-118.8957
4	日本	US10007NAF	140.4262

Step 6: Update column from another column with np.where

Finally let's cover how column can be added or updated from another column or DataFrame with np.where.

First we will check if one column contains a value and create another column:

import numpy as np
df1['new_lon'] = np.where(df1['Longitude']>10,df1['Longitude'],np.nan)

If we have two DataFrames we can use similar syntax as follow:

df1['name'] = np.where(df2['Longitude']==1,df2['name'],df1['name'])

What is going on here is the following:

check if DataFrame df2 contains rows with value 1
all the rest will be taken from df1
create new column in df1 with the result from previous np.where

Note that the two DataFrames should have the same number of rows.