In this quick tutorial, we'll cover how we can replace values in a column based on values from another DataFrame in Pandas.

Mapping the values from another DataFrame depends on several factors, like:

  • Whether the indices of the two DataFrames match
  • Whether to update only NaN values, add a new column, or replace everything

In this article, we are going to answer all of these questions in separate steps.

Step 1: Create sample DataFrame

For this article we are going to use data from Kaggle: How to Search and Download Kaggle Dataset to Pandas DataFrame

Reading the initial data:

import pandas as pd

df1 = pd.read_csv('../data/earthquakes_1965_2016_database.csv.zip')

Next we are going to create a second DataFrame. Later we will copy values from the first DataFrame into the second one.

The second DataFrame is built from two columns of the first one, Latitude and Longitude, by reverse geocoding each pair of coordinates to a country.

import geocoder

def geo_rev(x):
    g = geocoder.osm([x.Latitude, x.Longitude], method='reverse').json
    if g:
        return g.get('country')
    else:
        return 'no country'

df2 = pd.DataFrame({'country': df1[['Latitude', 'Longitude']].apply(geo_rev, axis=1)})

So far we have two DataFrames:

  • df1
       Date        Latitude  Longitude  Depth  ID
23405  12/27/2016  45.7192   26.5230    97.0   US10007N3R
23406  12/28/2016  38.3754   -118.8977  10.8   NN00570709
23407  12/28/2016  38.3917   -118.8941  12.3   NN00570710
23408  12/28/2016  38.3777   -118.8957  8.8    NN00570744
23409  12/28/2016  36.9179   140.4262   10.0   US10007NAF
  • df2
       country
23405  România
23406  United States
23407  United States
23408  United States
23409  日本

Step 2: Replace Values with matching indices

To add a single column, ID, we can use the method loc to set the values from DataFrame df1 to df2:

df2.loc[:, ['ID']] = df1[['ID']]

This is possible because both DataFrames have identical indices and shapes. Otherwise errors or unexpected results, such as NaN values, might occur.
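To see why matching indices matter, here is a minimal sketch on toy data. The frames left and right are hypothetical stand-ins for df2 and df1, and plain column assignment is used for brevity. Pandas aligns the assigned values on index labels, not on row position:

```python
import pandas as pd

# Hypothetical toy frames (not the earthquake data): same labels, different row order.
left = pd.DataFrame({"country": ["RO", "US", "JP"]}, index=[10, 11, 12])
right = pd.DataFrame({"ID": ["a", "b", "c"]}, index=[12, 11, 10])

# The assignment matches rows by index label, so the order of `right` does not matter.
left["ID"] = right["ID"]

print(left)
```

Row 10 of left receives the value stored under label 10 in right, even though that label sits in the last row there.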

To replace values in multiple columns from another DataFrame, use the same syntax with a list of columns:

df2.loc[:, ['Latitude', 'Longitude']] = df1[['Latitude', 'Longitude']]

The two columns are added from df1 to df2:

       country        ID          Latitude  Longitude
23405  România        US10007N3R  45.7192   26.5230
23406  United States  NN00570709  38.3754   -118.8977
23407  United States  NN00570710  38.3917   -118.8941
23408  United States  NN00570744  38.3777   -118.8957
23409  日本           US10007NAF  36.9179   140.4262
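When the target columns already exist and the indices match, DataFrame.update is another option. The sketch below uses hypothetical toy frames named target and source. It overwrites values in place with the non-NA values of the other DataFrame, aligned on index and column labels:

```python
import pandas as pd

# Hypothetical toy frames: `source` covers only two of the three rows in `target`.
target = pd.DataFrame(
    {"Latitude": [0.0, 0.0, 0.0], "Longitude": [0.0, 0.0, 0.0]},
    index=[10, 11, 12],
)
source = pd.DataFrame(
    {"Latitude": [45.7, 38.4], "Longitude": [26.5, -118.9]},
    index=[10, 12],
)

# update() works in place and returns None; rows missing from `source` are left untouched.
target.update(source)

print(target)
```

Row 11 keeps its original values because source has no entry for that label.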

Step 3: Replace Values with non matching indices

What will happen if the indexes do not match? Let's create one more DataFrame, df3, which is a copy of the first rows of df2.

For this new DataFrame we are going to reset the index by:

df3 = df2.head(7).reset_index().copy()

If we try to use the same technique for setting values from another DataFrame we will get only NaN values:

df3.loc[:, ['Latitude', 'Longitude']] = df1[['Latitude', 'Longitude']]
df3

because the indices don't match:

   index  country        ID          Latitude  Longitude
0  23405  România        US10007N3R  NaN       NaN
1  23406  United States  NN00570709  NaN       NaN
2  23407  United States  NN00570710  NaN       NaN
3  23408  United States  NN00570744  NaN       NaN
4  23409  日本           US10007NAF  NaN       NaN
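A quick way to anticipate this is to compare the indices before assigning. The sketch below uses hypothetical toy frames src and dst. It shows the check, the label-aligned assignment that yields NaN, and a positional assignment that is only safe when both frames are known to have the same rows in the same order:

```python
import pandas as pd

# Hypothetical toy frames: src keeps the original labels, dst has a reset index (0, 1).
src = pd.DataFrame({"Latitude": [45.7192, 38.3754]}, index=[23405, 23406])
dst = pd.DataFrame({"ID": ["US10007N3R", "NN00570709"]})

# False here, because the two indices share no labels.
same_labels = dst.index.equals(src.index)

# Label-aligned assignment: no label matches, so every value becomes NaN.
aligned = dst.assign(Latitude=src["Latitude"])

# Positional assignment: the NumPy array carries no index, so rows are matched by order.
positional = dst.assign(Latitude=src["Latitude"].to_numpy())

print(same_labels)
print(aligned)
print(positional)
```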

To make it work we need to modify the code. We are going to use the column ID as a reference between the two DataFrames.

The two columns 'Latitude' and 'Longitude' will be set from DataFrame df1 to df3.

So to replace values from another DataFrame when the indices differ we can use:

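col = 'ID'
cols_to_replace = ['Latitude', 'Longitude']

# Select the rows whose ID appears in both DataFrames; .values drops the index,
# so the matched rows must be in the same order in df1 and df3
df3.loc[df3[col].isin(df1[col]), cols_to_replace] = df1.loc[df1[col].isin(df3[col]), cols_to_replace].values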

Now the values are correctly set:

       country        ID          Latitude  Longitude
23405  România        US10007N3R  45.7192   26.5230
23406  United States  NN00570709  38.3754   -118.8977
23407  United States  NN00570710  38.3917   -118.8941
23408  United States  NN00570744  38.3777   -118.8957
23409  日本           US10007NAF  36.9179   140.4262
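If the matched rows might not be in the same order, an ID-based lookup with Series.map avoids relying on row position. This is a sketch on hypothetical toy frames (df1_toy, df3_toy), not the tutorial's own method, and it assumes that ID is unique in the source frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; note that df3_toy lists the IDs in a different order.
df1_toy = pd.DataFrame(
    {"ID": ["a", "b", "c"], "Latitude": [1.0, 2.0, 3.0], "Longitude": [10.0, 20.0, 30.0]}
)
df3_toy = pd.DataFrame(
    {"ID": ["c", "a"], "Latitude": [np.nan, np.nan], "Longitude": [np.nan, np.nan]}
)

# Build a lookup table keyed by ID (assumes the IDs in df1_toy are unique).
lookup = df1_toy.set_index("ID")

for column in ["Latitude", "Longitude"]:
    # map() looks up each ID of df3_toy in the lookup table.
    df3_toy[column] = df3_toy["ID"].map(lookup[column])

print(df3_toy)
```

IDs that are missing from the lookup table produce NaN, so chain fillna with the old column if existing values should be kept for unmatched rows.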

Step 4: Insert new column with values from another DataFrame by merge

You can use the Pandas merge function to get values and columns from another DataFrame. For this you need a reference column shared by both DataFrames, or you can merge on the index.

In this example we are going to use the reference column ID: we will left join df1 onto df4. It's important to mention two points:

  • ID should hold unique values, otherwise the merge can duplicate rows
  • the column which is updated should not exist in the first DataFrame, df4, otherwise Pandas keeps both copies with the suffixes _x and _y

df4 = df4.merge(df1, on='ID', how='left')

The result is:

country        ID          Date        Time      Latitude  Longitude  Depth
România        US10007N3R  12/27/2016  23:20:56  45.7192   26.5230    97.0
United States  NN00570709  12/28/2016  08:18:01  38.3754   -118.8977  10.8

So every row with a match on ID (this column is unique per row) gets the corresponding data from df1.
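The uniqueness requirement can be checked by merge itself. The following sketch uses hypothetical toy frames (df4_toy, df1_toy) and two optional merge parameters: validate, which raises an error if the keys are not unique, and indicator, which adds a _merge column showing where each row found its match:

```python
import pandas as pd

# Hypothetical toy frames standing in for df4 and df1.
df4_toy = pd.DataFrame(
    {"country": ["România", "United States"], "ID": ["US10007N3R", "NN00570709"]}
)
df1_toy = pd.DataFrame(
    {"ID": ["US10007N3R", "NN00570709", "US10007NAF"], "Depth": [97.0, 10.8, 10.0]}
)

merged = df4_toy.merge(
    df1_toy,
    on="ID",
    how="left",
    validate="one_to_one",  # raises MergeError if ID is duplicated on either side
    indicator=True,         # adds a _merge column: both / left_only / right_only
)

print(merged)
```

Rows flagged left_only in _merge are the ones that found no match in the second DataFrame.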

Step 5: Update missing Values from Another DataFrame

In this step we are going to update only the missing values in a column of one DataFrame from another.

Let's start with the following DataFrame:

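import numpy as np

# Assumption: df5 starts as a copy of df2 holding only the country column
df5 = df2[['country']].copy()
df5.loc[:, ['ID', 'Longitude']] = df1[['ID', 'Longitude']]

# Blank out a few Longitude values to simulate missing data
df5.iloc[[1, 3, 4], -1] = np.nan
df5.head(5)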

with a few missing values in column Longitude:

       country        ID          Longitude
23405  România        US10007N3R  26.5230
23406  United States  NN00570709  NaN
23407  United States  NN00570710  -118.8941
23408  United States  NN00570744  NaN
23409  日本           US10007NAF  NaN

To update the values in df5 from df1 we can merge both DataFrames on a reference column, ID (which should be unique).

After the merge, Longitude_x holds the original values from df5 (with gaps) and Longitude_y holds the values from df1. We fill the missing values in Longitude_x with the values from Longitude_y.

Next we drop the helper column Longitude_y and rename Longitude_x back to Longitude:

df5 = df5.merge(df1[['Longitude', 'ID']], on='ID', how='left')

# Longitude_x comes from df5 (left side), Longitude_y from df1 (right side)
df5['Longitude_x'] = df5['Longitude_x'].fillna(df5['Longitude_y'])

df5 = df5.drop(columns=['Longitude_y'])
df5 = df5.rename(columns={'Longitude_x': 'Longitude'})

Result:

   country        ID          Longitude
0  România        US10007N3R  26.5230
1  United States  NN00570709  -118.8977
2  United States  NN00570710  -118.8941
3  United States  NN00570744  -118.8957
4  日本           US10007NAF  140.4262
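Note that the merge replaces the original index with 0, 1, 2, and so on. If the index has to be preserved, one possible alternative is to combine fillna with an ID-based lookup. This is a sketch on hypothetical toy frames (df1_toy, df5_toy), assuming ID is unique in the source; the first Longitude value is deliberately different from the source to show that only the gaps are filled:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; 99.0 is an existing value that must not be overwritten.
df1_toy = pd.DataFrame(
    {"ID": ["a", "b", "c"], "Longitude": [26.5230, -118.8977, 140.4262]}
)
df5_toy = pd.DataFrame(
    {"ID": ["a", "b", "c"], "Longitude": [99.0, np.nan, np.nan]},
    index=[23405, 23406, 23409],
)

# Lookup Series keyed by ID (assumes the IDs in df1_toy are unique).
lookup = df1_toy.set_index("ID")["Longitude"]

# map() returns a Series on df5_toy's index, so fillna() only touches the NaN cells.
df5_toy["Longitude"] = df5_toy["Longitude"].fillna(df5_toy["ID"].map(lookup))

print(df5_toy)
```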

Step 6: Update column from another column with np.where

Finally let's cover how a column can be added or updated from another column or DataFrame with np.where.

First we will check a condition on one column (here, Longitude greater than 10) and create another column from the result:

import numpy as np
df1['new_lon'] = np.where(df1['Longitude']>10,df1['Longitude'],np.nan)

If we have two DataFrames we can use similar syntax as follows:

df1['name'] = np.where(df2['Longitude']==1,df2['name'],df1['name'])

What is going on here is the following:

  • for each row, check whether Longitude in df2 equals 1
  • where it does, take name from df2; for all other rows keep name from df1
  • write the result of np.where back to the column name in df1

Note that np.where compares the rows by position, not by index, so the two DataFrames should have the same number of rows in the same order.
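A small self-contained sketch of this pattern follows. The frames a and b and the column flag are hypothetical. It also shows Series.mask, which gives the same result here but aligns on the index instead of on position:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames with the same number of rows in the same order.
a = pd.DataFrame({"name": ["x", "y", "z"]})
b = pd.DataFrame({"flag": [1, 0, 1], "name": ["X", "Y", "Z"]})

# Index-aligned alternative: replace values of a["name"] where the condition is True.
masked = a["name"].mask(b["flag"] == 1, b["name"])

# Positional version: np.where returns a plain array that is written back to the column.
a["name"] = np.where(b["flag"] == 1, b["name"], a["name"])

print(a)
print(masked)
```

Rows 0 and 2 take the name from b because their flag is 1, while row 1 keeps its original value.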

Resources