In this article, we will see how to compare (with highlight of differences) two columns in Pandas DataFrame.

If you like to see how to compare two DataFrames in Pandas please check: How to Compare Two Pandas DataFrames and Get Differences

Setup

In the post, we'll use the following DataFrame, which consists of several rows and columns:

import pandas as pd

data = {'col1': ['apple', 'pear', 'apple', 'orange', 'lemon'], 'col2': ['apple', 'pear', 'lemon', 'orange', 'apple']}
df = pd.DataFrame(data, columns = ['col1', 'col2'])

DataFrame looks like:

col1 col2
0 apple apple
1 pear pear
2 apple lemon
3 orange orange
4 lemon apple

Step 1: Compare with map and lambda

The first way to compare the two columns of a DataFrame is by chaining:

  • map
  • lambda
df.style.apply(lambda x: (x != df['col1']).map({True: 'background-color: yellow', False: ''}), subset=['col2'])

We need to enter the column names. After the execution the differences between the two columns will be highlighted in yellow:

  col1 col2
0 apple apple
1 pear pear
2 apple lemon
3 orange orange
4 lemon apple

Step 2: Compare and highlight by custom function

In this step we are going to define a function to compare the first two columns of a DataFrame. After that we are going to highlight the row which has a difference in the values.

The columns will be chosen by index: d.columns[0]

def highlight_diff(d):
    df = pd.DataFrame(columns=d.columns, index=d.index)
    
    col1 = d.columns[0]
    col2 = d.columns[1]
    df[[col1, col2]] = 'background: None'
    
    df.loc[d[col1].ne(d[col2]), [col1, col2]] = 'background: yellow'

    return df
    
df.style.apply(highlight_diff, axis=None)

The resulted DataFrame contains highlighted rows where information is different for the compared columns:

  col1 col2
0 apple apple
1 pear pear
2 apple lemon
3 orange orange
4 lemon apple

https://datascientyst.com/content/images/2022/08/compare-two-columns-highlight-pandas.png

Step 3. Compare and get difference

If we want to compare two columns and get only the different rows we can use method: compare():

df['col1'].compare(df['col2'])

In that case we will get only the rows which has some difference:

self other
2 apple lemon
4 lemon apple

Step 4. Compare columns of different DataFrames

If we need to compare columns from different DataFrames like the example below:

import pandas as pd

data = {'col1': ['apple', 'pear', 'apple', 'orange', 'lemon'],
        'col2': ['apple', 'pear', 'lemon', 'orange', 'apple']}
df1 = pd.DataFrame(data, columns = ['col1'])
df2 = pd.DataFrame(data, columns = ['col2'])

we can do direct comparison by:

df1['col1'] != df2['col2']

which will give us:

0    False
1    False
2     True
3    False
4     True
dtype: bool

or we can concatenate them into a new DataFrame:

pd.concat([df1['col1'], df2['col2']], axis=1)

and then apply the previous steps.

Step 5. Compare multiple columns

Let's see how we can compare multiple columns at the same time from one DataFrame. Let's work with the following DataFrame:

import pandas as pd

data = {'col1': ['apple', 'pear', 'apple', 'orange', 'lemon'],
        'col2': ['apple', 'pear', 'lemon', 'orange', 'apple'],
        'col3': [1,1,0,1,0], 'col4': [1,1,0,1,0]}
df = pd.DataFrame(data, columns = ['col1', 'col3', 'col2', 'col4'])

data looks like:

col1 col3 col2 col4
0 apple 1 apple 1
1 pear 1 pear 1
2 apple 0 lemon 0
3 orange 1 orange 1
4 lemon 0 apple 0

To compare the first two columns with the last two columns we can use broadcasting with method all():

(df.iloc[:, 0:2].values == (df.iloc[:, 2:4].values)).all(axis=1)

This will give us:

array([ True,  True, False,  True, False])

Showing that the first two rows have identical values for that comparison.

Depending on the context we can use this method to compare multiple columns with different operators like:

  • > - greater
  • - - subtraction
  • different column numbers

Conclusion

In this article we saw how to compare two columns in Pandas. We covered highlighting the differences or just extracting them.

We also covered how to compare multiple columns or columns from different DataFrames. It was also a way for conditional comparison of columns.