When working with pandas, comparing DataFrames that contain NaN values can be confusing and error-prone. By default in Python, NaN is not equal to NaN in standard element-wise comparisons, which often leads to unexpected results.
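
A quick way to see this at the scalar level (a minimal sketch using plain NumPy):

import numpy as np

np.nan == np.nan   # False
np.nan != np.nan   # True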

Sample data:

import numpy as np
import pandas as pd

df1 = pd.DataFrame([[np.nan, 1, np.nan, 3], [2, 1, np.nan, 3]])

df2 = df1.copy()

Both frames look like:

     0  1   2  3
0  NaN  1 NaN  3
1  2.0  1 NaN  3

Why NaN breaks equality checks

In NumPy and pandas, NaN represents missing data. According to the IEEE 754 standard, NaN is not equal to anything, not even another NaN. This affects comparisons like:

df1 == df2

result:
       0     1      2     3
0  False  True  False  True
1   True  True  False  True

Even if both DataFrames have NaN in the same positions, the result will be False for those cells.
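
To confirm that the False cells really come from shared NaNs rather than genuine differences, you can compare the mismatch mask with the positions where both frames are missing (a small sketch; the variable names are only illustrative):

mismatch = df1 != df2               # True wherever == reports a difference
both_nan = df1.isna() & df2.isna()  # True wherever both frames hold NaN
(mismatch == both_nan).all().all()  # True here: every mismatch is a shared NaN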

Use DataFrame.equals() for proper comparison

If you want to check whether two DataFrames are truly identical, including NaNs in the same locations, use:

df1.equals(df2)

result:

True

Key benefits of equals():

  • Treats NaNs in the same position as equal
  • Requires same shape and values
  • Ignores index/column type differences if values match

This is the recommended way to compare DataFrames for equality.
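
One caveat worth keeping in mind: equals() is strict about the dtype of the values, so the same numbers stored as different dtypes still compare as unequal (a small illustrative sketch):

df_int = pd.DataFrame({'a': [1, 2]})
df_float = pd.DataFrame({'a': [1.0, 2.0]})

df_int.equals(df_float)   # False: same values, but int64 vs float64 columns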

Normalize missing values before comparing

If one DataFrame uses empty strings and the other uses NaN, equals() will return False. You can standardize them first:

Replace NaN with empty strings

df1.fillna('') == df2.fillna('')

result:

      0     1     2     3
0  True  True  True  True
1  True  True  True  True
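
The same idea makes equals() usable when the two sources encode missing data differently. Suppose one side delivered empty strings where the other has NaN (df_blank below is a hypothetical frame built just for this example); filling both sides the same way makes them comparable again:

df_blank = df2.fillna('')   # stands in for a source that stores '' instead of NaN

df1.fillna('').equals(df_blank)

result:

True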

Element-wise comparison with NaNs treated as equal

If you need element-wise comparison logic, use NumPy:

import numpy as np

np.isclose(df1, df2, equal_nan=True)

result:

[[ True  True  True  True]
 [ True  True  True  True]]

Or for mixed dtypes:

(df1.fillna('##NA##') == df2.fillna('##NA##'))
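
A sentinel such as '##NA##' works, but it can collide with real data. A pandas-only alternative is to OR the plain comparison with a mask of positions where both frames are missing (a sketch, not the only way to do it):

(df1 == df2) | (df1.isna() & df2.isna())

which is True everywhere for the sample frames above.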

Summary

  • NaN != NaN in standard comparisons
  • Use df.equals() for full DataFrame equality
  • Normalize NaN and empty values if sources differ
  • Use NumPy for element-wise comparisons with equal_nan

This avoids false mismatches and ensures consistent DataFrame comparisons when missing data is involved.
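
If you need this check in several places, the ideas above fold into a small helper (frames_match is a hypothetical name, just a sketch of the approach, not a library function):

def frames_match(a, b):
    """True if two DataFrames have the same shape and values, with shared NaNs counted as equal."""
    if a.shape != b.shape:
        return False
    # element-wise equality, treating positions that are NaN in both frames as matches
    return bool(((a == b) | (a.isna() & b.isna())).all().all())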

Inspiration

This article was inspired by the Kaggle course on data cleaning:

The homework deepens the understanding using a dataset of fatal police shootings in the US:

There, the comparison:

(police_killings1 != police_killings).melt()['value'].sum()

results in:

346

while:

(police_killings1.fillna(0) != police_killings.fillna(0)).melt()['value'].sum()

results in:

0

The reason for this difference is the missing values:

# Find every column where the two frames disagree under == and
# print the differing values from both frames (they turn out to be NaNs)
for col in police_killings1.columns:
    if sum(police_killings1[col] != police_killings[col]) > 0:
        print(col)
        print(police_killings1[police_killings1[col] != police_killings[col]][col])
        print(police_killings[police_killings1[col] != police_killings[col]][col])

result:

armed
615     NaN
1551    NaN
1715    NaN
1732    NaN
1825    NaN
...
2487    NaN
Name: armed, dtype: object
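
Since every one of those mismatches turns out to be a shared NaN, the single recommended call should confirm the two copies are identical (assuming both were read from the same file and kept the same dtypes):

police_killings1.equals(police_killings)   # expected True: only shared NaNs differ under ==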

Resources