When working with pandas, comparing DataFrames that contain NaN values can be confusing and error-prone. By default in Python, NaN is not equal to NaN in standard element-wise comparisons, which often leads to unexpected results.
Sample data:
import numpy as np
import pandas as pd
df1 = pd.DataFrame([[np.nan,1, np.nan, 3],[2, 1, np.nan,3]])
df2 = df1.copy()
|   | 0   | 1 | 2   | 3 |
|---|-----|---|-----|---|
| 0 | NaN | 1 | NaN | 3 |
| 1 | 2.0 | 1 | NaN | 3 |
Why does NaN break equality checks?
In NumPy and pandas, NaN represents missing data. According to the IEEE 754 floating-point standard, NaN is not equal to anything, not even another NaN. This affects comparisons like:
df1 == df2
result:
0 1 2 3
0 False True False True
1 True True False True
Even if both DataFrames have NaN in the same positions, the result will be False for those cells.
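The IEEE 754 rule is easy to verify on plain scalars, before it ever shows up in a DataFrame:

```python
import numpy as np

# IEEE 754: NaN compares unequal to everything, including itself
print(np.nan == np.nan)              # False
print(float('nan') == float('nan'))  # False
```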
Use DataFrame.equals() for proper comparison
If you want to check whether two DataFrames are truly identical, including NaNs in the same locations, use:
df1.equals(df2)
result:
True
Key benefits of equals():
- Treats NaNs in the same position as equal
- Requires same shape and values
- Does not require index/column labels to have the same type, as long as their values compare equal
This is the recommended way to compare DataFrames for equality.
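One caveat worth noting: `equals()` also requires matching dtypes, so an integer column is never equal to its float counterpart even when the numbers agree. A quick sketch with toy frames:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [1.0, np.nan]})
b = pd.DataFrame({'x': [1.0, np.nan]})
print(a.equals(b))   # True: NaNs in the same spot count as equal

c = pd.DataFrame({'x': [1, 2]})      # int64
d = pd.DataFrame({'x': [1.0, 2.0]})  # float64
print(c.equals(d))   # False: same numbers, different dtypes
```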
Normalize missing values before comparing
If one DataFrame uses empty strings and the other uses NaN, equals() will return False. You can standardize them first:
Replace NaN with empty strings
df1.fillna('') == df2.fillna('')
result:
0 1 2 3
0 True True True True
1 True True True True
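The normalization also works the other way around. A small illustrative example (toy frames, assuming one source left cells empty and the other recorded NaN):

```python
import numpy as np
import pandas as pd

# Same data, but one source uses '' and the other uses NaN
a = pd.DataFrame({'city': ['Oslo', '']})
b = pd.DataFrame({'city': ['Oslo', np.nan]})

print(a.equals(b))                      # False: '' and NaN differ
print(a.replace('', np.nan).equals(b))  # True after normalizing
```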
Element-wise comparison with NaNs treated as equal
If you need element-wise comparison logic on numeric data, use NumPy:
import numpy as np
np.isclose(df1, df2, equal_nan=True)
[[ True True True True]
[ True True True True]]
Or for mixed dtypes:
(df1.fillna('##NA##') == df2.fillna('##NA##'))
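The sentinel trick works on mixed dtypes because filling a numeric column with a string upcasts it to object, after which `==` compares the sentinels as equal. A minimal sketch, assuming `'##NA##'` never occurs as a real value:

```python
import numpy as np
import pandas as pd

# Mixed dtypes: a numeric and a string column, NaNs in both
left = pd.DataFrame({'n': [1.0, np.nan], 's': ['a', np.nan]})
right = left.copy()

# Filling NaN with a sentinel makes the missing cells compare equal
mask = left.fillna('##NA##') == right.fillna('##NA##')
print(mask.all().all())  # True
```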
Summary
- NaN != NaN in standard comparisons
- Use df.equals() for full DataFrame equality
- Normalize NaN and empty values if sources differ
- Use NumPy for element-wise comparisons with equal_nan
This avoids false mismatches and ensures consistent DataFrame comparisons when missing data is involved.
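The rules above could be bundled into a small helper. This is a hypothetical sketch (the name `frames_match` and the `'##NA##'` sentinel are my own choices, not a pandas API):

```python
import pandas as pd

def frames_match(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """Hypothetical helper: True when two frames hold the same data,
    treating NaNs in matching positions as equal."""
    if left.shape != right.shape:
        return False
    # Fast path: NaN-aware, but requires identical dtypes
    if left.equals(right):
        return True
    # Fallback: align missing-value markers, then compare element-wise
    return bool((left.fillna('##NA##') == right.fillna('##NA##')).all().all())
```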
Inspiration
This article was inspired by the Kaggle Data Cleaning course, whose homework deepens the understanding with a dataset of fatal police shootings in the US. There we have:
(police_killings1 != police_killings).melt()['value'].sum()
which returns:
346
while:
(police_killings1.fillna(0) != police_killings.fillna(0)).melt()['value'].sum()
returns:
0
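The same effect can be reproduced with the small frame from the beginning of the article: every NaN cell registers as one "mismatch" in the `!=` mask, even though the two frames are identical.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[np.nan, 1, np.nan, 3], [2, 1, np.nan, 3]])
df2 = df1.copy()

# Each of the three NaN cells contributes one True to the != mask
print((df1 != df2).melt()['value'].sum())                      # 3
print((df1.fillna(0) != df2.fillna(0)).melt()['value'].sum())  # 0
```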
The reason for this difference is the missing values:
for col in police_killings1.columns:
    if sum(police_killings1[col] != police_killings[col]) > 0:
        print(col)
        print(police_killings1[police_killings1[col] != police_killings[col]][col])
        print(police_killings[police_killings1[col] != police_killings[col]][col])
result:
armed
615 NaN
1551 NaN
1715 NaN
1732 NaN
1825 NaN
...
2487 NaN
Name: armed, dtype: object