When working with data, you may need to compare two Excel or CSV files to:

  • find differences
  • detect updates
  • history changes
  • mistakes
  • validate records etc

In this guide, we'll explore how to compare two Excel or CSV files using Pandas, with practical examples.

Why Compare Files with Pandas

Comparing datasets is useful when:

  • Validating data consistency
  • Detecting changes between old and new versions
  • Finding missing or extra rows
  • Comparing specific columns for modifications
  • Data comes from different sources

Let's dive into two approaches to compare files using Pandas.

Example 1: Compare Two CSV Files and Find Differences

Step 1: Load the CSV Files into Pandas

Assume we have two CSV files:

  • old_data.csv (original dataset)
  • new_data.csv (updated dataset)
import pandas as pd

# Load the CSV files
df_old = pd.read_csv("old_data.csv")
df_new = pd.read_csv("new_data.csv")

print("Old Data:\n", df_old.head())
print("\nNew Data:\n", df_new.head())

We can already check if there are some differences between the rows and columnts for the files.

You can find the data for both CSV files below:

  • new_data.csv
name,age,city
Alice,25,New York
Bobby,30,Los Angeles
Charlie,35,Chicago
David,37,San Francisco
Ema,28,Houston

and

  • old_data.csv
name,age,city
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
David,40,San Francisco
Emma,28,Houston

Step 2: Find Differences Between the Files

To detect changes, we can use merge with an outer join and keep only the differences:

df_new.compare(df_old, keep_shape=False, keep_equal=True)

or keep the original data shape by:

df_new.compare(df_old, keep_shape=False, keep_equal=True)

result:

name age
self other self other
1 Bobby Bob 30 30
3 David David 37 40
4 Ema Emma 28 28

From the result table we can see 3 differencies:

  • Bobby <-> Bob
  • 37 <-> 40
  • Ema <-> Emma

If you like to find out how to highlight the changes you can check this article: How to Compare Two Pandas DataFrames and Get Differences

Step 3: Check number of differences

To check what is the number of the differences you can use:

((df_old.fillna(1) == df_new.fillna(1)).melt()['value'] == 0).sum()

which give us 3.

Example 2: Compare Specific Columns for Changes

If both files have the same structure but contain modified values, we can compare specific columns.

Step 1: Load the Excel Files

df_old = pd.read_excel("old_data.xlsx")
df_new = pd.read_excel("new_data.xlsx")

Note: At this step you can compare individual sheets. For example you can download the sheet from Google Drive by:

  • Open the file from Google Sheets
  • File
  • Download
  • Comma Separated Values (.csv)

Step 2: Identify Modified Rows

We use df.compare() to highlight differences in matching rows.

# Compare only specific columns
differences = df_old.compare(df_new)

differences

output:

name age
self other self other
1 Bobby Bob 30 30
3 David David 37 40
4 Ema Emma 28 28

Conclusion

With Pandas, comparing two CSV or Excel files is simple and effective. We explored:

  • Detecting modified values using compare()
  • Extracting only differences
  • Counting the differences

These methods help ensure data accuracy when working with different versions of datasets.