When working with data, you may need to compare two Excel or CSV files to:
- find differences
- detect updates
- history changes
- mistakes
- validate records etc
In this guide, we'll explore how to compare two Excel or CSV files using Pandas, with practical examples.
Why Compare Files with Pandas
Comparing datasets is useful when:
- Validating data consistency
- Detecting changes between old and new versions
- Finding missing or extra rows
- Comparing specific columns for modifications
- Data comes from different sources
Let's dive into two approaches to compare files using Pandas.
Example 1: Compare Two CSV Files and Find Differences
Step 1: Load the CSV Files into Pandas
Assume we have two CSV files:
- old_data.csv (original dataset)
- new_data.csv (updated dataset)
import pandas as pd
# Load the CSV files
df_old = pd.read_csv("old_data.csv")
df_new = pd.read_csv("new_data.csv")
print("Old Data:\n", df_old.head())
print("\nNew Data:\n", df_new.head())
We can already check if there are some differences between the rows and columnts for the files.
You can find the data for both CSV files below:
new_data.csv
name,age,city
Alice,25,New York
Bobby,30,Los Angeles
Charlie,35,Chicago
David,37,San Francisco
Ema,28,Houston
and
old_data.csv
name,age,city
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
David,40,San Francisco
Emma,28,Houston
Step 2: Find Differences Between the Files
To detect changes, we can use merge with an outer join and keep only the differences:
df_new.compare(df_old, keep_shape=False, keep_equal=True)
or keep the original data shape by:
df_new.compare(df_old, keep_shape=False, keep_equal=True)
result:
name | age | |||
---|---|---|---|---|
self | other | self | other | |
1 | Bobby | Bob | 30 | 30 |
3 | David | David | 37 | 40 |
4 | Ema | Emma | 28 | 28 |
From the result table we can see 3 differencies:
- Bobby <-> Bob
- 37 <-> 40
- Ema <-> Emma
If you like to find out how to highlight the changes you can check this article: How to Compare Two Pandas DataFrames and Get Differences
Step 3: Check number of differences
To check what is the number of the differences you can use:
((df_old.fillna(1) == df_new.fillna(1)).melt()['value'] == 0).sum()
which give us 3.
Example 2: Compare Specific Columns for Changes
If both files have the same structure but contain modified values, we can compare specific columns.
Step 1: Load the Excel Files
df_old = pd.read_excel("old_data.xlsx")
df_new = pd.read_excel("new_data.xlsx")
Note: At this step you can compare individual sheets. For example you can download the sheet from Google Drive by:
- Open the file from Google Sheets
- File
- Download
- Comma Separated Values (.csv)
Step 2: Identify Modified Rows
We use df.compare()
to highlight differences in matching rows.
# Compare only specific columns
differences = df_old.compare(df_new)
differences
output:
name | age | |||
---|---|---|---|---|
self | other | self | other | |
1 | Bobby | Bob | 30 | 30 |
3 | David | David | 37 | 40 |
4 | Ema | Emma | 28 | 28 |
Conclusion
With Pandas, comparing two CSV or Excel files is simple and effective. We explored:
- Detecting modified values using
compare()
- Extracting only differences
- Counting the differences
These methods help ensure data accuracy when working with different versions of datasets.