How to Merge CSV Files with Python (Pandas DataFrame)
In this short guide, we're going to merge multiple CSV files into a single CSV file with Python. We will also see how to read multiple CSV files - by wildcard matching - to a single DataFrame.
The code to merge several CSV files matched by pattern to a file or Pandas DataFrame is:
import glob
for f in glob.glob('file_*.csv'):
df_temp = pd.read_csv(f)
Setup
Suppose we have multiple CSV files like:
- file_202201.csv
- file_202202.csv
- file_202203.csv
into single CSV file like: merged.csv
1: Merge CSV files to DataFrame
To merge multiple CSV files to a DataFrame we will use the Python module - glob
. The module allow us to search for a file pattern with wildcard - *
.
import pandas as pd
import glob
df_files = []
for f in glob.glob('file_*.csv'):
df_temp = pd.read_csv(f)
df_files.append(df_temp)
df = pd.concat(df_files)
How does the code work?
- All files which match the pattern will be iterated in random order
- Temporary DataFrame is created for each file
- The temporary DataFrame is appended to list
- Finally all DataFrames are merged into a single one
2: Read CSV files without header
To skip the headers for the CSV files we can use parameter: header=None
read_csv('file_*.csv', header=None)
To add the headers only for the first file we can:
- read the first file with headers
- drop duplicates (keep first)
- set column names
All depends on the context.
3: Read sorted CSV files
Module glob
reads files without order. To ensure the correct order of the read CSV files we can use sorted
:
sorted(glob.glob('file_*.csv'))
This ensures that the final output CSV file or DataFrame will be loaded in a certain order.
Alternatively we can use parameters: ignore_index=True, , sort=True
for Pandas method concat
:
merged_df = pd.concat(all_df, ignore_index=True, sort=True)
4: Change CSV separator
We can control what is the separator symbol for the CSV files by using parameter:
sep = '\t'
Default one is comma.
5: Keep trace of CSV files
If we like to keep trace of each row loaded - from which CSV file is coming we can use: df_temp['file'] = f.split('/')[-1]
:
This will data a new column to each file with trace - the file name origin.
6: Merge CSV files to single with Python
Finally we can save the result into a single CSV file from Pandas Dataframe by:
df_merged.to_csv("merged.csv")
7. Full Code
Finally we can find the full example with most options mentioned earlier:
import pandas as pd
import glob
df_files = []
for f in sorted(glob.glob('file_*.txt')):
df_temp = pd.read_csv(f, header=None, index_col=False, sep='\t')
df_temp['file'] = f.split('/')[-1]
df_files.append(df_temp)
df_merged = pd.concat(df_files)
df_merged.to_csv("merged.csv")
Conclusion
We saw how to read multiple CSV files with Pandas and Python. Different options were covered like:
- changing separator for
read_csv
- keeping trace of the source file
- sorting files in certain order
- skipping headers