How to Read Excel or CSV With Multiple Line Headers Using Pandas
In this quick Pandas tutorial, we'll cover how we can read Excel sheet or CSV file with multiple header rowswith Python/Pandas. Reading multi-line headers with Pandas creates a MultiIndex.
Reading multiple headers from a CSV or Excel files can be done by using parameter - header
of method read_csv
:
import pandas as pd
df = pd.read_csv('../data/csv/multine_header.csv', header=[0,1])
In the rest of the article we will cover different examples and details about using header=[0,1]
.
Step 1: Sample CSV or Excel sheet
To start lets create a simple CSV file named: multine_header.csv
and show how we can read the multi rows header with Pandas read_csv method.
The content of the file is:
Date,Company A,Company A,Company B,Company A
,Rank,Points,Rank,Points
2021-09-06,1,7.9,2,6
2021-09-07,1,8.5,2,7
2021-09-08,2,8,1,8.1
Data from the above file shown in a tabular form is(the same is if we read the CSV without the multi row header):
Date | Company A | Company A.1 | Company B | Company B.1 |
---|---|---|---|---|
NaN | Rank | Points | Rank | Points |
2021-09-06 | 1 | 7.9 | 2 | 6 |
2021-09-07 | 1 | 8.5 | 2 | 7 |
2021-09-08 | 2 | 8 | 1 | 8.1 |
Step 2: Read CSV file with multiple headers
To read the above CSV file which has two headers we can use read_csv
with a combination of parameter header
.
The parameter is described as:
Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None.
So if a CSV file has two rows as a headers we can read them by:
import pandas as pd
df = pd.read_csv('../data/csv/multine_header.csv', header=[0,1])
Result:
Date | Company A | Company B | ||
---|---|---|---|---|
Unnamed: 0_level_1 | Rank | Points | Rank | Points |
2021-09-06 | 1 | 7.9 | 2 | 6.0 |
2021-09-07 | 1 | 8.5 | 2 | 7.0 |
2021-09-08 | 2 | 8.0 | 1 | 8.1 |
Now we can notice that DataFrame has two levels of columns. In next step we can find how to access the data.
Read CSV file with 3 and more headers
To read CSV file with more than two rows as headers we can use:
df = pd.read_csv('../data/csv/multine_header.csv', header=[0,1,2])
Step 3: Access data from multi-line header DataFrame
In order to access columns of the above DataFrame we need to use MultiIndex syntax.
First let's find what are the column names by:
df.columns
The column names are presented as tuple pairs because they are a MultiIndex:
MultiIndex([( 'Date', 'Unnamed: 0_level_1'),
('Company A', 'Rank'),
('Company A', 'Points'),
('Company B', 'Rank'),
('Company B', 'Points')],
So in order to access column Company A
- Rank
we will need to use the following syntax:
df[('Company A', 'Rank')]
result:
0 1
1 1
2 2
Name: (Company A, Rank), dtype: int64
and to read multiple columns from multi-line header we can use:
df[[('Company A', 'Rank'), ('Company B', 'Rank')]]
result:
Company A | Company B |
---|---|
Rank | Rank |
1 | 2 |
1 | 2 |
2 | 1 |
Step 4: Read Excel file with multiple headers
Pandas can read excel sheets with multiple headers the same way as the CSV files.
Below you can find the code for reading multiple headers from excel file:
pd.read_excel('../data/excel/multine_header.xlsx', sheet_name="multine_header", header=[0,1])
Where the file name is: multine_header.xlsx
, the sheet name is multine_header
.
To learn more about reading Excel files with Python and Pandas please check: Read Excel XLS with Python Pandas