How to Deal With Whitespace and Irregular Separators in Pandas Read CSV?
In this post, we will try to address how to deal with:
- consecutive whitespaces as delimiters
- irregular Separators
while reading a CSV file with Pandas read_csv()
method.
Steps work with irregular separators
- Inspect the CSV file
- Select Pandas method:
read_csv
read_txt
read_fwf
- Test reading with different parameters
skipinitialspace
engine
delim_whitespace
escapechar
- Select separator - normal one or regex
sep='\s+'
sep='\r\t'
Data
Suppose we would like to read the following csv file into Pandas DataFrame: "data.csv":
Date Company A Company A Company B Company B
2021-09-06 1 7.9 2 6
2021-09-07 1 8.5 2 7
2021-09-08 2 8 1 8.1
Example - regex separator - sep='\s{3}'
To correctly read data into DataFrame we need to use a combination of arguments: sep='\s{3}', engine='python'
.
This is needed because we have exactly 3 spaces as delimiter:
import pandas as pd
df = pd.read_csv('data.csv', sep='\s{3}', engine='python')
The resulted DataFrame is:
Date | Company A | Company A.1 | Company B | Company B.1 | |
---|---|---|---|---|---|
0 | 2021-09-06 | 1 | 7.9 | 2 | 6.0 |
1 | 2021-09-07 | 1 | 8.5 | 2 | 7.0 |
2 | 2021-09-08 | 2 | 8.0 | 1 | 8.1 |
Note: If we try to use:
pd.read_csv('data.csv')
- return single columnpd.read_csv('data.csv', sep='\s+')
- column names contain spacespd.read_csv('data.csv', sep='\s*')
- column names contain spaces
all of them will return wrong columns.
Example - regex separator - two and more spaces
We can also define use a regex to define separators as follow: two and more spaces: sep=r"[ ]{2,}"
import pandas as pd
df = pd.read_csv('data.csv', sep='[ ]{2,}', engine='python')
Alternative solution might be: sep='\t\s+'
combination of tabs and spaces as separator in Pandas.
Note: Why do we need engine='python'? This is explained from the warning:
Falling back to the 'python' engine because the 'c' engine does not support regex separators
Additional notes:
delim_whitespace
is equivalent tosep='\s+'
. If Truesep
can be skippedskipinitialspace
- skip spaces after delimiter- Separators longer than 1 character and different from '\s+'
- will be interpreted as regular expressions
- also force the use of the Python parsing engine.
- Always analyze and clean spaces because can cause issues:
- "hello " != " hello"
- how values of empty spaces should be treated - errors, missing values etc?