In this post, we will try to address how to deal with:

  • consecutive whitespaces as delimiters
  • irregular Separators

while reading a CSV file with Pandas read_csv() method.

Steps work with irregular separators

  • Inspect the CSV file
  • Select Pandas method:
    • read_csv
    • read_txt
    • read_fwf
  • Test reading with different parameters
    • skipinitialspace
    • engine
    • delim_whitespace
    • escapechar
  • Select separator - normal one or regex
    • sep='\s+'
    • sep='\r\t'

Data

Suppose we would like to read the following csv file into Pandas DataFrame: "data.csv":

Date   Company A   Company A   Company B   Company B
2021-09-06   1   7.9   2   6
2021-09-07   1   8.5   2   7
2021-09-08   2   8   1   8.1

Example - regex separator - sep='\s{3}'

To correctly read data into DataFrame we need to use a combination of arguments: sep='\s{3}', engine='python'.

This is needed because we have exactly 3 spaces as delimiter:

import pandas as pd
df = pd.read_csv('data.csv', sep='\s{3}', engine='python')

The resulted DataFrame is:

Date Company A Company A.1 Company B Company B.1
0 2021-09-06 1 7.9 2 6.0
1 2021-09-07 1 8.5 2 7.0
2 2021-09-08 2 8.0 1 8.1

Note: If we try to use:

  • pd.read_csv('data.csv') - return single column
  • pd.read_csv('data.csv', sep='\s+') - column names contain spaces
  • pd.read_csv('data.csv', sep='\s*') - column names contain spaces

all of them will return wrong columns.

Example - regex separator - two and more spaces

We can also define use a regex to define separators as follow: two and more spaces: sep=r"[ ]{2,}"

import pandas as pd
df = pd.read_csv('data.csv', sep='[ ]{2,}', engine='python')

Alternative solution might be: sep='\t\s+' combination of tabs and spaces as separator in Pandas.

Note: Why do we need engine='python'? This is explained from the warning:

Falling back to the 'python' engine because the 'c' engine does not support regex separators

Additional notes:

  • delim_whitespace is equivalent to sep='\s+'. If True sep can be skipped
  • skipinitialspace - skip spaces after delimiter
  • Separators longer than 1 character and different from '\s+'
    • will be interpreted as regular expressions
    • also force the use of the Python parsing engine.
  • Always analyze and clean spaces because can cause issues:
    • "hello " != " hello"
    • how values of empty spaces should be treated - errors, missing values etc?

Output

Resources