Here is the way to use multiple separators (regex separators) with read_csv in Pandas:
df = pd.read_csv(csv_file, sep=';;', engine='python')
Suppose we have a CSV file with the next data:
Date;;Company A;;Company A;;Company B;;Company B 2021-09-06;;1;;7.9;;2;;6 2021-09-07;;1;;8.5;;2;;7 2021-09-08;;2;;8;;1;;8.1 multine_separators
As you can see there are multiple separators between the values -
;;. In order to read this we need to specify that as a parameter -
Parameter delimiter is alias for sep
From the official docs:
sepstr, default ‘,’
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer.
In this post we are interested mainly in this part:
In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
If you try to read the above file without specifying the engine like:
df = pd.read_csv(csv_file, delimiter=';;')
you will get warning message like:
/home/vanx/PycharmProjects/datascientyst/venv/lib/python3.8/site-packages/pandas/util/_decorators.py:311: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
return func(*args, **kwargs)
So you need to specify the engine like:
df = pd.read_csv(csv_file, delimiter=';;', engine='python')
What does it mean:
Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
Let's add the following line to the CSV file:
If we try to read this file again we will get an error:
ParserError: Expected 5 fields in line 5, saw 6. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
You can skip lines which cause errors like the one above by using parameter:
on_bad_lines for Pandas > 1.3.
df = pd.read_csv(csv_file, delimiter=';;', engine='python', error_bad_lines=False)
Finally in order to use regex separator in Pandas: you can write:
df = pd.read_csv(csv_file, sep=r';+', engine='python')