How to Drop Bad Lines with read_csv in Pandas
Here are two approaches to drop bad lines with read_csv
in Pandas:
(1) Parameter on_bad_lines='skip' - Pandas >= 1.3
df = pd.read_csv(csv_file, delimiter=';', on_bad_lines='skip')
(2) error_bad_lines=False - Pandas < 1.3
df = pd.read_csv(csv_file, delimiter=';', error_bad_lines=False)
Suppose we have two files:
-
Single separator
;
Date;Company A;Company A;Company B;Company B
2021-09-06;1;7.9;2;6
2021-09-07;1;8.5;2;7
2021-09-08;2;8;1;8.1
2021-09-09;2;8;1;"8.3;5.5" -
Double separator
;;
Date;;Company A;;Company A;;Company B;;Company B
2021-09-06;;1;;7.9;;2;;6
2021-09-07;;1;;8.5;;2;;7
2021-09-08;;2;;8;;1;;8.1
2021-09-09;;2;;8;;1;;"8.3;;5.5"
Since Pandas 1.3 parameters error_bad_lines and warn_bad_linesbool will be deprecated.
Instead of them there will be new separator:
on_bad_lines{‘error’, ‘warn’, ‘skip’}, default ‘error’
-
Specifies what to do upon encountering a bad line (a line with too many fields). Allowed values are :
error
, raise an Exception when a bad line is encountered.warn
, raise a warning when a bad line is encountered and skip that line.skip
, skip bad lines without raising or warning when they are encountered.
Note that depending on the separator:
- single
- multiple
- regex
The read_csv behavior can be different. You can check this article for more information: How to Use Multiple Char Separator in read_csv in Pandas
The reason for this is described in the documentation of Pandas:
Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
For example for a single separator ";" this code will work fine:
Date | Company A | Company A.1 | Company B | Company B.1 |
---|---|---|---|---|
2021-09-06 | 1 | 7.9 | 2 | 6 |
2021-09-07 | 1 | 8.5 | 2 | 7 |
2021-09-08 | 2 | 8.0 | 1 | 8.1 |
2021-09-09 | 2 | 8.0 | 1 | 8.3;5.5 |
While if you have a file with two separators you will get error or warning (depending on on_bad_lines
) for the quoted line:
csv_file = '../data/csv/multine_bad_line_multi_sep.csv'
df = pd.read_csv(csv_file, delimiter=';;', engine='python', on_bad_lines='warn')
Skipping line 5: Expected 5 fields in line 5, saw 6. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
In this case only 3 rows will be read from the CSV file
Date | Company A | Company A.1 | Company B | Company B.1 |
---|---|---|---|---|
2021-09-06 | 1 | 7.9 | 2 | 6.0 |
2021-09-07 | 1 | 8.5 | 2 | 7.0 |
2021-09-08 | 2 | 8.0 | 1 | 8.1 |