How to Drop Bad Lines with read_csv in Pandas

Here are two approaches to drop bad lines with read_csv in Pandas:

(1) Parameter on_bad_lines='skip' - Pandas >= 1.3

df = pd.read_csv(csv_file, delimiter=';', on_bad_lines='skip')

(2) error_bad_lines=False - Pandas < 1.3

df = pd.read_csv(csv_file, delimiter=';', error_bad_lines=False)

Suppose we have two files:

  • Single separator ;

    Date;Company A;Company A;Company B;Company B
    2021-09-06;1;7.9;2;6
    2021-09-07;1;8.5;2;7
    2021-09-08;2;8;1;8.1
    2021-09-09;2;8;1;"8.3;5.5"

  • Double separator ;;

    Date;;Company A;;Company A;;Company B;;Company B
    2021-09-06;;1;;7.9;;2;;6
    2021-09-07;;1;;8.5;;2;;7
    2021-09-08;;2;;8;;1;;8.1
    2021-09-09;;2;;8;;1;;"8.3;;5.5"

Since Pandas 1.3 parameters error_bad_lines and warn_bad_linesbool will be deprecated.

Instead of them there will be new separator:

on_bad_lines{‘error’, ‘warn’, ‘skip’}, default ‘error’

  • Specifies what to do upon encountering a bad line (a line with too many fields). Allowed values are :

    • error, raise an Exception when a bad line is encountered.
    • warn, raise a warning when a bad line is encountered and skip that line.
    • skip, skip bad lines without raising or warning when they are encountered.

Note that depending on the separator:

  • single
  • multiple
  • regex

The read_csv behavior can be different. You can check this article for more information: How to Use Multiple Char Separator in read_csv in Pandas

The reason for this is described in the documentation of Pandas:

Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.

For example for a single separator ";" this code will work fine:

Date Company A Company A.1 Company B Company B.1
2021-09-06 1 7.9 2 6
2021-09-07 1 8.5 2 7
2021-09-08 2 8.0 1 8.1
2021-09-09 2 8.0 1 8.3;5.5

While if you have a file with two separators you will get error or warning (depending on on_bad_lines) for the quoted line:

csv_file = '../data/csv/multine_bad_line_multi_sep.csv'

df = pd.read_csv(csv_file, delimiter=';;', engine='python', on_bad_lines='warn')

Skipping line 5: Expected 5 fields in line 5, saw 6. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

In this case only 3 rows will be read from the CSV file

Date Company A Company A.1 Company B Company B.1
2021-09-06 1 7.9 2 6.0
2021-09-07 1 8.5 2 7.0
2021-09-08 2 8.0 1 8.1

Resources