In this short guide, I'll show you **how to solve the error UnicodeDecodeError: invalid start byte while reading a CSV with Pandas**:

pandas UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 6785: invalid start byte

The error can have several different causes:

  • different encoding
  • bad symbols
  • corrupted file

Below you can find a quick solution for this error: pandas UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte.

df = pd.read_csv('file', encoding='utf-16')

If the file is UTF-16 encoded, adding encoding='utf-16' to Pandas read_csv() will solve the error.

The next steps show how to investigate and solve the error.

As always, all examples can be found in a handy Jupyter Notebook.

1: UnicodeDecodeError: invalid start byte while reading CSV file

To start, let's reproduce the UnicodeDecodeError by reading a sample CSV file with Pandas.

The file content, printed with the Linux command cat, is:

��a,b,c
1,2,3

We can see strange symbols at the start of the file: �� - this is the UTF-16 byte order mark, rendered as garbage because the terminal assumes UTF-8.

Using read_csv on the file above raises an error:

df = pd.read_csv('../data/csv/file_utf-16.csv')

The raised error is:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
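The failure can be reproduced end to end with a small script (the file name is hypothetical; Path.write_text creates the UTF-16 file, byte order mark included):

```python
from pathlib import Path

import pandas as pd

# Create a small UTF-16 CSV; the codec prepends a byte order mark,
# which is what the utf-8 decoder later trips over at position 0.
path = Path('file_utf-16.csv')
path.write_text('a,b,c\n1,2,3\n', encoding='utf-16')

try:
    pd.read_csv(path)  # read_csv defaults to utf-8
except UnicodeDecodeError as exc:
    print(exc)
```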

2: Solution of UnicodeDecodeError: change read_csv encoding

The first solution to the UnicodeDecodeError is to change the encoding passed to read_csv.

A different encoding can be set with the encoding parameter:

df = pd.read_csv('../data/csv/file_utf-16.csv', encoding='utf-16')

and the file will be read correctly.
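A quick round-trip check confirms this (the file name is hypothetical): writing a UTF-16 CSV and reading it back with encoding='utf-16' restores the data:

```python
from pathlib import Path

import pandas as pd

path = Path('file_utf-16.csv')  # hypothetical file name
path.write_text('a,b,c\n1,2,3\n', encoding='utf-16')

# With the matching encoding, the byte order mark is consumed by
# the codec and the header is parsed normally.
df = pd.read_csv(path, encoding='utf-16')
print(df.columns.tolist())  # ['a', 'b', 'c']
```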

The weird start of the file already suggested that the encoding was not utf-8.

To check the actual encoding of the CSV file, we can use the Linux file command (here run as a Jupyter shell command):

!file '../data/csv/file_utf-16.csv'

this will give us:

../data/csv/file_utf-16.csv: Little-endian UTF-16 Unicode text

Other popular encodings are:

  • cp1252
  • iso-8859-1
  • latin1
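If the file command is not available, a brute-force check in plain Python can try candidate encodings in order. This is only a sketch (sniff_encoding is a made-up helper, not a Pandas API), and note that latin1 decodes any byte sequence, so it must come last:

```python
def sniff_encoding(path, candidates=('utf-8', 'utf-16', 'cp1252', 'latin1')):
    """Return the first candidate encoding that decodes the file cleanly."""
    raw = open(path, 'rb').read()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeError:
            continue
    return None
```

A cleanly decoding candidate is not a guarantee of correctness - cp1252 and latin1 accept almost anything - but it narrows down the options quickly.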

Python can report a file's encoding, but the reported value may be wrong: open() simply assumes the default locale encoding instead of detecting the file's real one:

with open('../data/csv/file_utf-16.csv') as f:
    print(f)

result:

<_io.TextIOWrapper name='../data/csv/file_utf-16.csv' mode='r' encoding='UTF-8'>

3: Solution of UnicodeDecodeError: skip encoding errors with encoding_errors='ignore'

Pandas read_csv has a parameter, encoding_errors, which defines how encoding errors are treated: skipped (ignore) or raised (strict).

The parameter is described as:

How encoding errors are treated.

Note: Important change in the new versions of Pandas:

Changed in version 1.3.0: encoding_errors is a new argument. encoding has no longer an influence on how encoding errors are handled.

Let's demonstrate how the read_csv parameter encoding_errors works:

from pathlib import Path
import pandas as pd

file = Path('../data/csv/file_utf-8.csv')
file.write_bytes(b"\xe4\na\n1")  # non utf-8 character

df = pd.read_csv(file, encoding_errors='ignore')

The above results in:

a
1
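Besides 'ignore', the standard codec error handlers are accepted; for example 'replace' substitutes the Unicode replacement character U+FFFD instead of silently dropping bytes, which makes the damage visible (the file name below is hypothetical):

```python
from pathlib import Path

import pandas as pd

file = Path('file_bad.csv')  # hypothetical file name
file.write_bytes(b"\xe4\na\n1")  # first byte is invalid UTF-8

# 'replace' maps the bad byte to U+FFFD instead of dropping it,
# so the broken header is easy to spot.
df = pd.read_csv(file, encoding_errors='replace')
print(df.columns.tolist())  # ['\ufffd']
```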

To prevent Pandas read_csv from silently reading incorrect CSV data, use encoding_errors='strict' - which is the default behavior:

df = pd.read_csv(file, encoding_errors='strict')

This will raise an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0: invalid continuation byte

A related encoding error, which the same parameter can help with, is:

Pandas UnicodeEncodeError: 'charmap' codec can't encode character

4: Solution of UnicodeDecodeError: fix encoding errors with unicode_escape

The final solution to fix encoding errors like:

  • UnicodeDecodeError
  • UnicodeEncodeError

is to use the unicode_escape encoding. The Python codecs documentation describes it as:

Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decode from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default.

With Pandas read_csv, 'unicode_escape' is passed via the encoding parameter:

df = pd.read_csv(file, encoding='unicode_escape')

to prevent encoding errors.
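A quick sketch with the same broken bytes as in the earlier example (hypothetical file name) shows that unicode_escape decodes every byte, mapping 0xe4 to the Latin-1 character 'ä' rather than raising:

```python
from pathlib import Path

import pandas as pd

file = Path('file_escape.csv')  # hypothetical file name
file.write_bytes(b"\xe4\na\n1")  # invalid as UTF-8

# unicode_escape falls back to Latin-1 semantics for raw bytes,
# so every byte decodes to some character and no error is raised.
df = pd.read_csv(file, encoding='unicode_escape')
print(df.columns.tolist())  # ['ä']
```

Beware that unicode_escape also interprets backslash escape sequences, so text containing literal backslashes may be altered.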

Resources