How to Fix - UnicodeDecodeError: invalid start byte - during read_csv in Pandas
In this short guide, I'll show you how to solve the error UnicodeDecodeError: invalid start byte while reading a CSV file with Pandas:
pandas UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 6785: invalid start byte
The error can have several causes:
- the file uses an encoding other than UTF-8
- the file contains bytes that are invalid in its encoding
- the file is corrupted
Below you can find a quick solution for this error: pandas UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte.
df = pd.read_csv('file', encoding='utf-16')
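Adding encoding='utf-16' to Pandas read_csv() will solve the error when the file is actually encoded as UTF-16. Byte 0xff at position 0 is typical of a UTF-16 byte order mark.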
In the next steps you will find information on how to investigate and solve the error.
As always, all examples can be found in a handy Jupyter Notebook.
1: UnicodeDecodeError: invalid start byte while reading CSV file
To start, let's demonstrate the error: UnicodeDecodeError while reading a sample CSV file with Pandas.
The file content is shown below, as printed by the Linux command cat:
��a,b,c
1,2,3
We can see some strange symbols at the start of the file: ��
Calling read_csv on the file above raises an error:
df = pd.read_csv('../data/csv/file_utf-16.csv')
raised error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
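If you do not have such a file at hand, the failure can be reproduced without Pandas. The sketch below is only an illustration: it assumes the sample file is little-endian UTF-16 with a byte order mark, builds the same bytes in memory, and shows that decoding them as UTF-8 fails on the very first byte.

```python
import codecs

# Build the bytes of a small CSV saved as little-endian UTF-16 with a BOM.
# (Assumption: this mirrors how the sample file in this guide was saved.)
text = "a,b,c\n1,2,3\n"
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The first two bytes are the byte order mark: 0xff 0xfe.
first_two = raw[:2]

# Decoding those bytes as UTF-8 fails immediately, because 0xff can never
# start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(first_two)
print(message)
```

The message printed here is the same text that read_csv reports, since Pandas relies on the same UTF-8 decoding by default.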
2: Solution of UnicodeDecodeError: change read_csv encoding
The first solution to the UnicodeDecodeError is to change the encoding used by read_csv.
To use a different encoding, pass the encoding parameter:
df = pd.read_csv('../data/csv/file_utf-16.csv', encoding='utf-16')
and the file will be read correctly.
The unusual bytes at the start of the file suggested that the encoding is probably not utf-8.
To check the actual encoding of the CSV file, we can use the following Linux command (or Jupyter magic):
!file '../data/csv/file_utf-16.csv'
this will give us:
../data/csv/file_utf-16.csv: Little-endian UTF-16 Unicode text
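Where the file command is not available, a rough first check can be sketched in Python by looking for a byte order mark (BOM) at the start of the file. The helper name sniff_bom is hypothetical, and the check only recognises files that actually carry a BOM, so a result of None means "unknown", not "UTF-8".

```python
import codecs

# Known byte order marks. UTF-32 comes first because its little-endian BOM
# begins with the same two bytes as the UTF-16 little-endian BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(head: bytes):
    """Return a codec name if `head` starts with a known BOM, else None.

    Hypothetical helper for illustration; pass it the first few bytes
    of the file, read in binary mode.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


# Usage on in-memory bytes; for a real file, read the first 4 bytes with
# open(path, "rb").read(4) and pass them in.
print(sniff_bom(b"\xff\xfea\x00,\x00"))  # UTF-16 little-endian with BOM
print(sniff_bom(b"a,b,c\n1,2,3\n"))      # no BOM present
```

A non-None result can be passed directly as the encoding argument of read_csv.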
Other popular encodings are:
- cp1252
- iso-8859-1
- latin1 (in Python, an alias of iso-8859-1)
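When the encoding is unknown, one pragmatic approach is to try a short list of candidates in order. The following is a minimal sketch, not a reliable detector: the function name first_working_encoding and the candidate order are assumptions made for illustration. latin1 is placed last because it accepts every possible byte, so a successful decode with it proves nothing about correctness.

```python
def first_working_encoding(raw: bytes, candidates=("utf-8", "cp1252", "latin1")):
    """Return the first candidate codec that decodes `raw` without error.

    Hypothetical helper: a successful decode does not guarantee that the
    text is correct, only that no invalid byte was found.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# 0x97 is the byte from the error message at the top of this guide.
# It is invalid as a UTF-8 start byte but is a dash character in cp1252.
sample = b"id,note\n1,x\x97y\n"
chosen = first_working_encoding(sample)
print(chosen)
```

The returned name can then be passed to read_csv as encoding=chosen. It is still worth inspecting a few rows afterwards, because several single-byte encodings can decode the same file to different characters.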
Python's open() does not detect the encoding of a file. The encoding it reports is only the default used to open the file, so it may be wrong, as in this case:
with open('../data/csv/file_utf-16.csv') as f:
    print(f)
result:
<_io.TextIOWrapper name='../data/csv/file_utf-16.csv' mode='r' encoding='UTF-8'>
3: Solution of UnicodeDecodeError: skip encoding errors with encoding_errors='ignore'
Pandas read_csv has a parameter, encoding_errors, which defines how encoding errors are treated: skipped (encoding_errors='ignore') or raised.
The parameter is described as:
How encoding errors are treated.
Note: Important change in the new versions of Pandas:
Changed in version 1.3.0: encoding_errors is a new argument. encoding has no longer an influence on how encoding errors are handled.
Let's demonstrate how the encoding_errors parameter of read_csv works:
from pathlib import Path
import pandas as pd
file = Path('../data/csv/file_utf-8.csv')
file.write_bytes(b"\xe4\na\n1") # non utf-8 character
df = pd.read_csv(file, encoding_errors='ignore')
The above will result in:
a |
---|
1 |
To prevent Pandas read_csv from silently reading incorrect CSV data because of encoding problems, use encoding_errors='strict', which is the default behavior:
df = pd.read_csv(file, encoding_errors='strict')
This will raise an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0: invalid continuation byte
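The difference between the error modes can also be seen directly on the raw bytes with plain Python. This sketch assumes that encoding_errors accepts the names of Python's standard codec error handlers (such as 'strict', 'ignore' and 'replace'), which is how the Pandas documentation describes the possible values.

```python
raw = b"\xe4\na\n1"  # same bytes as the sample file written above

# 'ignore' silently drops the undecodable byte.
ignored = raw.decode("utf-8", errors="ignore")

# 'replace' keeps a visible marker (U+FFFD) where the bad byte was.
replaced = raw.decode("utf-8", errors="replace")

# 'strict' refuses to decode and raises UnicodeDecodeError.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored))
print(replaced.count("\ufffd"), "replacement character(s)")
print("strict raised:", strict_failed)
```

With 'ignore' the first line becomes empty, which is why the resulting DataFrame shows only the column a. 'replace' can be a safer choice when you want to see afterwards where data was damaged.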
A related encoding error, which typically appears when text is written or printed rather than read, is:
Pandas UnicodeEncodeError: 'charmap' codec can't encode character
4: Solution of UnicodeDecodeError: fix encoding errors with unicode_escape
The final solution to fix encoding errors like:
- UnicodeDecodeError
- UnicodeEncodeError
is to use the unicode_escape encoding. It can be described as:
Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decode from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default.
The 'unicode_escape' encoding can be passed to Pandas read_csv as:
df = pd.read_csv(file, encoding='unicode_escape')
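to prevent encoding errors. Note that this decodes the bytes as Latin-1 and interprets backslash escapes, so the resulting text may differ from the original data.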
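Before relying on unicode_escape, it helps to see what the codec does to a few byte strings. The short sketch below uses plain Python rather than Pandas and is meant only as an illustration of the codec's behaviour: bytes are interpreted as Latin-1, and backslash sequences in the data are turned into the characters they escape.

```python
# A Latin-1 encoded byte decodes to the expected character.
latin = b"caf\xe9".decode("unicode_escape")

# UTF-8 encoded text does not raise, but each byte is read as Latin-1,
# so the two-byte sequence for the accented letter becomes two characters.
utf8 = "caf\u00e9".encode("utf-8").decode("unicode_escape")

# A literal backslash followed by 'n' in the data becomes a real newline.
path = b"C:\\new".decode("unicode_escape")

print(len(latin), len(utf8))
print(repr(path))
```

So the codec avoids the UnicodeDecodeError, but for files that are really UTF-8, or that contain backslashes (Windows paths, for example), the values read may be silently altered. Passing the file's real encoding is usually the safer fix when it can be determined.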