Detect and Fix Errors in pd.to_datetime() in Pandas
In this article we will see how to solve errors related to pd.to_datetime()
in Pandas.
Typical errors for datetime conversion in Pandas are:
ParserError: Unknown string format: 27-02-2022 sdsd6:25:00 PM
AttributeError: Can only use .dt accessor with datetimelike values
ValueError: only leading negative signs are allowed
ValueError: time data '1975-02-23T02:58:41.000Z' does not match format '%H:%M:%S' (match)
This could be due to:
- mixed dates and time formats in single column
- wrong date format
- datetime conversion with errors
Setup
In this tutorial we will work with this dataset from Kaggle: earthquake-database.
As we can see from the image above there are errors related to mixed formats in the dataset:
Detect wrong dates in Pandas - errors='coerce'
First we will try to detect the wrong date of time format by using the date parameter - errors='coerce'
.
So after looking at the data we know that column "Time" has the following format: '%H:%M:%S'
.
Now we can try to parse all times in this column and extract only the ones which are parsed with errors:
df[pd.to_datetime(df['Time'], format='%H:%M:%S', errors='coerce').isna()]['Time']
the result is 3 rows with wrong format:
3378 1975-02-23T02:58:41.000Z
7512 1985-04-28T02:53:41.530Z
20650 2011-03-13T02:23:34.520Z
Name: Time, dtype: object
Detect mixed time formats - str.len()
Let's see how to detect problematic values in the DateTime column of Pandas DataFrame. We can find the wrong date format by using string methods.
We will count the length of each date time:
df['Time'].str.len().value_counts()
the results show us 3 problematic rows:
8 23409
24 3
Name: Time, dtype: int64
To extract the wrong or different ones we can do:
df[df['Time'].str.len() > 8]['Time']
Which will give us:
3378 1975-02-23T02:58:41.000Z
7512 1985-04-28T02:53:41.530Z
20650 2011-03-13T02:23:34.520Z
Name: Time, dtype: object
To exclude rows with different rows we can do:
df = df[df['Date'].str.len() == 8]
This way is better for for working with date or time formats like:
- HH:MM:SS
- dd/mm/YYYY
Infer date and time format from datetime
This option is best when we need to work with formats which has date and time like:
- '%Y-%m-%dT%H:%M:%S.%f'
- '%Y-%m-%dT%H:%M:%S'
First we will find the most frequent format in the column by:
import numpy as np
from pandas.core.tools.datetimes import _guess_datetime_format_for_array
array = np.array(df["Date"].to_list())
_guess_datetime_format_for_array(array)
This would give us:
'%m/%d/%Y'
Next we will try to convert the whole column with this format:
pd.to_datetime(df["Date"], format='%m/%d/%Y')
This will result into error:
ValueError: time data '1975-02-23T02:58:41.000Z' does not match format '%m/%d/%Y' (match)
Now we can exclude rows with this format or convert them with different format:
More errors related to pd.to_datetime()
Semantic date errors in Pandas
Sometimes there isn't a code error but the date is wrong. This is when there are parsing errors. For example day and month are wrongly inferred or there are two date formats. To find more to this problem check:
How to Fix Pandas to_datetime: Wrong Date and Errors
Convert string to datetime
If you want to find more about convert string to datetime and infer date formats you can check: Convert String to DateTime in Pandas