Pandas Cheat Sheet: Data Cleaning

A practical Pandas Cheat Sheet: Data Cleaning useful for everyday working with data. This Pandas cheat sheet contains ready-to-use codes and steps for data cleaning.

The cheat sheet aggregate the most common operations used in Pandas for: analyzing, fixing, removing - incorrect, duplicate or wrong data.

This cheat sheet will act as a guide for data science beginners and help them with various fundamentals of data cleaning. Experienced users can use it as a quick reference.

EDA

Exploratory Data Analysis

df.info()

DataFrame columns, dtypes and memory

df.describe()

Returns columns coverage and types

df.head(7)

Returns first N rows

df.sample(2)

return random samples

df.shape

return DataFrame dimensions

df.columns

returns DataFrame columns

Duplicates

Detect and Remove duplicates

df.diet.nunique()

number of unique values in column

df.diet.unique()

unique values in column

df['col_1'].value_counts(dropna=False)

return series of unique values and counts in column

df.duplicated(keep='last')

find duplicates and keep only the last record

df.drop_duplicates(subset=['col_1'])

drop duplicates from column(s)

df[df.duplicated(keep=False)].index

get indexes of all detected duplications:

Missing values

Working with missing data

df.isna()

return True or False for missing values

df['col_1'].notna()

return True or False for non-NA data

df.isna().all()
s[s == True]

Columns which contains only NaN values

df.isna().any()

Detect columns with NaN values

df['col_1'].fillna(0)

Fill NaN with string or 0

import seaborn as sns
sns.heatmap(df.isna(),cmap = 'Greens')

plot missing values

s.loc[0] = None
s.loc[0] = np.nan

Insert missing data

df.dropna(axis=0)

droping rows with missing data

df.dropna(axis=1, how='any')

Drop columns with NaN values

Outliers

Detect and remove outliers

df['col_1'].describe()

detecting outliers with describe()

import seaborn as sns
sns.boxplot(data=df[['col_1', 'col_2']])

detect outliers with boxplot

q_low = df['col'].quantile(0.01)
q_hi  = df['col'].quantile(0.99)
df[(df['col'] < q_hi)&(df['col'] > q_low)]

remove outliers with quantiles

import numpy as np
ab = np.abs(df['col']-df['col'].mean())
std = (3*df['col'].std())
df[ab <= std ]

remove outliers with standart deviation

Wrong data

Detect wrong data

df[df['col_1'].str.contains(r'[@#&$%+-/*]')]

Detect special symbols

df[df['col_1'].map(lambda x: x.isascii())]

Detect (non) ascii characters

df['col']\
    .loc[~df['col'].str.match(r'[0-9.]+')]

find pattern with regex

import numpy as np
np.where(df['col']=='',df['col2'],df['col'])

detect empty spaces

df[df['col_1'].str.contains('[A-Za-z]')]

Detect latin symbols

df.applymap(np.isreal)

detect non numeric rows

df['city'].str.len().value_counts()

Count values by lenght

Wrong format

Detect wrong format

df.apply(pd.to_numeric,errors='coerce')\
    .isna().any()

detect wrong numeric format

pd.to_datetime(df['date_col'],errors='coerce')

detect wrong datetime format

import pandas_dedupe
dd_df = pandas_dedupe.dedupe_dataframe(
    df, field_properties=['col1', 'col2'],
    canonicalize=['col1'], sample_size=0.8
)

Find typos and misspelling - Deduplication and canonicalization with pandas_dedupe

from difflib import get_close_matches
w =  ['apes', 'apple', 'peach', 'puppy']
get_close_matches('ape', w, n=3, cutoff=0.8)

Use difflib to find close matches

Fix errors

Fix errors in Pandas

df.convert_dtypes()

Convert the DataFrame to use best possible dtypes

df.astype({'col_1': 'int32'})

Cast col_1 to int32 using a dictionary

df.fillna(method='ffill')

Propagate non-null values forward or backward

values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df.fillna(value=values)

Replace all NaN elements in column with dict

df.ffill(axis = 0)

fill the missing values row wise

Replacing

Replace data in DataFrame

df['col'] = df['col'].str.replace(' M', '')

replace string from column

df['col'].str.replace(' M', '')\
    .fillna(0).astype(int)

replace and convert column to integer

df['col'].str.replace('A7', '7', regex=False)

Replace values in column - no regex

df.replace(r'\r+|\n+|\t+','', regex=True)

Find and replace line breaks - new line, tab - regex

df['col'].str.replace('\s+', '', regex=True)

Replace multiple white spaces

df['col'].str.rstrip('\r\n')

Replace line breaks from the right

p = r'<[^<>]*>'
df['col1'].str.replace(p, '', regex=True)

Replace HTML tags

Drop

Drop rows, columns, index, condition

df.drop('col_1', axis=1, inplace=True)

Drop one column by name

df.drop(['col1', 'col2'], axis=1)

Drop multiple columns by name

df.dropna(axis=1, how='any')

Drop columns with NaN values

df.drop(0)

Drop rows by index - 0

df.drop([0, 2, 4])

drop multiple rows

df[(df['col1'] > 0) & (df['col2'] != 'open')]

drop rows by condition

df.reset_index()

drop index

Pandas cheat sheet: data cleaning

Data cleaning steps

Below you can find the data cleaning steps in order to ensure that your dataset is good for decisions:

Are there obvious errors
- mixed data
- data without structured - different column lengths
Is data consistent
- same units - km, m, etc
- same way - Paris, Par
- same data types - '0.1', 0,1
Missing values
- how many
- reasons for missing values
Same format
- DD/MM/YYYY, YY-MM-DD
- '15 M', '15:00'
Duplicates
- duplicates in rows and values
- drop duplicates
- dedupe - keep only one - Par / Paris / Paris (Fra)
Outliers
- detect and remove outliers
Data bias
- is data biased
- does it represent the whole population or part of it
- how data was selected
Data noise
- unwanted data items and values
- invalid data
Data leakage
- use information from outside of the training model

Note: Please add ideas and suggestions in the comments below. Thanks you 💕