A practical Pandas cheat sheet for data cleaning, useful for everyday work with data. It contains ready-to-use code snippets and steps for cleaning data.
The cheat sheet aggregates the most common Pandas operations for analyzing, fixing, and removing incorrect, duplicate, or malformed data.
It will act as a guide for data science beginners and help them with the fundamentals of data cleaning. Experienced users can use it as a quick reference.
- Data Cleaning Steps with Python and Pandas
- Data Cleaning - Kaggle course
- Working with missing data - Pandas Docs
- Data Cleaning Steps - kaggle
EDA

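As a minimal sketch of these first-look helpers (the frame and column names below are made up):

```python
import pandas as pd

# hypothetical sample frame for demonstration
df = pd.DataFrame({'name': ['ann', 'bob', 'eve'], 'age': [31, 25, 40]})

df.info()              # prints dtypes, non-null counts, memory usage
stats = df.describe()  # summary statistics for the numeric column
first = df.head(2)     # first two rows
shape = df.shape       # (rows, columns)
cols = list(df.columns)
```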
df.info()          # dtypes, non-null counts, memory usage
df.describe()      # summary statistics for numeric columns
df.head(7)         # first 7 rows
df.sample(2)       # 2 random rows
df.shape           # (rows, columns)
df.columns         # column labels

Duplicates

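To see the duplicate helpers in action on a tiny, made-up frame:

```python
import pandas as pd

# hypothetical frame where the first and last rows are identical
df = pd.DataFrame({'city': ['Paris', 'Lyon', 'Paris'], 'pop': [2, 1, 2]})

n_cities = df['city'].nunique()     # 2 distinct cities
dupes = df.duplicated(keep=False)   # marks both copies of the repeated row
deduped = df.drop_duplicates(subset=['city'])   # keeps first Paris and Lyon
```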
df.diet.nunique()                        # number of distinct values
df.diet.unique()                         # the distinct values themselves
df['col_1'].value_counts(dropna=False)   # counts per value, including NaN
df.duplicated(keep='last')               # mark duplicates, keeping the last occurrence
df.drop_duplicates(subset=['col_1'])     # drop rows duplicated in col_1
df[df.duplicated(keep=False)].index      # index labels of all duplicated rows

Missing values

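A small demonstration of the missing-value helpers on made-up data with gaps:

```python
import numpy as np
import pandas as pd

# hypothetical frame: column 'b' is entirely missing
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan] * 3})

mask = df.isna()                # boolean mask of missing cells
all_missing = df.isna().all()   # True only for column 'b'
filled = df['a'].fillna(0)      # replace NaN with 0
no_na_cols = df.dropna(axis=1, how='any')   # both columns have NaN, so both go
```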
df.isna()                      # boolean mask of missing values
df['col_1'].notna()            # True where values are present
df.isna().all()                # columns in which every value is missing
s[s == True]                   # keep only the True entries of a boolean Series
df.isna().any()                # columns with at least one missing value
df['col_1'].fillna(0)          # replace NaN with 0

import seaborn as sns
sns.heatmap(df.isna(), cmap='Greens')   # visualize where values are missing

s.loc[0] = None                # set a value to missing
s.loc[0] = np.nan              # same, using NumPy's NaN

df.dropna(axis=0)              # drop rows with any missing value
df.dropna(axis=1, how='any')   # drop columns with any missing value

Outliers

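Both outlier filters in this section, sketched on made-up data with one extreme value (the plotting lines are omitted):

```python
import numpy as np
import pandas as pd

# hypothetical data: 19 ordinary values plus one extreme outlier
vals = [8, 9, 10, 11, 12] * 4
vals[-1] = 1000
df = pd.DataFrame({'col': vals})

# percentile filter: keep values strictly inside the 1st-99th percentile band
q_low, q_hi = df['col'].quantile(0.01), df['col'].quantile(0.99)
trimmed = df[(df['col'] < q_hi) & (df['col'] > q_low)]

# 3-sigma filter: keep values within 3 standard deviations of the mean
ab = np.abs(df['col'] - df['col'].mean())
kept = df[ab <= 3 * df['col'].std()]
```

Note that the percentile filter also clips legitimate extremes at both ends, while the 3-sigma filter only drops the far outlier.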
df['col_1'].describe()   # min/max and quartiles hint at outliers

import seaborn as sns
sns.boxplot(data=df[['col_1', 'col_2']])   # visualize outliers

# keep values between the 1st and 99th percentiles
q_low = df['col'].quantile(0.01)
q_hi = df['col'].quantile(0.99)
df[(df['col'] < q_hi) & (df['col'] > q_low)]

# keep values within 3 standard deviations of the mean
import numpy as np
ab = np.abs(df['col'] - df['col'].mean())
std = 3 * df['col'].std()
df[ab <= std]

Wrong data

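A quick illustration of these checks on made-up values (column names are hypothetical):

```python
import numpy as np
import pandas as pd

# hypothetical column mixing clean values with junk and an empty string
df = pd.DataFrame({'col_1': ['Paris', 'Par#s', '123', ''],
                   'col2': ['a', 'b', 'c', 'd']})

special = df[df['col_1'].str.contains(r'[@#&$%]')]    # rows with special characters
letters = df[df['col_1'].str.contains('[A-Za-z]')]    # rows containing letters
fixed = np.where(df['col_1'] == '', df['col2'], df['col_1'])   # fill empties from col2
```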
df[df['col_1'].str.contains(r'[@#&$%+\-/*]')]     # rows with special characters (note the escaped '-')
df[df['col_1'].map(lambda x: x.isascii())]        # rows whose values are ASCII-only
df['col'].loc[~df['col'].str.match(r'[0-9.]+')]   # values that do not look numeric

import numpy as np
np.where(df['col'] == '', df['col2'], df['col'])   # fill empty strings from another column

df[df['col_1'].str.contains('[A-Za-z]')]   # rows containing letters
df.applymap(np.isreal)                     # element-wise numeric check (pandas >= 2.1: df.map)
df['city'].str.len().value_counts()        # distribution of string lengths

Wrong format

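A sketch of the standard-library parts of this section on made-up data (the `pandas_dedupe` call needs the third-party pandas-dedupe package, so it is left out here):

```python
import pandas as pd
from difflib import get_close_matches

# hypothetical mixed-format column: numeric conversion flags the bad cell
s = pd.Series(['1', '2', 'x'])
converted = pd.to_numeric(s, errors='coerce')   # 'x' becomes NaN
has_bad = converted.isna().any()

# invalid dates become NaT instead of raising
dates = pd.to_datetime(pd.Series(['2024-01-31', 'not a date']), errors='coerce')

# fuzzy matching to unify near-identical spellings
w = ['apes', 'apple', 'peach', 'puppy']
matches = get_close_matches('ape', w, n=3, cutoff=0.8)
```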
# columns that fail numeric conversion
df.apply(pd.to_numeric, errors='coerce').isna().any()

# invalid dates become NaT instead of raising
pd.to_datetime(df['date_col'], errors='coerce')

# fuzzy deduplication with the pandas-dedupe package
import pandas_dedupe
dd_df = pandas_dedupe.dedupe_dataframe(
    df, field_properties=['col1', 'col2'],
    canonicalize=['col1'], sample_size=0.8
)

# find close string matches
from difflib import get_close_matches
w = ['apes', 'apple', 'peach', 'puppy']
get_close_matches('ape', w, n=3, cutoff=0.8)

Fix errors

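A small demonstration on a made-up frame (column names are hypothetical):

```python
import numpy as np
import pandas as pd

# hypothetical frame with missing values in both columns
df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [np.nan, 'x', 'y']})

filled = df.fillna(value={'A': 0, 'B': 'missing'})   # per-column fill values
forward = df.ffill()                                 # forward-fill down each column
cast = filled.astype({'A': 'int32'})                 # safe to cast once NaNs are gone
```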
df.convert_dtypes()             # infer the best possible dtypes
df.astype({'col_1': 'int32'})   # cast a column to a specific dtype
df.fillna(method='ffill')       # forward-fill (deprecated; prefer df.ffill())

# fill missing values column by column
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df.fillna(value=values)

df.ffill(axis=0)                # forward-fill down each column

Replacing

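To illustrate, here is a made-up column with a unit suffix and stray whitespace being cleaned into integers:

```python
import pandas as pd

# hypothetical raw values: ' M' suffix plus trailing line breaks
df = pd.DataFrame({'col': ['15 M', '20 M', ' 8 M\r\n']})

as_int = (df['col'].str.rstrip('\r\n')               # strip trailing line breaks
                   .str.replace(' M', '')            # remove the unit suffix
                   .str.replace(r'\s+', '', regex=True)   # remove leftover whitespace
                   .astype(int))

# strip simple HTML tags with a regex
html = pd.Series(['<b>bold</b>', 'plain'])
no_tags = html.str.replace(r'<[^<>]*>', '', regex=True)
```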
df['col'] = df['col'].str.replace(' M', '')             # remove a substring
df['col'].str.replace(' M', '').fillna(0).astype(int)   # clean, fill, and cast
df['col'].str.replace('A7', '7', regex=False)           # literal (non-regex) replacement
df.replace(r'\r+|\n+|\t+', '', regex=True)              # strip line breaks and tabs everywhere
df['col'].str.replace(r'\s+', '', regex=True)           # remove all whitespace
df['col'].str.rstrip('\r\n')                            # strip trailing line breaks

# remove HTML tags
p = r'<[^<>]*>'
df['col1'].str.replace(p, '', regex=True)

Drop

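A compact example on a made-up frame (column names are hypothetical):

```python
import pandas as pd

# hypothetical frame for dropping rows and columns
df = pd.DataFrame({'col1': [5, -1, 3], 'col2': ['open', 'open', 'closed']})

no_col = df.drop('col2', axis=1)                        # drop a column
no_rows = df.drop([0, 2])                               # drop rows by label
kept = df[(df['col1'] > 0) & (df['col2'] != 'open')]    # boolean filter
reindexed = kept.reset_index(drop=True)                 # rebuild the index
```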
df.drop('col_1', axis=1, inplace=True)   # drop a column in place
df.drop(['col1', 'col2'], axis=1)        # drop several columns
df.dropna(axis=1, how='any')             # drop columns with missing values
df.drop(0)                               # drop a row by label
df.drop([0, 2, 4])                       # drop several rows
df[(df['col1'] > 0) & (df['col2'] != 'open')]   # keep only rows matching a condition
df.reset_index()                         # rebuild the index after dropping

Data cleaning steps
Below are the data cleaning steps to go through to make sure your dataset is reliable enough for decision-making:
- Are there obvious errors?
  - mixed data
  - unstructured data - different column lengths
- Is the data consistent?
  - same units - km, m, etc.
  - same spelling - Paris, Par
  - same data types - '0.1' vs. 0.1
- Missing values
  - how many are there?
  - reasons for the missing values
- Same format
  - DD/MM/YYYY vs. YY-MM-DD
  - '15 M' vs. '15:00'
- Duplicates
  - duplicates in rows and values
  - drop duplicates
  - dedupe - keep only one: Par / Paris / Paris (Fra)
- Outliers
  - detect and remove outliers
- Data bias
  - is the data biased?
  - does it represent the whole population or only part of it?
  - how was the data selected?
- Data noise
  - unwanted data items and values
  - invalid data
- Data leakage
  - information from outside the training data being used by the model
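The checklist above can be sketched as a single small pipeline (a minimal illustration on made-up data; the column names are hypothetical):

```python
import pandas as pd

# hypothetical raw data touching several checklist items
raw = pd.DataFrame({
    'city': ['Paris', 'Paris', 'Par', None],
    'temp': ['15', '15', '1000', '12'],   # wrong type plus an outlier
})

df = raw.copy()
df['temp'] = pd.to_numeric(df['temp'], errors='coerce')   # consistent data types
df['city'] = df['city'].replace({'Par': 'Paris'})         # consistent spelling
df = df.dropna(subset=['city'])                           # missing values
df = df.drop_duplicates()                                 # duplicates
df = df[df['temp'] < 100]                                 # crude outlier cut
df = df.reset_index(drop=True)
```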
Note: please add ideas and suggestions in the comments below. Thank you 💕
