A practical Pandas cheat sheet for data cleaning, useful for everyday work with data. It contains ready-to-use code and steps for cleaning a dataset.
The cheat sheet aggregates the most common Pandas operations for analyzing, fixing, and removing incorrect, duplicate, or otherwise wrong data.
It will serve as a guide for data science beginners learning the fundamentals of data cleaning, while experienced users can use it as a quick reference.
- Data Cleaning Steps with Python and Pandas
- Data Cleaning - Kaggle course
- Working with missing data - Pandas Docs
- Data Cleaning Steps - Kaggle
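All snippets below assume pandas is imported as pd and that df is your DataFrame. A minimal, purely hypothetical one to experiment with:

import pandas as pd
import numpy as np

# hypothetical sample data - only for trying out the snippets below
df = pd.DataFrame({'col_1': [1, 2, 2, np.nan],
                   'col_2': ['a', 'b', 'b', np.nan]})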
EDA
df.info()      # dtypes, non-null counts, memory usage
df.describe()  # summary statistics for numeric columns
df.head(7)     # first 7 rows
df.sample(2)   # 2 random rows
df.shape       # (number of rows, number of columns)
df.columns     # column labels
Duplicates
df.diet.nunique()  # number of distinct values in column 'diet'
df.diet.unique()   # the distinct values themselves
df['col_1'].value_counts(dropna=False)  # frequency of each value, including NaN
df.duplicated(keep='last')              # flag duplicate rows, keeping the last occurrence
df.drop_duplicates(subset=['col_1'])    # drop rows with duplicate values in 'col_1'
df[df.duplicated(keep=False)].index     # index labels of all duplicated rows
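A tiny worked example (hypothetical data) showing how the duplicate helpers behave:

import pandas as pd

df = pd.DataFrame({'city': ['Paris', 'Berlin', 'Paris'],
                   'country': ['France', 'Germany', 'France']})
df.duplicated(keep='last')     # row 0 -> True, rows 1 and 2 -> False
df.drop_duplicates()           # keeps only the first Paris row
df[df.duplicated(keep=False)]  # shows both Paris rows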
Missing values
df.isna()              # True where a value is missing
df['col_1'].notna()    # True where a value is present
df.isna().all()        # columns in which every value is missing
s[s == True]           # keep only the True entries of a boolean Series
df.isna().any()        # columns with at least one missing value
df['col_1'].fillna(0)  # replace missing values with 0
import seaborn as sns
sns.heatmap(df.isna(), cmap='Greens')  # visualize the missing values
import numpy as np
s.loc[0] = None    # set a value to missing
s.loc[0] = np.nan  # equivalent, using NumPy's NaN
df.dropna(axis=0)             # drop rows with any missing value
df.dropna(axis=1, how='any')  # drop columns with any missing value
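A per-column count of missing values is often the quickest summary; these idioms complement the calls above:

df.isna().sum()         # missing values per column
df.isna().sum().sum()   # total missing values in the whole DataFrame
df.isna().mean() * 100  # percentage of missing values per column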
Outliers
df['col_1'].describe()  # min, max and quartiles help spot outliers
import seaborn as sns
sns.boxplot(data=df[['col_1', 'col_2']])  # outliers appear as points beyond the whiskers
q_low = df['col'].quantile(0.01)  # 1st percentile
q_hi = df['col'].quantile(0.99)   # 99th percentile
df[(df['col'] < q_hi) & (df['col'] > q_low)]  # keep only values between the two
import numpy as np
ab = np.abs(df['col'] - df['col'].mean())  # distance from the mean
std = 3 * df['col'].std()                  # three standard deviations
df[ab <= std]                              # keep values within three standard deviations
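A further common approach, not shown above, is the 1.5 * IQR rule; a minimal sketch assuming a numeric column 'col':

q1 = df['col'].quantile(0.25)
q3 = df['col'].quantile(0.75)
iqr = q3 - q1            # interquartile range
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
df[df['col'].between(lower, upper)]  # keep only the non-outliers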
Wrong data
df[df['col_1'].str.contains(r'[@#&$%+\-/*]')]  # rows containing special characters
df[df['col_1'].map(lambda x: x.isascii())]     # rows whose values are ASCII-only
df['col'].loc[~df['col'].str.match(r'[0-9.]+')]  # values that do not start with digits or dots
import numpy as np
np.where(df['col'] == '', df['col2'], df['col'])  # fill empty strings from another column
df[df['col_1'].str.contains('[A-Za-z]')]          # rows containing letters
df.applymap(np.isreal)  # True where a value is numeric (use df.map in pandas >= 2.1)
df['city'].str.len().value_counts()  # value lengths reveal truncated or malformed entries
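When a column mixes types (a frequent cause of wrong data), inspecting the Python types that actually occur can pinpoint the problem; 'col_1' is a hypothetical column name:

df['col_1'].map(type).value_counts()  # e.g. counts of <class 'str'> vs <class 'float'>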
Wrong format
df.apply(pd.to_numeric, errors='coerce')\
    .isna().any()  # columns containing values that fail numeric conversion
pd.to_datetime(df['date_col'], errors='coerce')  # invalid dates become NaT
import pandas_dedupe  # pip install pandas-dedupe
dd_df = pandas_dedupe.dedupe_dataframe(
    df, field_properties=['col1', 'col2'],
    canonicalize=['col1'], sample_size=0.8
)  # interactive, machine-learning based deduplication
from difflib import get_close_matches
w = ['apes', 'apple', 'peach', 'puppy']
get_close_matches('ape', w, n=3, cutoff=0.8)  # ['apes'] - fuzzy matching of similar values
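A small worked example (hypothetical values) of coercing bad dates: with errors='coerce', anything unparseable, including impossible dates, becomes NaT:

import pandas as pd

s = pd.Series(['2021-01-15', '2021-02-30', 'not a date'])
pd.to_datetime(s, errors='coerce')  # 2021-01-15, NaT, NaT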
Fix errors
df.convert_dtypes()            # convert columns to the best possible dtypes
df.astype({'col_1': 'int32'})  # cast a column to a specific dtype
df.fillna(method='ffill')      # forward fill (method= is deprecated in pandas 2.x)
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df.fillna(value=values)        # fill each column with its own value
df.ffill(axis=0)               # preferred spelling of the forward fill above
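For numeric gaps, interpolation is another option (not covered above); a minimal sketch:

df['col_1'].interpolate()  # linear interpolation between the surrounding known values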
Replacing
df['col'] = df['col'].str.replace(' M', '')  # remove a literal substring
df['col'].str.replace(' M', '')\
    .fillna(0).astype(int)                     # clean, then convert to int
df['col'].str.replace('A7', '7', regex=False)  # literal (non-regex) replacement
df.replace(r'\r+|\n+|\t+', '', regex=True)     # strip newlines and tabs everywhere
df['col'].str.replace(r'\s+', '', regex=True)  # remove all whitespace
df['col'].str.rstrip('\r\n')                   # strip trailing newline characters
p = r'<[^<>]*>'
df['col1'].str.replace(p, '', regex=True)      # remove HTML tags
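Series.replace with a mapping is handy for normalizing value variants in one pass (the values here are hypothetical):

df['col'].replace({'Par': 'Paris', 'Paris (Fra)': 'Paris'})  # unify spelling variants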
Drop
df.drop('col_1', axis=1, inplace=True)  # drop a column in place
df.drop(['col1', 'col2'], axis=1)       # drop several columns
df.dropna(axis=1, how='any')            # drop columns containing missing values
df.drop(0)                              # drop the row with index label 0
df.drop([0, 2, 4])                      # drop several rows by label
df[(df['col1'] > 0) & (df['col2'] != 'open')]  # keep only rows matching a condition
df.reset_index()  # rebuild the index afterwards (pass drop=True to discard the old one)
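Dropping by condition can also be expressed via the matching index labels; a minimal sketch with a hypothetical condition:

df.drop(df[df['col1'] <= 0].index)  # drop the rows where the condition holds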
[Image: Pandas cheat sheet - data cleaning]
Data cleaning steps
Below are the data cleaning steps to go through in order to ensure that your dataset is good enough for decision making:
- Are there obvious errors?
  - mixed data types
  - unstructured data - different column lengths
- Is the data consistent?
  - same units - km, m, etc.
  - same spelling - Paris, Par
  - same data types - '0.1' (string) vs 0.1 (number)
- Missing values
  - how many are there
  - reasons for the missing values
- Same format
  - DD/MM/YYYY vs YY-MM-DD
  - '15 M' vs '15:00'
- Duplicates
  - duplicates in rows and values
  - drop duplicates
  - dedupe - keep only one of Par / Paris / Paris (Fra)
- Outliers
  - detect and remove outliers
- Data bias
  - is the data biased
  - does it represent the whole population or only part of it
  - how was the data selected
- Data noise
  - unwanted data items and values
  - invalid data
- Data leakage
  - using information from outside the training dataset to build the model - see the sketch after this list
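For the data leakage point above, a minimal sketch (assuming scikit-learn and a feature DataFrame X) of fitting a scaler on the training split only, so that no test-set information leaks into the model:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X is a hypothetical feature DataFrame
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # statistics learned from training data only
X_test_s = scaler.transform(X_test)        # reuse those statistics, never refit on test data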
Note: Please add ideas and suggestions in the comments below. Thank you 💕