A practical Pandas Cheat Sheet: Data Cleaning, useful for everyday work with data. This Pandas cheat sheet contains ready-to-use code snippets and steps for data cleaning.
The cheat sheet aggregates the most common Pandas operations for analyzing, fixing and removing incorrect, duplicate or wrong data.
This cheat sheet will act as a guide for data science beginners and help them with various fundamentals of data cleaning. Experienced users can use it as a quick reference.
- Data Cleaning Steps with Python and Pandas
- Data Cleaning - Kaggle course
- Working with missing data - Pandas Docs
- Data Cleaning Steps - kaggle
EDA
Exploratory Data Analysis
![](https://datascientyst.com/content/images/2022/10/eda.png)
df.info()
DataFrame columns, dtypes and memory
df.describe()
Returns summary statistics (count, mean, std, quartiles) for numeric columns
df.head(7)
Returns the first N rows (here 7)
df.sample(2)
Returns random sample rows (here 2)
df.shape
Returns DataFrame dimensions (rows, columns)
df.columns
Returns DataFrame column labels
Duplicates
Detect and Remove duplicates
![](https://datascientyst.com/content/images/2022/10/duplicates.png)
df.diet.nunique()
number of unique values in column
df.diet.unique()
unique values in column
df['col_1'].value_counts(dropna=False)
Returns a Series of unique values and their counts (dropna=False includes NaN)
df.duplicated(keep='last')
Marks duplicates as True, except the last occurrence of each
df.drop_duplicates(subset=['col_1'])
Drop duplicate rows based on the given column(s)
df[df.duplicated(keep=False)].index
Get the indexes of all duplicated rows
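A quick worked example of the duplicate helpers above, using a made-up DataFrame (the `city`/`visits` names are just for the demo):

```python
import pandas as pd

# hypothetical sample data with one exact duplicate row
df = pd.DataFrame({'city': ['Paris', 'Paris', 'Sofia', 'Rome'],
                   'visits': [10, 10, 5, 7]})

n_unique = df['city'].nunique()       # 3 distinct cities
dup_mask = df.duplicated(keep=False)  # True for every row that has a twin
deduped = df.drop_duplicates()        # keeps the first of each duplicate pair
```

Here `dup_mask` flags both copies of the Paris row, while `drop_duplicates()` keeps only one of them.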
Missing values
Working with missing data
![](https://datascientyst.com/content/images/2022/10/missing.png)
df.isna()
return True or False for missing values
df['col_1'].notna()
return True or False for non-NA data
s = df.isna().all()
s[s]
Columns which contain only NaN values
df.isna().any()
Detect columns with NaN values
df['col_1'].fillna(0)
Fill NaN values in the column with 0 (or any given value)
import seaborn as sns
sns.heatmap(df.isna(),cmap = 'Greens')
Plot missing values as a heatmap
import numpy as np
s.loc[0] = None
s.loc[0] = np.nan
Insert missing data (both are stored as NaN)
df.dropna(axis=0)
Drop rows with missing data
df.dropna(axis=1, how='any')
Drop columns with NaN values
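The missing-data helpers above can be combined on a toy frame (column names `a` and `b` are made up for the demo):

```python
import numpy as np
import pandas as pd

# toy frame with gaps: 'b' is entirely NaN
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, np.nan, np.nan]})

all_nan_cols = df.columns[df.isna().all()]   # columns that are entirely NaN
filled = df['a'].fillna(0)                   # NaN -> 0
rows_kept = df.dropna(axis=0, subset=['a'])  # drop rows where 'a' is missing
```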
Outliers
Detect and remove outliers
![](https://datascientyst.com/content/images/2022/10/outliers-1.png)
df['col_1'].describe()
detecting outliers with describe()
import seaborn as sns
sns.boxplot(data=df[['col_1', 'col_2']])
detect outliers with boxplot
q_low = df['col'].quantile(0.01)
q_hi = df['col'].quantile(0.99)
df[(df['col'] < q_hi)&(df['col'] > q_low)]
remove outliers with quantiles
import numpy as np
ab = np.abs(df['col']-df['col'].mean())
std = (3*df['col'].std())
df[ab <= std ]
remove outliers beyond 3 standard deviations
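The three-sigma rule from the snippet above, on made-up numbers with one extreme value:

```python
import numpy as np
import pandas as pd

# one wild value among otherwise ordinary data (made-up numbers)
df = pd.DataFrame({'col': [10] * 20 + [500]})

# keep rows within 3 standard deviations of the mean
ab = np.abs(df['col'] - df['col'].mean())
clean = df[ab <= 3 * df['col'].std()]
```

The extreme value 500 falls outside the three-sigma band and is dropped, while the 20 ordinary values remain.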
Wrong data
Detect wrong data
![](https://datascientyst.com/content/images/2022/10/wrongdata.png)
df[df['col_1'].str.contains(r'[@#&$%+\-/*]')]
Detect special symbols (the hyphen is escaped so it is not read as a character range)
df[df['col_1'].map(lambda x: x.isascii())]
Detect ASCII-only values (negate with ~ to find non-ASCII)
df['col']\
.loc[~df['col'].str.match(r'[0-9.]+')]
Find values that do not start with the pattern (digits and dots)
import numpy as np
np.where(df['col']=='',df['col2'],df['col'])
Replace empty strings with values from another column
df[df['col_1'].str.contains('[A-Za-z]')]
Detect latin symbols
df.applymap(np.isreal)
Returns True for numeric cells, False for non-numeric ones (use df.map in pandas 2.1+)
df['city'].str.len().value_counts()
Count values by length
Wrong format
Detect wrong format
![](https://datascientyst.com/content/images/2022/10/wrongformat.png)
df.apply(pd.to_numeric,errors='coerce')\
.isna().any()
detect wrong numeric format
pd.to_datetime(df['date_col'],errors='coerce')
detect wrong datetime format
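Both format checks above can be tried on sample strings where one value per column is deliberately malformed (the column names are made up):

```python
import pandas as pd

# sample strings: 'n/a' and 'not a date' have the wrong format
df = pd.DataFrame({'price': ['1.5', '2.0', 'n/a'],
                   'when': ['2022-10-01', 'not a date', '2022-10-03']})

bad_numeric = pd.to_numeric(df['price'], errors='coerce').isna()
bad_dates = pd.to_datetime(df['when'], errors='coerce').isna()
# True marks the values that failed to parse
```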
import pandas_dedupe
dd_df = pandas_dedupe.dedupe_dataframe(
df, field_properties=['col1', 'col2'],
canonicalize=['col1'], sample_size=0.8
)
Find typos and misspellings - deduplication and canonicalization with pandas_dedupe
from difflib import get_close_matches
w = ['apes', 'apple', 'peach', 'puppy']
get_close_matches('ape', w, n=3, cutoff=0.8)
Use difflib to find close matches
Fix errors
Fix errors in Pandas
![](https://datascientyst.com/content/images/2022/10/fixerrors.png)
df.convert_dtypes()
Convert the DataFrame to use best possible dtypes
df.astype({'col_1': 'int32'})
Cast col_1 to int32 using a dictionary
df.fillna(method='ffill')
Propagate the last valid value forward (df.ffill() in newer pandas)
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df.fillna(value=values)
Fill NaN per column using a dict of column: value
df.ffill(axis = 0)
Forward-fill down each column (axis=0 fills from the previous row)
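The per-column fill pattern above, on a toy frame (column names `A` and `B` are made up):

```python
import numpy as np
import pandas as pd

# toy frame: each column gets its own fill value from the dict
df = pd.DataFrame({'A': [np.nan, 2.0], 'B': [np.nan, np.nan]})
filled = df.fillna(value={'A': 0, 'B': 1})
```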
Replacing
Replace data in DataFrame
![](https://datascientyst.com/content/images/2022/10/replace.png)
df['col'] = df['col'].str.replace(' M', '')
replace string from column
df['col'].str.replace(' M', '')\
.fillna(0).astype(int)
replace and convert column to integer
df['col'].str.replace('A7', '7', regex=False)
Replace values in column - no regex
df.replace(r'\r+|\n+|\t+','', regex=True)
Find and replace line breaks - new line, tab - regex
df['col'].str.replace(r'\s+', '', regex=True)
Replace multiple white spaces
df['col'].str.rstrip('\r\n')
Strip trailing line breaks from the right
p = r'<[^<>]*>'
df['col1'].str.replace(p, '', regex=True)
Replace HTML tags
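The replace-and-convert chain shown earlier in this section, on made-up strings with a unit suffix:

```python
import pandas as pd

# made-up values with a ' M' suffix and a missing entry
s = pd.Series(['15 M', '7 M', None])

# strip the suffix, fill the gap, then cast to integer
cleaned = s.str.replace(' M', '').fillna(0).astype(int)
```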
Drop
Drop rows, columns, index, condition
![](https://datascientyst.com/content/images/2022/10/drop.png)
df.drop('col_1', axis=1, inplace=True)
Drop one column by name
df.drop(['col1', 'col2'], axis=1)
Drop multiple columns by name
df.dropna(axis=1, how='any')
Drop columns with NaN values
df.drop(0)
Drop rows by index - 0
df.drop([0, 2, 4])
drop multiple rows
df[(df['col1'] > 0) & (df['col2'] != 'open')]
Keep rows matching the condition (drops the rest)
df.reset_index()
Reset the index (use drop=True to discard the old index)
Data cleaning steps
Below are the data cleaning steps that help ensure your dataset is fit for decision-making:
- Are there obvious errors
- mixed data
- unstructured data - different column lengths
- Is data consistent
- same units - km, m, etc
- same spelling - Paris vs Par
- same data types - '0.1' vs 0.1
- Missing values
- how many
- reasons for missing values
- Same format
- DD/MM/YYYY, YY-MM-DD
- '15 M', '15:00'
- Duplicates
- duplicates in rows and values
- drop duplicates
- dedupe - keep only one - Par / Paris / Paris (Fra)
- Outliers
- detect and remove outliers
- Data bias
- is data biased
- does it represent the whole population or part of it
- how data was selected
- Data noise
- unwanted data items and values
- invalid data
- Data leakage
- using information from outside the training dataset
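The steps above can be sketched as a minimal cleaning pipeline over made-up data (the `city`/`temp` columns and their values are invented for the demo):

```python
import numpy as np
import pandas as pd

# toy data: a duplicate row, a stray space, a unit suffix, a missing city
df = pd.DataFrame({'city': ['Paris', 'Paris', ' Sofia', None],
                   'temp': ['15 M', '15 M', '14 M', '990 M']})

df = df.drop_duplicates()                                  # duplicates
df['city'] = df['city'].str.strip()                        # consistency
df['temp'] = df['temp'].str.replace(' M', '').astype(int)  # wrong format
df = df.dropna(subset=['city'])                            # missing values
df = df[np.abs(df['temp'] - df['temp'].mean()) <= 3 * df['temp'].std()]  # outliers
```

After these steps only the two clean rows (Paris/15 and Sofia/14) remain.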
Note: Please add ideas and suggestions in the comments below. Thank you 💕