Data Cleaning Steps with Python and Pandas

Real-world data often needs cleaning before analysis, and Python with Pandas is a common toolset for the job.

This tutorial explains the basic steps for data cleaning by example:

  • Basic exploratory data analysis
  • Detecting and removing missing data
  • Dropping unnecessary columns and rows
  • Detecting outliers
  • Handling inconsistent data
  • Removing irrelevant features

What is Data Cleaning? What is dirty Data?

First, let's define dirty data:

Dirty data is inaccurate, incomplete, or inconsistent data.

The common features of dirty data are:

  • spelling or punctuation errors
  • incorrect data associated with a field
  • incomplete data
  • outdated data
  • duplicated records

The process of fixing all issues above is known as data cleaning or data cleansing.
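As a small illustration, here is a toy DataFrame (hypothetical data, not the survey used below) that exhibits a few of the issues listed above: inconsistent spelling, a missing value, and a duplicated record.

```python
import pandas as pd

# A toy DataFrame showing a few kinds of dirty data:
# inconsistent spelling, a missing value, and a duplicate row
dirty = pd.DataFrame({
    'tool': ['Excel', 'excel', 'Excel', None],
    'year': [2019, 2019, 2019, 2018],
})

print(dirty['tool'].nunique())     # 2 - 'Excel' and 'excel' count as different values
print(dirty.duplicated().sum())    # 1 - one fully duplicated row
print(dirty['tool'].isna().sum())  # 1 - one missing value
```

Each of these problems is addressed by one of the cleaning steps below.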

The data cleaning process usually has several steps:

  • normalization (optional)
  • detect bad records
  • correct problematic values
  • remove irrelevant or inaccurate data
  • generate report (optional)

At the end of the process data should be:

  • complete
  • up to date
  • accurate
  • correct
  • consistent
  • relevant
  • normalized

For tidy data:

  • each observation is saved in its own row
  • each variable is saved in its own column
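As a quick illustration of tidying (on toy data, not the survey), melt can reshape a wide table so that each observation gets its own row:

```python
import pandas as pd

# Wide (untidy) layout: one column per year
wide = pd.DataFrame({'country': ['France', 'India'],
                     '2018': [10, 20],
                     '2019': [15, 25]})

# After melt, each observation (country, year, value) is its own row
tidy = wide.melt(id_vars='country', var_name='year', value_name='value')
print(tidy.shape)  # (4, 3)
```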

Setup

In this post we will use data from Kaggle: the 2019 Kaggle Machine Learning & Data Science Survey.

To read the data you need to use the following code:

import kaggle

link = 'eswarankrishnasamy/2019-kaggle-machine-learning-data-science-survey'

# Requires Kaggle API credentials (~/.kaggle/kaggle.json)
kaggle.api.authenticate()
kaggle.api.dataset_download_file(link, file_name='multiple_choice_responses.csv', path='data/')

The downloaded data can be read by:

import pandas as pd
df = pd.read_csv('data/multiple_choice_responses.csv.zip', low_memory=False)

Data looks like:

Time from Start to Finish (seconds) Q1 Q2 Q2_OTHER_TEXT Q3
Duration (in seconds) What is your age (# years)? What is your gender? - Selected Choice What is your gender? - Prefer to self-describe - Text In which country do you currently reside?
510 22-24 Male -1 France
423 40-44 Male -1 India
83 55-59 Female -1 Germany
391 40-44 Male -1 Australia

Step 1: Exploratory data analysis in Python and Pandas

To start we can do basic exploratory data analysis in Pandas. This will show us more about the data:

  • data types
  • shape and size
  • missing values
  • sample data

The first method is head(), which returns the first 5 rows of the dataset. (To see only the first 5 rows and 5 columns we can use df.iloc[0:5, 0:5].)

df.head()

The result is truncated for the first 5 columns:

Time from Start to Finish (seconds) Q1 Q2 Q2_OTHER_TEXT Q3
0 Duration (in seconds) What is your age (# years)? What is your gender? - Selected Choice What is your gender? - Prefer to self-describe - Text In which country do you currently reside?
1 510 22-24 Male -1 France
2 423 40-44 Male -1 India
3 83 55-59 Female -1 Germany
4 391 40-44 Male -1 Australia

Next, we can see the number of rows and columns with df.shape:

df.shape

The result is a tuple showing 19718 rows and 246 columns:

(19718, 246)

We can get similar information with df.info():

df.info()

result:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19718 entries, 0 to 19717
Columns: 246 entries, Time from Start to Finish (seconds) to Q34_OTHER_TEXT
dtypes: object(246)
memory usage: 37.0+ MB

Finally, we can get more detailed information about the data values with the describe() method. This method generates descriptive statistics (summarizing the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values).

df.describe()
Time from Start to Finish (seconds) Q1 Q2 Q2_OTHER_TEXT Q3
count 19718 19718 19718 19718 19718
unique 4169 12 5 46 60
top 450 25-29 Male -1 India
freq 42 4458 16138 19668 4786

Step 2: First rows as header with read_csv in Pandas

So far we saw that the first row contains data which belongs to the header. We need to change how we read the data, passing header=[0,1]:

df = pd.read_csv('data/multiple_choice_responses.csv.zip', low_memory=False, header=[0,1])

The above will read the multiline header from the CSV file.

In order to simplify reading the data we can drop a single level from the MultiIndex columns (note that droplevel returns a new DataFrame rather than modifying df):

df.droplevel(level=1, axis=1)

Step 3: Data tidying in Pandas

Next we can do data tidying, because tidy data works well with Pandas's vectorized operations.

For example, to read column 'Q1' we need to use the multi-index:

df[('Q1', 'What is your age (# years)?')]

The resulting data is:

0        22-24
1        40-44
2        55-59
3        40-44
4        22-24

Can we split that into two columns? It looks like all values are two numbers separated by a hyphen '-'. We can confirm that observation with:

df[('Q1', 'What is your age (# years)?')].value_counts()

The last rows show one value - 70+ - which needs special attention:

45-49                           949
50-54                           692
55-59                           422
60-69                           338
70+                             100
dtype: int64

If we perform the split operation, the rows containing 70+ will end up with a missing second part:

df[('Q1', 'What is your age (# years)?')].str.split('-', expand=True)

Output for a row containing 70+ (row 182):

0     70+
1    None
Name: 182, dtype: object

Step 4: Correcting and replacing data in Pandas

Next we can see how to correct the data above. We can do data correction of cases 70+ in two ways:

  • replace value of 70+ with something else
  • split the column and fill the NaN values

4.1. Replace values in column - Pandas

To replace the values in the column we can use the method .str.replace('70+', '70-120', regex=False) as follows:

df[('Q1', 'What is your age (# years)?')].str.replace('70+', '70-120', regex=False)

4.2. Fill NaN values after split - Pandas

The other option is to fill the missing values after the split by:

df['max_age'].fillna(120)

Here we assume that the split created a new column 'max_age' holding the upper bound of each age range.
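Putting the two steps together on a toy version of the age column (the names min_age/max_age and the 120 cap are assumptions for this sketch, not part of the survey data):

```python
import pandas as pd

# Toy version of the Q1 age column; '70+' has no upper bound
ages = pd.Series(['22-24', '40-44', '70+'])

# Strip the '+' so '70+' splits into a single part, then split on '-'
parts = ages.str.replace('+', '', regex=False).str.split('-', expand=True)
min_age = pd.to_numeric(parts[0])
max_age = pd.to_numeric(parts[1]).fillna(120)  # 120 is an arbitrary cap

print(max_age.tolist())  # [24.0, 44.0, 120.0]
```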

Step 5: Detect NaN values in column Pandas

Now let's see how we can detect NaN values. This will help us drop columns with NaN values.

5.1 Columns which contain only NaN values

To find columns which have only NaN values we can combine two methods:

  • isna()
  • all()
s = df.isna().all()

This will give a new Series with column name and True or False - depending on the NaN values. If a column has only NaN values we will get True.

To find columns which contain NaN values we can use:

s[s == True]

the result is:

Series([], dtype: bool)

There is no column which contains only NaN values.

5.2 Detect columns with NaN values

To detect columns which have NaN values we can use:

df.isna().any()

This will result into:

Time from Start to Finish (seconds)    False
Q1                                     False
Q2                                     False
Q2_OTHER_TEXT                          False
Q3                                     False
                                       ...  
Q34_Part_9                              True
Q34_Part_10                             True
Q34_Part_11                             True
Q34_Part_12                             True
Q34_OTHER_TEXT                         False
Length: 246, dtype: bool

So columns like 'Q34_Part_9' have NaN values. Columns like 'Q1' don't have NaN values.
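Beyond the boolean any(), counting missing values per column often helps decide what to drop. A sketch on hypothetical columns mimicking the survey layout:

```python
import pandas as pd
import numpy as np

# Hypothetical columns mimicking the survey layout
df_small = pd.DataFrame({'Q1': [1, 2, 3],
                         'Q34_Part_9': [1, np.nan, np.nan]})

# Count missing values per column, most affected first
print(df_small.isna().sum().sort_values(ascending=False))
```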

Step 6: Drop columns in Pandas

Let's say that we would like to drop columns based on name or NaN values. We can do that in several ways:

6.1 Drop one column by name

The parameters needed to drop a column are axis=1 and inplace=True - the latter means the operation modifies the DataFrame in place.

df.drop('Q1', axis=1, inplace=True)

6.2 Drop multiple columns by name

We can list several columns to be removed:

df.drop(['Q1', 'Q2'], axis=1, inplace=True)

6.3 Drop columns with NaN values

Finally we can drop columns which have NaN values:

df.dropna(axis=1, how='any')

We can use parameters like:

  • how - 'all' or 'any'
  • subset - labels to consider on the other axis
  • thresh - the minimum number of non-NaN values required to keep a column
  • inplace
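A sketch of these parameters on a small hypothetical DataFrame:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({'a': [1, 2, 3],
                    'b': [1, np.nan, np.nan],
                    'c': [np.nan, np.nan, np.nan]})

print(toy.dropna(axis=1, how='all').columns.tolist())  # ['a', 'b'] - drops only all-NaN 'c'
print(toy.dropna(axis=1, how='any').columns.tolist())  # ['a'] - keeps only complete columns
print(toy.dropna(axis=1, thresh=2).columns.tolist())   # ['a'] - needs >= 2 non-NaN values
```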

Step 7: Detect and drop duplicate rows in Pandas

To detect duplicate values in the DataFrame we can use the method duplicated(). To detect duplicate rows in Pandas DataFrame we can use:

df[df.duplicated()]

This results in 4 duplicated rows:

4 rows × 246 columns

We can use the parameter keep:

  • 'first' (default) - mark duplicates as True except for the first occurrence
  • 'last' - mark duplicates as True except for the last occurrence
  • False - mark all duplicates as True
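A quick sketch of how keep changes what duplicated() marks, using toy data:

```python
import pandas as pd

rows = pd.DataFrame({'x': [1, 1, 2]})

print(rows.duplicated().sum())             # 1: default keep='first' skips the first copy
print(rows.duplicated(keep='last').sum())  # 1: skips the last copy instead
print(rows.duplicated(keep=False).sum())   # 2: marks every copy
```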

For example, to get the indexes of all detected duplicates:

df[df.duplicated(keep=False)].index

Int64Index([11228, 12344, 16413, 16547, 16653, 18705, 19258, 19705], dtype='int64')

Since we have 246 columns (answers) it's pretty suspicious that there are full duplications.

We can use the method drop_duplicates() in order to drop duplicated rows in Pandas, optionally limiting the comparison to a subset of columns:

df.drop_duplicates(subset=['Q1', 'Q2'])

Step 8: Detect outliers in Pandas

We can detect outliers in Pandas in many ways. Here we will cover basic detection of numeric data:

  • check stats for the column - min, max and percentiles
  • visually by plotting values

Suppose we work with column: 'Time from Start to Finish (seconds)'

We can see the min, max and the percentiles (after converting the column to numeric) by:

pd.to_numeric(df[('Time from Start to Finish (seconds)', 'Duration (in seconds)')]).describe()

The result is:

count     19717.000000
mean      14341.281027
std       74166.106601
min          23.000000
25%         340.000000
50%         540.000000
75%         930.000000
max      843612.000000
Name: (Time from Start to Finish (seconds), Duration (in seconds)), dtype: float64

So we have time for the survey from 23 up to 843612 seconds. Probably we can exclude some of them.

Another way to detect outliers is visually, by plotting the data (for example with a box plot or histogram). From such a plot we can decide what threshold makes sense for us.
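Besides plotting, a simple numeric rule of thumb is the IQR fence. Here is a sketch on toy durations (the 1.5 multiplier is a common convention, not a requirement, and the data is made up to mimic the extreme survey time above):

```python
import pandas as pd

# Toy durations in seconds; 843612 mimics the extreme survey time
s = pd.Series([23, 340, 540, 930, 843612])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr  # everything above this fence is flagged

outliers = s[s > upper]
print(outliers.tolist())  # [843612]
```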

Step 9: Detect errors, typos and misspelling in Pandas

Finally let's check how we can detect typos and misspelled words in Pandas DataFrame. This will show how we can work with inconsistent or incomplete data.

For this purpose we are going to read the file 'other_text_responses.csv' into a DataFrame df_other, since it contains free-text input.

Let's inspect the value counts of the column at index 3 of this DataFrame:

df_other[df_other.columns[3]].value_counts().head(10)

The result is:

Excel              865
Microsoft Excel    392
excel              263
MS Excel            67
Google Sheets       61
Google sheets       44
Microsoft excel     38
Excel               33
microsoft excel     27
EXCEL               25

We can see different variations of the same tool - Excel.

In order to detect similar values we will use the Python library difflib:

import difflib
difflib.get_close_matches('excl', ['Excel', 'Microsoft Excel ', 'MS Excel', 'excel'], n=1, cutoff=0.7)

The result of this will be:

['excel']

So we can use Python in order to detect and fix misspelled words.

Code like the one below can help us create a new column with corrected values:

import difflib

correct_values = {}
# Iterate from the least to the most frequent spelling, so the most
# frequent variant ends up as the canonical value for its group
words = df_other["Q14_Part_3_TEXT"].value_counts(ascending=True).index

for keyword in words:
    # Find up to 20 spellings similar to the current keyword
    similar = difflib.get_close_matches(keyword, words, n=20, cutoff=0.6)
    for x in similar:
        correct_values[x] = keyword

df_other["corr"] = df_other["Q14_Part_3_TEXT"].map(correct_values)

Conclusion

In this article, we learned what clean data is and how to do data cleaning in Pandas and Python.

Among the topics we discussed were NaN values, duplicates, dropping columns and rows, and outlier detection.

We saw all the steps of the data cleaning process with examples. We covered important topics like tidy data and data quality.