Exploratory Data Analysis Python and Pandas with Examples

This article is about Exploratory Data Analysis(EDA) in Pandas and Python. The article will explain step by step how to do Exploratory Data Analysis plus examples.

EDA is an important step in Data Science. The goal of EDA is to identify errors, insights, relations, outliers and more.

The image below illustrate the data science workflow and where EDA is located:

Source: Exploratory Data Analysis - wikipedia

Background Story

Imagine that you are expecting royal guests for dinner. You are asked to research a special menu from a cooking book with thousands of recipes. As they are very pretentious you need to avoid some ingredients or find exact quantities for others. Dinner and launch menus are needed.

Unfortunately some recipes are wrong and others are incomplete.

How to start? You may start with the main ingredients. Another option is by selecting the meal type and course. There should be a special menu for Vegetarians.

The story above reminds of Exploratory Data Analysis(EDA). All the steps and questions are part of understanding unknown territory. Let's try to solve this with Data Science.

What is Exploratory Data Analysis?

Exploratory data analysis is like detective work: searching for insights that identify problems and hidden patterns.

Start with one variable at a time, then explore two variables, and so on.

Definition
Exploratory Data Analysis(EDA) - analyze and investigate datasets and summarize their main characteristics and apply visualization methods.

Tools for EDA:

Most popular Tools:

  • Pandas/Python
  • R

Techniques for EDA:

Visual techniques

Dimensionality reduction

Exploratory Data Analysis - Example Project

Let's do an example on Exploratory Data Analysis using Food Recipes.

We will do step by step analysis on this data set and answer on questions like:

  • What data do we have?
  • What is the dimension of this data?
  • Are there any dependent variables?
  • What are the data types?
  • Missing data?
  • Duplicate data?
  • Correlations?

You can find the notebook on and Google Colab:

Step 1: Load Data and Initial Analysis

First we will import the needed libraries and then load the dataset into Pandas:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/softhints/Pandas-Exercises-Projects/main/data/food_recipes.csv', low_memory=False)

Step 2: Initial Analysis of Pandas DataFrame

We will check the data by using the following methods:

  • df - returns first and last 5 records; returns number of rows and columns
  • head(n) - returns first n rows
  • tail(n) - returns last n rows
  • sample(n) - sample random n rows

The first 2 rows transposed looks like:

0 1
recipe_title Roasted Peppers And Mushroom Tortilla Pizza Recipe Thakkali Gotsu Recipe | Thakkali Curry | Spicy & Tangy Tomato Gravy
url https://www.archanaskitchen.com/roasted-peppers-and-mushroom-tortilla-pizza-recipe https://www.archanaskitchen.com/tomato-gotsu-recipe-spicy-tangy-tomato-gravy
record_health good good
vote_count 434 3423
rating 4.958525 4.932223
description is a quicker version pizza ....: also known as the is a ...
cuisine Mexican South Indian Recipes
course Dinner Lunch
diet Vegetarian Vegetarian
prep_time 15 M 10 M
cook_time 15 M 20 M
ingredients Tortillas|Extra Virgin Olive Oil|Garlic|Mozzarella cheese|Red Yellow or Green Bell Pepper (Capsicum)|Onions|Kalmatta olives|Button mushrooms Sesame (Gingelly) Oil|Mustard seeds (Rai/ Kadugu)|Curry leaves|Garlic|Pearl onions (Sambar Onions)|Tomatoes|Tamarind|Turmeric powder (Haldi)|Salt|Jaggery
instructions To begin making the Roasted ... To begin making Tomato Gotsu Recipe/ ....
author Divya Shivaraman Archana Doshi
tags Party Food Recipes|Tea Party Recipes|Mushroom Recipes|Fusion Recipes|Tortilla Recipe|Bell Peppers Recipes Vegetarian Recipes|Tomato Recipes|South Indian Recipes|Breakfast Recipe Ideas
category Pizza Recipes Indian Curry Recipes

We are lucky with:

df.sample(2)

as we can see missing values and non latin symbols:

recipe_title url record_health vote_count rating description cuisine course diet prep_time cook_time ingredients instructions author tags category
2400 Coconut Chia Pudding Recipe https://www.archanaskitchen.com/coconut-chia-pudding-recipe good 162 4.975309 Coconut Chia Pudding makes a great snack or even ... NaN NaN NaN NaN NaN Chia Seeds|Tender coconut water|Coconut milk|Orange|Blueberries|Pistachios|Honey|Mint Leaves (Pudina) To begin making the Coconut Chia Pudding recipe... Gauravi Vinay Healthy Recipes|Breakfast Recipe Ideas|Pudding Recipes|Coconut Recipes Dessert Recipes
1447 पंजाबी स्टाइल बूंदी कढ़ी रेसिपी - Punjabi Style Boondi Kadhi Recipe https://www.archanaskitchen.com/punjabi-style-boondi-kadhi-recipe-in-hindi good 685 4.954745 एक प्रसिद्ध कढ़ी है जिसे राजस्थान और पंजाब में बनाया जाता है. इसमें मेथी के दाने भी डाले जाते है ... Punjabi Side Dish Vegetarian 10 M 30 M दही|बेसन|हरी मिर्च|अदरक लहसुन का पेस्ट|लाल मिर्च पाउडर|हल्दी पाउडर|नमक|घी|लॉन्ग|पूरी काली मिर्च|जीरा|मेथी के दाने|हींग|बूंदी पंजाबी स्टाइल बूंदी कढ़ी रेसिपी बनाने के लिए सबसे पहले एक बाउल में ... Archana's Kitchen Side Dish Recipes|Indian Lunch Recipes|Dahi Recipes (Curd/Yogurt)|Indian Dinner Recipes|Winter Recipes Kadhi Recipes

The initial findings for this dataset are:

Insights

  • dataset has header column
  • separator is ';'
  • NaN represents missing values
  • 2 columns with nested data - ingredients and tags
    • if we take closer look we can find values divided by pipes - "|"
    • for further analysis - data need to be normalized
  • there are non latin characters in this dataset

Let's continue with the analysis and not rely on luck or first impressions.

Step 3: Dimension and data types of DataFrame

Next we will use several Pandas methods in order to find information for the columns, dimensions, nested data and more:

df.shape

will return the dimension of this dataset:

(8009, 16)

We can get columns by:

df.columns
Index(['recipe_title', 'url', 'record_health', 'vote_count', 'rating',
       'description', 'cuisine', 'course', 'diet', 'prep_time', 'cook_time',
       'ingredients', 'instructions', 'author', 'tags', 'category'],
      dtype='object')

To get information about the data types we can use:

df.info()
df.dtypes

which results into:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8009 entries, 0 to 8008
Data columns (total 16 columns):
Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   recipe_title   8009 non-null   object
 1   url            8009 non-null   object
 2   record_health  8009 non-null   object
 3   vote_count     8009 non-null   int64  
 4   rating         8009 non-null   float64
 5   description    7994 non-null   object
 6   cuisine        7943 non-null   object
 7   course         7854 non-null   object
 8   diet           7858 non-null   object
 9   prep_time      7979 non-null   object
 10  cook_time      7979 non-null   object
 11  ingredients    7997 non-null   object
 12  instructions   8009 non-null   object
 13  author         8009 non-null   object
 14  tags           7930 non-null   object
 15  category       8009 non-null   object
dtypes: float64(1), int64(1), object(14)
memory usage: 1001.2+ KB

The above gives use:

  • 3 data types - float64(1), int64(1), object(14)
  • memory usage: 1001.2+ KB
  • information for the rows and the columns

By using the method df.shape() we can find the total number of rows and columns. Which has synonyms:

  • row - observation, record, trial
  • column - variable, feature, characteristic

Data insights from this step:

Insights:

  • dataset contains - 8009 rows and 16 columns
    • 8009 observations
    • 16 characteristics
  • 2 numeric columns
  • 2 which need conversion - prep_time and cook_time
    • convert string to int
    • 15 M -> 15

Step 4: Statistics - min, max, mean, percentile

We can get useful statistics for numeric and non numeric columns of this DataFrame.

4.1. Descriptive statistics for numeric columns

For numeric columns we can use:

df.describe()

which returns stats like:

  • mean
  • standard deviation
  • percentile
  • count
  • min and max
vote_count rating
count 8009.000000 8009.000000
mean 2268.004495 4.888621
std 3683.156570 0.077467
min 15.000000 3.175705
25% 494.000000 4.865031
50% 1050.000000 4.900553
75% 2487.000000 4.930000
max 80628.000000 5.000000

Insights:

  • mean value is close to 75%
  • no missing values
  • big difference between max and 75% percentile for vote_count - potential outliers

4.2. Pandas describe() for non numeric columns

For the rest of the columns we can use include='object':

df.describe(include='object')

the result is:

recipe_title url record_health description cuisine course diet ingredients
count 8009 8009 8009 7994 7943 7854 7858 7997
unique 8009 8009 1 7989 77 13 10 7953
top Roasted Peppers And Mushroom Tortilla Pizza Recipe https://www.archanaskitchen.com/roasted-peppers-and-mushroom-tortilla-pizza-recipe good If you like this recipe, try more recipes like Indian Lunch Vegetarian Lemons|Sugar|Salt|Red Chilli powder
freq 1 1 8009 6 1336 1925 5478 3

Few columns seems to be more interesting than the rest:

Insights:

  • 4 categorical columns - category, cuisine, diet and course
  • 1 column with single value - record_health

Step 5: Inspect single column of DataFrame

df.diet.nunique()

returns the number of unique values in this column:

10
df.diet.unique()

will give us all unique values:

array(['Vegetarian', 'High Protein Vegetarian', 'Non Vegeterian',
       'No Onion No Garlic (Sattvic)', 'High Protein Non Vegetarian',
       'Diabetic Friendly', 'Eggetarian', 'Vegan', 'Gluten Free', nan,
       'Sugar Free Diet'], dtype=object)
df.diet.value_counts(dropna=False)

returns the values and their frequency:

Vegetarian                      5478
High Protein Vegetarian          792
Non Vegeterian                   447
Eggetarian                       391
Diabetic Friendly                296
High Protein Non Vegetarian      225
NaN                              151
No Onion No Garlic (Sattvic)      77
Vegan                             71
Gluten Free                       66
Sugar Free Diet                   15
Name: diet, dtype: int64

Insights:

  • 10 unique values
  • categorical column
  • Vegetarian - is the most frequent value
  • 151 - missing values

Step 6: Data conversion in Pandas

Sometimes data needs to be processed for proper analysis. For example nested data to be expanded, string needs to be converted to number or date.

We'll see two examples of data processing required for further data analysis.

6.1 Convert String to numeric

To analyze columns cook_time and prep_time as numeric we can convert them by:

df['cook_time'] = df['cook_time'].str.replace(' M', '').fillna(0).astype(int)
df['prep_time'] = df['prep_time'].str.replace(' M', '').fillna(0).astype(int)

Now we can apply visual techniques to analyze those columns for outliers - this is done in section 7.3 of this article.

6.2 Expand nested column

Next we will expand a column "ingredients" - in order to analyze ingredients for each recipe:

df['ingredients'].str.split('|', expand=True)

The result is new DataFrame with expanded values from the ingredient column:

from:

0       Tortillas|Extra Virgin Olive Oil|Garlic|Mozzar...
1       Sesame (Gingelly) Oil|Mustard seeds (Rai/ Kadu...
2       Extra Virgin Olive Oil|Pineapple|White onion|R...

to:

0 1 2 3 4
0 Tortillas Extra Virgin Olive Oil Garlic Mozzarella cheese Red Yellow or Green Bell Pepper (Capsicum)
1 Sesame (Gingelly) Oil Mustard seeds (Rai/ Kadugu) Curry leaves Garlic Pearl onions (Sambar Onions)
2 Extra Virgin Olive Oil Pineapple White onion Red Yellow and Green Bell Peppers (Capsicum) Pickled Jalapenos
3 Arhar dal (Split Toor Dal) Turmeric powder (Haldi) Salt Dry Red Chillies Mustard seeds (Rai/ Kadugu)
4 Rajma (Large Kidney Beans) Cashew nuts Sultana Raisins Asafoetida (hing) Cumin seeds (Jeera)

To find the most common 'ingredient' in the all recipes we can use the following steps:

  • split the column on separator "|"
    • expand=True - will expand data to DataFrame
  • melt() - transform DataFrame columns to rows - pairs of variable and value
  • count the values by .value_counts()
df['ingredients'].str.split('|', expand=True).melt()['value'].value_counts()

we found that salt is the most popular ingredient:

Salt                       5698
Sunflower Oil              2241
Garlic                     2227
Turmeric powder (Haldi)    2221
Onion                      2012
                           ...
नारीयल पानी                   1
नारीयल का दूध                 1
सिंघाड़े का आटा               1
Black currant (dried)         1
Corn chips                    1
Name: value, Length: 1823, dtype: int64

Step 7: Visualization for Data Analysis

Finally lets cover a few visualization techniques to find missing values, outliers and correlations.

7.1 Missing values

The first visualization technique is about finding missing values in Pandas DataFrame.

Use library Seaborn with heat map to plot missing values for the whole dataset:

  • df.isna() - returns new DataFrame with True/False values depending on the missing values in a cell
  • cmap = 'Greens' - is a color map. For more info check: Get color palette with matplotlib
sns.heatmap(df.isna(),cmap = 'Greens')

result is:

Next let's check the same visualization for the last 3000 rows:

sns.heatmap(df.tail(3000).isna(),cmap = 'Greens')

result:

From the result above we can conclude:

Insights:

  • data set has missing values in several columns
    • represented by darker green lines
  • cuisine, course and diet
    • plot suggest pattern for missing values of those columns

7.2 Correlations

Again will use Seaborn's heatmap in order to visualize correlations between two and more columns:

sns.heatmap(df.corr(),cmap='Greens',annot=False)

The first option is for the whole dataset shows - no correlation between the columns.

sns.heatmap(df[df.diet=='Vegan'].corr(),cmap='Greens',annot=True)

While the second one is showing correlations for rows with type "Vegan". We can see a correlation between columns 'prep_time' and 'cook_time'.

Insights:

  • no correlation of numerical columns for the whole DataFrame
  • correlation for - Vegan recipes - between prep_time and cook_time
  • darker color represent positive correlation
    • lighter - negative
  • to get values use annot=True

7.3 Detect outliers in Pandas DataFrame

Finally we will use boxplot to detect outliers:

cols = ['vote_count', 'rating',  'prep_time', 'cook_time']
plt.figure(figsize=(20, 5))
sns.boxplot(data=df[cols], orient='h')

The picture below suggest about outliers in the numeric columns

Insights:

  • vote_count - suggest outliers
  • cook_time - has possible outliers
  • rating - doesn't indicate outliers

7.4 Explain Seaborn boxplot and outliers

In order to get a better understanding we will work only with two columns: 'prep_time', 'cook_time' and the first 18 records.

The stats for them are:

df[['prep_time', 'cook_time']].head(18).describe()
prep_time cook_time
count 18.000000 18.000000
mean 12.222222 22.388889
std 4.608886 17.091928
min 5.000000 0.000000
25% 10.000000 11.250000
50% 10.000000 20.000000
75% 15.000000 30.000000
max 20.000000 60.000000

The picture below illustrates how to analyze a dataset for outliers with boxplot.

Boxplot has 3 main components:

  • box (Interquartile Range - IQR)
    • q1 - 25%
    • q2 - 50% or median
    • q3 - 75%
  • whiskers
    • min
    • max
  • outliers
    • points beyond the whiskers

We can notice that sometimes the whiskers don't match the min/max. They are calculated based on formulas. Upper and low whiskers are calculated as 1.5 multiplied on the interquartile range - IQR(the box).

Definition
An outlier(extreme) is a value that lies in the extremes of a data Series(on either end). They can affect overall observation and insights.

In the next article we will cover more techniques and several Python libraries for exploratory data analysis.

Conclusion

In this article, we covered the basics of Exploratory Data Analysis and discussed most of the common methods which every data scientist should know while working with datasets. Through practical examples, we learned how to explore new datasets.

We started by initial data analysis and first glance of data. After that, we made data conversion and preprocessing for further analysis. Finally, we saw multiple visual techniques in order to find missing values, outliers and correlations.