This is Pandas exercise about data analysis and exploration. It's the first exercise from a series (check the end of the post for more details).

You can test knowledge in several areas:

  • initial data analysis
  • "viewing data" - as advertised on the Pandas page: Viewing data
  • summarizing data

We can think of data exploration as getting to know your food. I think that most of us prefer to see, taste and smell the food - before eating. For me it's similar to data.

Data should be "seen, smelled and tasted".

Setup

In this tutorial we will work with the following dataset: The Movies Dataset.

The dataset contains data for 45,000 movies. Data points include cast, crew, plot keywords, revenue, release dates, languages, production companies, countries, vote averages and more.

We will focus on the file: movies_metadata.csv. Data from the CSV file can be read as follows:

import pandas as pd
df = pd.read_csv('movies_metadata.csv', low_memory=False)

1: How many rows and columns?

Usually the first thing which I do is to get the DataFrame dimension.

It will give us a rough idea about the size of the DataFrame. Why rough? Because you may have nested hierarchical data - multilevel columns.

df.shape

This will give us:

(45466, 24)

This might save you time and effort. Sometimes it is good to start with smaller data samples.

Bonus question 1

Can you guess how many rows and columns?

df.shape[0]

Bonus Question 2

Can you name synonyms or other names of columns and rows?

rows - observations, records, trials
columns - variable, feature

2: View the top and bottom rows

How can we see the first N records of the dataset? What about the last N rows?

- top rows
df.head()
- last rows
df.tail()

Bonus question 3

How can we see bottom and top rows simultaneously in Pandas?

import numpy as np
df.iloc[np.r_[0:5, -5:0]]

Did you solve that one? Good job, it was a tricky one!

Bonus question 4

Do you know why there are parentheses and sometimes not?
like:

df.head()

and:

df.shape
The reason is because
head()
is a method and should be with parentheses. While
shape
is an attribute - so it should be without parentheses

3: Display the index and columns

Can you display the column names? What about index/row names?

df.columns
df.index

We will get about the columns:

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
   'imdb_id', 'original_language', 'original_title', 'overview',
   'popularity', 'poster_path', 'production_companies',
   'production_countries', 'release_date', 'revenue', 'runtime',
   'spoken_languages', 'status', 'tagline', 'title', 'video',
   'vote_average', 'vote_count'],
  dtype='object')

and for the index:

RangeIndex(start=0, stop=45466, step=1)

4: Generate descriptive statistics

What about getting a quick statistical summary for a DataFrame in Pandas? This would give us valuable information about:

  • count
  • mean
  • standard deviation
  • percentile
df.describe()

This would give us:

revenue runtime vote_average vote_count
count 4.546000e+04 45203.000000 45460.000000 45460.000000
mean 1.120935e+07 94.128199 5.618207 109.897338
std 6.433225e+07 38.407810 1.924216 491.310374
min 0.000000e+00 0.000000 0.000000 0.000000
25% 0.000000e+00 85.000000 5.000000 3.000000
50% 0.000000e+00 95.000000 6.000000 10.000000
75% 0.000000e+00 107.000000 6.800000 34.000000
max 2.787965e+09 1256.000000 10.000000 14075.000000

The method will analyze both numeric and object series, as well as DataFrame column sets of mixed data types.

5: What are the dtypes?

What are the data types of the columns in a DataFrame. Data types might be several kinds like:

  • numeric
  • datetime
  • string/object
  • nested data - dictionary, list or json
  • categorical

It can help us when we work with our data. Think of it like knowing the OS of the computer. Working on Linux differs from Windows? At least for me.

So how to get data types?

df.dtypes

This would give us:

adult                     object
belongs_to_collection     object
budget                    object
genres                    object
homepage                  object
id                        object
imdb_id                   object
original_language         object
original_title            object
overview                  object
popularity                object
poster_path               object
production_companies      object
production_countries      object
release_date              object
revenue                  float64
runtime                  float64
spoken_languages          object
status                    object
tagline                   object
title                     object
video                     object
vote_average             float64
vote_count               float64
dtype: object

6: Transpose rows

How can we see the first N records of the dataset - transposed?

I like to transpose rows when I'm working with few rows with many columns.

The first few rows looks like that(transposed):

0 1
adult False False
belongs_to_collection {'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'} NaN
budget 30000000 65000000
genres [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}] [{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]
homepage http://toystory.disney.com/toy-story NaN
id 862 8844
imdb_id tt0114709 tt0113497
original_language en en
original_title Toy Story Jumanji
overview Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences. When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.
popularity 21.946943 17.015539
poster_path /rhIRbceoE9lR4veEXuwCC2wARtG.jpg /vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
production_companies [{'name': 'Pixar Animation Studios', 'id': 3}] [{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Teitler Film', 'id': 2550}, {'name': 'Interscope Communications', 'id': 10201}]
production_countries [{'iso_3166_1': 'US', 'name': 'United States of America'}] [{'iso_3166_1': 'US', 'name': 'United States of America'}]
release_date 1995-10-30 1995-12-15
revenue 373554033.0 262797249.0
runtime 81.0 104.0
spoken_languages [{'iso_639_1': 'en', 'name': 'English'}] [{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]
status Released Released
tagline NaN Roll the dice and unleash the excitement!
title Toy Story Jumanji
video False False
vote_average 7.7 6.9
vote_count 5415.0 2413.0

And transposing rows is possible by:

df.head(2).T

Which is the alias of method transpose(). This is another thing to have in mind working with Pandas - aliases.

7: Sort values

Can you get the top most rated movies from this dataset? To do so we will need to sort data by a column.

So the last exercise is to get the names and number of votes for the most voted movies in that dataset.

df.sort_values(by='vote_count', ascending=False)[['title', 'vote_count']].head()

Bonus question 5

Can you find the top 7 values for the numerical columns? Return the result as a DataFrame:

revenue runtime vote_average vote_count
0 2.787965e+09 1256.0 10.0 14075.0
1 2.068224e+09 1140.0 10.0 12269.0
2 1.845034e+09 1140.0 10.0 12114.0
3 1.519558e+09 931.0 10.0 12000.0
4 1.513529e+09 925.0 10.0 11444.0
5 1.506249e+09 900.0 10.0 11187.0
6 1.405404e+09 877.0 10.0 10297.0

Can you optimize the proposed solution below?


from pandas.api.types import is_numeric_dtype

dfs = []

for col in df.columns:
    top_values = []
    if is_numeric_dtype(df[col]):
        top_values = df[col].nlargest(n=7)
        dfs.append(pd.DataFrame({col: top_values}).reset_index(drop=True))
pd.concat(dfs, axis=1)

Were you able to solve it? Well done!

Bonus question 6

Method describe() is good for initial exploratory data analysis. For more details like:

  • type
  • unique values
  • missing values
  • most frequent values
  • histogram and more

Can you research and find Python library which can do "serious exploratory data analysis"?

We can use library: pandas-profiling · PyPI
ProfileReport(df, title="Pandas Profiling Report")

Conclusion

Congratulations! You have passed the first set of Pandas exercises. We've tasted the movie dataset.

Now you will have a better understanding of your data and how to do initial data analysis.

In this post, we learned about initial data analysis in Pandas. Most popular methods in Pandas for "viewing data".

We saw some advanced techniques and tricks very useful when data is touched for the first time.

Pandas exercises for beginners

Pandas exercises will be structured as a sample project plus exercises - a place where we can practice data science.

The goal is to test your knowledge and learn in 3 main areas:

  • data cleaning
  • data analysis
  • data collection

The areas above are often reported as problematic in data science.

Data preparation is blamed to consume up to 80% from Data science projects. Dirty data is one of the main factors for failed projects in data science and machine learning.