This is Pandas exercise about data analysis and exploration. It's the first exercise from a series (check the end of the post for more details).
You can test knowledge in several areas:
- initial data analysis
- "viewing data" - as advertised on the Pandas page: Viewing data
- summarizing data
We can think of data exploration as getting to know your food. I think that most of us prefer to see, taste and smell the food - before eating. For me it's similar to data.
Setup
In this tutorial we will work with the following dataset: The Movies Dataset.
The dataset contains data for 45,000 movies. Data points include cast, crew, plot keywords, revenue, release dates, languages, production companies, countries, vote averages and more.
We will focus on the file: movies_metadata.csv
. Data from the CSV file can be read as follows:
import pandas as pd
df = pd.read_csv('movies_metadata.csv', low_memory=False)
1: How many rows and columns?
Usually the first thing which I do is to get the DataFrame dimension.
It will give us a rough idea about the size of the DataFrame. Why rough? Because you may have nested hierarchical data - multilevel columns.
df.shape
This will give us:
(45466, 24)
This might save you time and effort. Sometimes it is good to start with smaller data samples.
Bonus question 1
Can you guess how many rows and columns?
df.shape[0]
Bonus Question 2
Can you name synonyms or other names of columns and rows?
2: View the top and bottom rows
How can we see the first N records of the dataset? What about the last N rows?
df.head()
- last rowsdf.tail()
Bonus question 3
How can we see bottom and top rows simultaneously in Pandas?
import numpy as np
df.iloc[np.r_[0:5, -5:0]]
Did you solve that one? Good job, it was a tricky one!
Bonus question 4
Do you know why there are parentheses and sometimes not?
like:
df.head()
and:
df.shape
head()
is a method and should be with parentheses. While
shape
is an attribute - so it should be without parentheses
3: Display the index and columns
Can you display the column names? What about index/row names?
df.columns
df.index
We will get about the columns:
Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
'imdb_id', 'original_language', 'original_title', 'overview',
'popularity', 'poster_path', 'production_companies',
'production_countries', 'release_date', 'revenue', 'runtime',
'spoken_languages', 'status', 'tagline', 'title', 'video',
'vote_average', 'vote_count'],
dtype='object')
and for the index:
RangeIndex(start=0, stop=45466, step=1)
4: Generate descriptive statistics
What about getting a quick statistical summary for a DataFrame in Pandas? This would give us valuable information about:
- count
- mean
- standard deviation
- percentile
df.describe()
This would give us:
revenue | runtime | vote_average | vote_count | |
---|---|---|---|---|
count | 4.546000e+04 | 45203.000000 | 45460.000000 | 45460.000000 |
mean | 1.120935e+07 | 94.128199 | 5.618207 | 109.897338 |
std | 6.433225e+07 | 38.407810 | 1.924216 | 491.310374 |
min | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000e+00 | 85.000000 | 5.000000 | 3.000000 |
50% | 0.000000e+00 | 95.000000 | 6.000000 | 10.000000 |
75% | 0.000000e+00 | 107.000000 | 6.800000 | 34.000000 |
max | 2.787965e+09 | 1256.000000 | 10.000000 | 14075.000000 |
The method will analyze both numeric and object series, as well as DataFrame column sets of mixed data types.
5: What are the dtypes?
What are the data types of the columns in a DataFrame. Data types might be several kinds like:
- numeric
- datetime
- string/object
- nested data - dictionary, list or json
- categorical
It can help us when we work with our data. Think of it like knowing the OS of the computer. Working on Linux differs from Windows? At least for me.
So how to get data types?
df.dtypes
This would give us:
adult object
belongs_to_collection object
budget object
genres object
homepage object
id object
imdb_id object
original_language object
original_title object
overview object
popularity object
poster_path object
production_companies object
production_countries object
release_date object
revenue float64
runtime float64
spoken_languages object
status object
tagline object
title object
video object
vote_average float64
vote_count float64
dtype: object
6: Transpose rows
How can we see the first N records of the dataset - transposed?
I like to transpose rows when I'm working with few rows with many columns.
The first few rows looks like that(transposed):
0 | 1 | |
---|---|---|
adult | False | False |
belongs_to_collection | {'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'} | NaN |
budget | 30000000 | 65000000 |
genres | [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}] | [{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}] |
homepage | http://toystory.disney.com/toy-story | NaN |
id | 862 | 8844 |
imdb_id | tt0114709 | tt0113497 |
original_language | en | en |
original_title | Toy Story | Jumanji |
overview | Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences. | When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures. |
popularity | 21.946943 | 17.015539 |
poster_path | /rhIRbceoE9lR4veEXuwCC2wARtG.jpg | /vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg |
production_companies | [{'name': 'Pixar Animation Studios', 'id': 3}] | [{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Teitler Film', 'id': 2550}, {'name': 'Interscope Communications', 'id': 10201}] |
production_countries | [{'iso_3166_1': 'US', 'name': 'United States of America'}] | [{'iso_3166_1': 'US', 'name': 'United States of America'}] |
release_date | 1995-10-30 | 1995-12-15 |
revenue | 373554033.0 | 262797249.0 |
runtime | 81.0 | 104.0 |
spoken_languages | [{'iso_639_1': 'en', 'name': 'English'}] | [{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}] |
status | Released | Released |
tagline | NaN | Roll the dice and unleash the excitement! |
title | Toy Story | Jumanji |
video | False | False |
vote_average | 7.7 | 6.9 |
vote_count | 5415.0 | 2413.0 |
And transposing rows is possible by:
df.head(2).T
Which is the alias of method transpose()
. This is another thing to have in mind working with Pandas - aliases.
7: Sort values
Can you get the top most rated movies from this dataset? To do so we will need to sort data by a column.
So the last exercise is to get the names and number of votes for the most voted movies in that dataset.
df.sort_values(by='vote_count', ascending=False)[['title', 'vote_count']].head()
Bonus question 5
Can you find the top 7 values for the numerical columns? Return the result as a DataFrame:
revenue | runtime | vote_average | vote_count | |
---|---|---|---|---|
0 | 2.787965e+09 | 1256.0 | 10.0 | 14075.0 |
1 | 2.068224e+09 | 1140.0 | 10.0 | 12269.0 |
2 | 1.845034e+09 | 1140.0 | 10.0 | 12114.0 |
3 | 1.519558e+09 | 931.0 | 10.0 | 12000.0 |
4 | 1.513529e+09 | 925.0 | 10.0 | 11444.0 |
5 | 1.506249e+09 | 900.0 | 10.0 | 11187.0 |
6 | 1.405404e+09 | 877.0 | 10.0 | 10297.0 |
Can you optimize the proposed solution below?
from pandas.api.types import is_numeric_dtype
dfs = []
for col in df.columns:
top_values = []
if is_numeric_dtype(df[col]):
top_values = df[col].nlargest(n=7)
dfs.append(pd.DataFrame({col: top_values}).reset_index(drop=True))
pd.concat(dfs, axis=1)
Were you able to solve it? Well done!
Bonus question 6
Method describe()
is good for initial exploratory data analysis. For more details like:
- type
- unique values
- missing values
- most frequent values
- histogram and more
Can you research and find Python library which can do "serious exploratory data analysis"?
ProfileReport(df, title="Pandas Profiling Report")
Conclusion
Congratulations! You have passed the first set of Pandas exercises. We've tasted the movie dataset.
Now you will have a better understanding of your data and how to do initial data analysis.
In this post, we learned about initial data analysis in Pandas. Most popular methods in Pandas for "viewing data".
We saw some advanced techniques and tricks very useful when data is touched for the first time.
Pandas exercises for beginners
Pandas exercises will be structured as a sample project plus exercises - a place where we can practice data science.
The goal is to test your knowledge and learn in 3 main areas:
- data cleaning
- data analysis
- data collection
The areas above are often reported as problematic in data science.