Pandas Exercises - View and Explore Data

This is Pandas exercise about data analysis and exploration. It's the first exercise from a series (check the end of the post for more details).

You can test knowledge in several areas:

initial data analysis
"viewing data" - as advertised on the Pandas page: Viewing data
summarizing data

We can think of data exploration as getting to know your food. I think that most of us prefer to see, taste and smell the food - before eating. For me it's similar to data.

Data should be "seen, smelled and tasted".

Setup

In this tutorial we will work with the following dataset: The Movies Dataset.

The dataset contains data for 45,000 movies. Data points include cast, crew, plot keywords, revenue, release dates, languages, production companies, countries, vote averages and more.

We will focus on the file: movies_metadata.csv. Data from the CSV file can be read as follows:

import pandas as pd
df = pd.read_csv('movies_metadata.csv', low_memory=False)

1: How many rows and columns?

Usually the first thing which I do is to get the DataFrame dimension.

It will give us a rough idea about the size of the DataFrame. Why rough? Because you may have nested hierarchical data - multilevel columns.

df.shape

This will give us:

(45466, 24)

This might save you time and effort. Sometimes it is good to start with smaller data samples.

Bonus question 1

Can you guess how many rows and columns?

df.shape[0]

Bonus Question 2

Can you name synonyms or other names of columns and rows?

rows - observations, records, trials

columns - variable, feature

2: View the top and bottom rows

How can we see the first N records of the dataset? What about the last N rows?

- top rows

df.head()

- last rows

df.tail()

Bonus question 3

How can we see bottom and top rows simultaneously in Pandas?

import numpy as np
df.iloc[np.r_[0:5, -5:0]]

Did you solve that one? Good job, it was a tricky one!

Bonus question 4

Do you know why there are parentheses and sometimes not?
like:

df.head()

and:

df.shape

The reason is because

head()

is a method and should be with parentheses. While

shape

is an attribute - so it should be without parentheses

3: Display the index and columns

Can you display the column names? What about index/row names?

df.columns

df.index

We will get about the columns:

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
   'imdb_id', 'original_language', 'original_title', 'overview',
   'popularity', 'poster_path', 'production_companies',
   'production_countries', 'release_date', 'revenue', 'runtime',
   'spoken_languages', 'status', 'tagline', 'title', 'video',
   'vote_average', 'vote_count'],
  dtype='object')

and for the index:

RangeIndex(start=0, stop=45466, step=1)

4: Generate descriptive statistics

What about getting a quick statistical summary for a DataFrame in Pandas? This would give us valuable information about:

count
mean
standard deviation
percentile

df.describe()

This would give us:

	revenue	runtime	vote_average	vote_count
count	4.546000e+04	45203.000000	45460.000000	45460.000000
mean	1.120935e+07	94.128199	5.618207	109.897338
std	6.433225e+07	38.407810	1.924216	491.310374
min	0.000000e+00	0.000000	0.000000	0.000000
25%	0.000000e+00	85.000000	5.000000	3.000000
50%	0.000000e+00	95.000000	6.000000	10.000000
75%	0.000000e+00	107.000000	6.800000	34.000000
max	2.787965e+09	1256.000000	10.000000	14075.000000

The method will analyze both numeric and object series, as well as DataFrame column sets of mixed data types.

5: What are the dtypes?

What are the data types of the columns in a DataFrame. Data types might be several kinds like:

numeric
datetime
string/object
nested data - dictionary, list or json
categorical

It can help us when we work with our data. Think of it like knowing the OS of the computer. Working on Linux differs from Windows? At least for me.

So how to get data types?

df.dtypes

This would give us:

adult                     object
belongs_to_collection     object
budget                    object
genres                    object
homepage                  object
id                        object
imdb_id                   object
original_language         object
original_title            object
overview                  object
popularity                object
poster_path               object
production_companies      object
production_countries      object
release_date              object
revenue                  float64
runtime                  float64
spoken_languages          object
status                    object
tagline                   object
title                     object
video                     object
vote_average             float64
vote_count               float64
dtype: object

6: Transpose rows

How can we see the first N records of the dataset - transposed?

I like to transpose rows when I'm working with few rows with many columns.

The first few rows looks like that(transposed):

	0	1
adult	False	False
belongs_to_collection	{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}	NaN
budget	30000000	65000000
genres	[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]	[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]
homepage	http://toystory.disney.com/toy-story	NaN
id	862	8844
imdb_id	tt0114709	tt0113497
original_language	en	en
original_title	Toy Story	Jumanji
overview	Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.	When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.
popularity	21.946943	17.015539
poster_path	/rhIRbceoE9lR4veEXuwCC2wARtG.jpg	/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
production_companies	[{'name': 'Pixar Animation Studios', 'id': 3}]	[{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Teitler Film', 'id': 2550}, {'name': 'Interscope Communications', 'id': 10201}]
production_countries	[{'iso_3166_1': 'US', 'name': 'United States of America'}]	[{'iso_3166_1': 'US', 'name': 'United States of America'}]
release_date	1995-10-30	1995-12-15
revenue	373554033.0	262797249.0
runtime	81.0	104.0
spoken_languages	[{'iso_639_1': 'en', 'name': 'English'}]	[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]
status	Released	Released
tagline	NaN	Roll the dice and unleash the excitement!
title	Toy Story	Jumanji
video	False	False
vote_average	7.7	6.9
vote_count	5415.0	2413.0

And transposing rows is possible by:

df.head(2).T

Which is the alias of method transpose(). This is another thing to have in mind working with Pandas - aliases.

7: Sort values

Can you get the top most rated movies from this dataset? To do so we will need to sort data by a column.

So the last exercise is to get the names and number of votes for the most voted movies in that dataset.

df.sort_values(by='vote_count', ascending=False)[['title', 'vote_count']].head()

Bonus question 5

Can you find the top 7 values for the numerical columns? Return the result as a DataFrame:

	revenue	runtime	vote_average	vote_count
0	2.787965e+09	1256.0	10.0	14075.0
1	2.068224e+09	1140.0	10.0	12269.0
2	1.845034e+09	1140.0	10.0	12114.0
3	1.519558e+09	931.0	10.0	12000.0
4	1.513529e+09	925.0	10.0	11444.0
5	1.506249e+09	900.0	10.0	11187.0
6	1.405404e+09	877.0	10.0	10297.0

Can you optimize the proposed solution below?


from pandas.api.types import is_numeric_dtype

dfs = []

for col in df.columns:
    top_values = []
    if is_numeric_dtype(df[col]):
        top_values = df[col].nlargest(n=7)
        dfs.append(pd.DataFrame({col: top_values}).reset_index(drop=True))
pd.concat(dfs, axis=1)

Were you able to solve it? Well done!

Bonus question 6

Method describe() is good for initial exploratory data analysis. For more details like:

type
unique values
missing values
most frequent values
histogram and more

Can you research and find Python library which can do "serious exploratory data analysis"?

We can use library: pandas-profiling · PyPI

ProfileReport(df, title="Pandas Profiling Report")

Conclusion

Congratulations! You have passed the first set of Pandas exercises. We've tasted the movie dataset.

Now you will have a better understanding of your data and how to do initial data analysis.

In this post, we learned about initial data analysis in Pandas. Most popular methods in Pandas for "viewing data".

We saw some advanced techniques and tricks very useful when data is touched for the first time.

Pandas exercises for beginners

Pandas exercises will be structured as a sample project plus exercises - a place where we can practice data science.

The goal is to test your knowledge and learn in 3 main areas:

data cleaning
data analysis
data collection

The areas above are often reported as problematic in data science.

Data preparation is blamed to consume up to 80% from Data science projects. Dirty data is one of the main factors for failed projects in data science and machine learning.

> Basic concepts

> Installations

> Series

> DataFrame

> Create

> Data Types

> Exercise

> Cheat Sheet

> Basic concepts

> Row

> Column

> Index

> MultiIndex

> Exercise

> Basic concepts

> read_csv()

> read_excel()

> Kaggle

> Exercise

> read_xml()

> read_json()

> to_csv()

> to_dict()

> to_json()

> Basic concepts

> groupby()

> Reshape

> melt()

> Exercise

> Pivot

> merge()

> Filter

> Basic concepts

> replace()

> split()

> Regex

> Search

> Exercise

> Find

> Basic concepts

> apply()

> aggfunc

> Convert

> count()

> Other

> Exercise

> map()

> Basic concepts

> Data Validation

> Data Cleaning

> Duplicate

> Time Series

> Pandas Error

> Get

> Basic concepts

> Styling

> Table

> Display

> DataIsBeautiful

> Beginners

> Data Science Projects

> Newsletter

Setup

1: How many rows and columns?

Bonus question 1

Bonus Question 2

2: View the top and bottom rows

Bonus question 3

Bonus question 4

3: Display the index and columns

4: Generate descriptive statistics

5: What are the dtypes?

6: Transpose rows

7: Sort values

Bonus question 5

Bonus question 6

Conclusion

Pandas exercises for beginners