Free Public Datasets for Data Science Projects

This post collects free public datasets for Data Science projects. The datasets cover different areas: machine learning, presentation, data analysis and visualization.

You can find information for:

Data sources - large dataset collections with curated data and advanced search
Sample datasets - datasets well suited to data analysis; a starting point for beginners who would like to learn Data Science
Datasets resources - useful resources for datasets which can be loaded easily
Datasets from Python libraries - load datasets with a single line of code from different Python libraries like 'seaborn'

Note
This post will be updated on a regular basis, so please suggest new ideas and datasets in the comment section below.

1. Dataset Sources

Below we can find a table of dataset collections. Most of them have advanced searching by:

file type
size
number of rows
tags

#	site	description
1	https://www.kaggle.com/datasets/	https://www.kaggle.com/docs/datasets
2	https://datasetsearch.research.google.com/	Google Dataset Search
3	https://azure.microsoft.com/en-us/services/open-datasets/	Azure Open Datasets
4	https://www.openml.org/search?type=data	4325 datasets found (verified)
5	https://grouplens.org/datasets/	several datasets
6	https://datahub.io/search	thousands of datasets

Other sources:

World Bank Open Data

2. Datasets samples

Several datasets used on this site to demonstrate data science basics are listed below:

#	dataset	size (rows, columns)	link	description
1	the-movies-dataset	45466, 24	https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset	https://grouplens.org/datasets/movielens/latest/
2	Food Recipes	8009, 16	https://www.kaggle.com/datasets/sarthak71/food-recipes

3. Datasets resources

4. Read Kaggle Datasets

To read Kaggle datasets we can use the Python library kaggle. A single file from a dataset can be downloaded with the method dataset_download_file:

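import kaggle

# Credentials are read from ~/.kaggle/kaggle.json (or the KAGGLE_USERNAME / KAGGLE_KEY environment variables)
kaggle.api.authenticate()
# Download a single file from the dataset into the local folder data/
kaggle.api.dataset_download_file('dorianlazar/medium-articles-dataset', file_name='medium_data.csv', path='data/')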

For more information and examples refer to: How to Search and Download Kaggle Dataset to Pandas DataFrame
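
Once the file is downloaded it still has to be read into Pandas. The helper below is a minimal sketch of that step, not part of the kaggle library: the function name read_downloaded_csv is hypothetical, and it assumes the file may arrive either as a plain CSV or as a zip archive named after the file (which can happen depending on the client version and the file).

```python
from pathlib import Path

import pandas as pd


def read_downloaded_csv(folder, file_name):
    """Read a downloaded CSV into a DataFrame, whether or not it is still zipped.

    Hypothetical helper: checks for `<folder>/<file_name>` first and then
    for `<folder>/<file_name>.zip`.
    """
    folder = Path(folder)
    plain = folder / file_name
    zipped = folder / (file_name + '.zip')

    if plain.exists():
        return pd.read_csv(plain)
    if zipped.exists():
        # Pandas infers zip compression from the extension;
        # the archive must contain exactly one file.
        return pd.read_csv(zipped)
    raise FileNotFoundError(f'{file_name} not found in {folder}')
```

With the download from the example above, the call would be read_downloaded_csv('data/', 'medium_data.csv').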

5. Load Datasets by Python libraries

In this section we can find several useful datasets for different purposes like:

machine learning
visualization
testing
creating own datasets with fake data

5.1 datasets - machine learning

The Python library datasets (from Hugging Face) offers a huge number of free and easy to use datasets. It can be installed by:

pip install datasets

To list all available datasets we can use the method datasets.list_datasets():

from datasets import list_datasets, load_dataset
print(list_datasets())

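At the time of writing this returned more than 7000 dataset names. Newer releases of the library deprecate list_datasets in favor of huggingface_hub.list_datasets, so the call above may no longer be available after an upgrade.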

To load a dataset we can use the method datasets.load_dataset(dataset_name, **kwargs):

squad_dataset = load_dataset('squad')
squad_dataset

This gives us two datasets:

training dataset
validation dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

To access the dataset for training we can use: squad_dataset['train'].

Finally, we can load it as a Pandas DataFrame by:

import pandas as pd
pd.DataFrame(squad_dataset['train'])

5.2 pandas - test datasets

Pandas offers multiple ways to download datasets with a single line of code. Let's cover a few of them, starting with the test data in the Pandas GitHub repository:

Pandas data files - csv, xml, html
- tips.csv

scrape wiki tables

Next, we can load data with Pandas by scraping tables from Wikipedia:

pd.read_html('https://en.wikipedia.org/wiki/Population_growth')[2]

create datasets

We can create random or fake datasets with Pandas by:
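
One possible way to do this is sketched below, assuming NumPy is installed alongside Pandas. The column names, value ranges and row count are hypothetical and chosen only for illustration; the seeded generator makes the random values reproducible between runs.

```python
import numpy as np
import pandas as pd

# Seeded generator so the "random" data is the same on every run
rng = np.random.default_rng(seed=42)
n_rows = 100

df_fake = pd.DataFrame({
    # Sequential identifier: 1..n_rows
    'id': np.arange(1, n_rows + 1),
    # One row per calendar day
    'date': pd.date_range('2024-01-01', periods=n_rows, freq='D'),
    # Categorical column drawn uniformly from three labels
    'category': rng.choice(['A', 'B', 'C'], size=n_rows),
    # Normally distributed numeric column (mean 50, std 10)
    'value': rng.normal(loc=50, scale=10, size=n_rows).round(2),
    # Boolean flag, True with probability of about 0.5
    'is_active': rng.random(n_rows) < 0.5,
})

print(df_fake.head())
```

Changing n_rows or the column definitions is enough to produce a test dataset of a different size or shape.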

5.3 seaborn - visualization datasets

Seaborn offers free test datasets which are good for visualization. With a single line of code we can get a DataFrame suitable for data wrangling and visualization:

import seaborn as sns
df = sns.load_dataset('flights')

All datasets available from seaborn library: seaborn-data.

scikit-learn - machine learning

We can get sample datasets from scikit-learn (imported as sklearn) with functions like load_iris:

from sklearn.datasets import load_iris
iris = load_iris()
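
load_iris returns a Bunch object rather than a DataFrame. As a minimal sketch, the as_frame=True parameter asks scikit-learn for Pandas objects instead; the extra species column added here is an illustrative assumption and not part of the original dataset layout.

```python
from sklearn.datasets import load_iris

# as_frame=True makes the returned Bunch carry Pandas objects
iris = load_iris(as_frame=True)

# .frame holds the four feature columns plus the numeric 'target' column
df_iris = iris.frame.copy()

# Illustrative extra column: map numeric targets (0, 1, 2) to species names
target_to_name = dict(enumerate(iris.target_names))
df_iris['species'] = df_iris['target'].map(target_to_name)

print(df_iris.head())
```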

To find more sample datasets from sklearn we can use the following code:

from sklearn import datasets
dir(datasets)

This will list all available options like:

'load_sample_images',
 'load_svmlight_file',
 'load_svmlight_files',
 'load_wine',
 'make_biclusters',
 'make_blobs',
 'make_checkerboard',
 'make_circles',
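
The names in that listing follow a prefix convention: load_ functions return small datasets bundled with the library, fetch_ functions download larger datasets on demand, and make_ functions generate synthetic data. The sketch below groups the output of dir(datasets) by those prefixes; the variable names are arbitrary.

```python
from sklearn import datasets

# Prefixes used by scikit-learn for its dataset helpers
prefixes = ('load_', 'fetch_', 'make_')

# Group every public name in sklearn.datasets by its prefix
grouped = {
    prefix: [name for name in dir(datasets) if name.startswith(prefix)]
    for prefix in prefixes
}

for prefix, names in grouped.items():
    # Show how many helpers each group has, with a few examples
    print(prefix, len(names), names[:3])
```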

You can find more about scikit-learn datasets at this link: sklearn.datasets: Datasets.

5.4 dataprep - data analysis

To load a dataset from dataprep we can use the method load_dataset:

from dataprep.datasets import load_dataset
df = load_dataset("titanic")

To list the available datasets we can use:

from dataprep.datasets import get_dataset_names
get_dataset_names()

which returns several dataset names like:

['waste_hauler',
 'wine-quality-red',
 'countries',
 'house_prices_train',
 'iris',
 'adult',
 'covid19',
 'titanic',
 'patient_info',
 'house_prices_test']

More information about dataprep datasets: Datasets DataPrep

6. Top 10 sites with interesting datasets

The table below is based on this Kaggle list: Top 10 sites with interesting datasets

| Dataset Source                 | Description |
|--------------------------------|-------------|
| **UCI Machine Learning Repository** | Collection of datasets for research and education. |
| **Google Dataset Search** | Search engine for datasets across various domains. |
| **Data.gov** | U.S. government's open data portal. |
| **World Bank Open Data** | Global development datasets on education, health, and more. |
| **Reddit's r/datasets** | Community-driven subreddit for sharing datasets. |
| **AWS Public Datasets** | Large public datasets hosted by AWS. |
| **FiveThirtyEight** | Data journalism site with political, sports, and economic datasets. |
| **Data.gov.uk** | UK government's open data portal. |
| **DataHub.io** | Platform for open datasets across multiple fields. |
| **Data.world** | Collaborative platform for dataset discovery and sharing. |


7. Conclusion

In this article, we covered free dataset sources and discussed common ways to download datasets from them. Through practical examples, we learned how to download and use those datasets in Python and Pandas.

We also covered different Python libraries which offer public datasets for learning. Finally, we covered how to create test datasets with fake data.

These datasets and ideas should be sufficient for practicing and learning data science.
