In this post you can find free public datasets for Data Science projects. There is a large number of datasets covering different areas: machine learning, presentation, data analysis and visualization.

You can find information on:

  • Data sources - large dataset collections with curated data and advanced search
  • Sample datasets - datasets well suited for data analysis; a starting point for beginners who would like to learn Data Science
  • Datasets resources - useful resources for datasets which can be loaded easily
  • Datasets from Python libraries - load datasets with a single line of code from different Python libraries like 'seaborn'
Note
This post will be updated on a regular basis, so please suggest new ideas and datasets in the comment section below.

1. Dataset Sources

Below is a table of dataset collections. Most of them offer advanced search by:

  • file type
  • size
  • number of rows
  • tags

# | site | description
1 | https://www.kaggle.com/datasets/ | https://www.kaggle.com/docs/datasets
2 | https://datasetsearch.research.google.com/ | Google Dataset Search
3 | https://azure.microsoft.com/en-us/services/open-datasets/ | Azure Open Datasets
4 | https://www.openml.org/search?type=data | 4325 datasets found (verified)
5 | https://grouplens.org/datasets/ | several datasets
6 | https://datahub.io/search | thousands of datasets

Other sources:

2. Datasets samples

Several datasets used on this site to demonstrate data science basics are listed below:

# | dataset | size (rows, columns) | link | description
1 | the-movies-dataset | 45466, 24 | https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset | https://grouplens.org/datasets/movielens/latest/
2 | Food Recipes | 8009, 16 | https://www.kaggle.com/datasets/sarthak71/food-recipes |

3. Datasets resources

4. Read Kaggle Datasets

To read Kaggle datasets we can use the Python library kaggle. A single file from a dataset can be downloaded with the method dataset_download_file:

import kaggle

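# Authenticate with your Kaggle API credentials (for example ~/.kaggle/kaggle.json)
kaggle.api.authenticate()
# Download a single file from the dataset into the local folder data/
kaggle.api.dataset_download_file('dorianlazar/medium-articles-dataset', file_name='medium_data.csv', path='data/')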
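
Once the file is in data/, it can be read into a Pandas DataFrame. The sketch below is one possible way to do that, under the assumption that the file may arrive either as a plain CSV or as a zipped CSV (file name plus '.zip'), which can depend on the kaggle client version and the file size. The helper name read_downloaded_csv is hypothetical and not part of the kaggle library:

```python
from pathlib import Path

import pandas as pd


def read_downloaded_csv(folder, file_name):
    """Read file_name from folder, falling back to file_name + '.zip'.

    Hypothetical helper: it only looks at the local folder and never
    contacts Kaggle.
    """
    folder = Path(folder)
    for candidate in (folder / file_name, folder / f"{file_name}.zip"):
        if candidate.exists():
            # pandas infers zip compression from the '.zip' suffix
            return pd.read_csv(candidate)
    raise FileNotFoundError(f"{file_name} (or its .zip) not found in {folder}")
```

After the download above, a call such as read_downloaded_csv('data/', 'medium_data.csv') should return the DataFrame.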

For more information and examples refer to: How to Search and Download Kaggle Dataset to Pandas DataFrame

5. Load Datasets by Python libraries

In this section we can find several useful datasets for different purposes like:

  • machine learning
  • visualization
  • testing
  • creating own datasets with fake data

5.1 datasets - machine learning

The Python library datasets (from Hugging Face) offers a huge number of free and easy to use datasets. It can be installed by:

pip install datasets

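To list all available datasets we can use the method datasets.list_datasets(). Note that this helper was deprecated in later releases of the library in favor of huggingface_hub.list_datasets, so the import below may fail on recent versions: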

from datasets import list_datasets, load_dataset
print(list_datasets())

At the time of writing, it returned more than 7000 datasets.

To load a dataset we can use the method datasets.load_dataset(dataset_name, **kwargs):

squad_dataset = load_dataset('squad')
squad_dataset

This gives us two datasets:

  • training dataset
  • validation dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

To access the dataset for training we can use: squad_dataset['train'].

Finally, we can load it as a Pandas DataFrame by:

import pandas as pd
pd.DataFrame(squad_dataset['train'])
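
After this conversion the answers column should hold one nested dictionary per row (with text and answer_start lists), which is awkward to filter on. A minimal sketch of flattening such a column with pd.json_normalize follows. The two rows are hand-written stand-ins that imitate the SQuAD layout, not real records from the dataset:

```python
import pandas as pd

# Hand-written rows imitating the SQuAD structure (illustrative only)
rows = [
    {"id": "r1", "question": "Which city?",
     "answers": {"text": ["Paris"], "answer_start": [10]}},
    {"id": "r2", "question": "Which year?",
     "answers": {"text": ["1969"], "answer_start": [42]}},
]

# Plain conversion keeps the whole dict in a single 'answers' column
nested = pd.DataFrame(rows)

# json_normalize expands the dict into 'answers.text' and 'answers.answer_start'
flat = pd.json_normalize(rows)
print(flat.columns.tolist())
```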

5.2 pandas - test datasets

Pandas offers multiple ways to download datasets with a single line of code. Let's cover a few of them, starting with the test data in the Pandas GitHub repository:
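
A minimal sketch of reading one of those test files straight from GitHub with pd.read_csv. The base URL below is an assumption about where the CSV fixtures currently live in the pandas repository and may change over time; the helper names are hypothetical:

```python
import pandas as pd

# Assumed location of the CSV fixtures in the pandas repository; may change.
PANDAS_TEST_DATA = (
    "https://raw.githubusercontent.com/pandas-dev/pandas/main/"
    "pandas/tests/io/data/csv/"
)


def pandas_test_csv_url(file_name):
    """Build the raw GitHub URL for a CSV file in the pandas test data."""
    return PANDAS_TEST_DATA + file_name


def load_pandas_test_csv(file_name):
    """Download the file and parse it into a DataFrame (needs network access)."""
    return pd.read_csv(pandas_test_csv_url(file_name))
```

For example, load_pandas_test_csv('tips.csv') should return the tips dataset, assuming that file is still present in the folder.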

scrape wiki tables

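Next, we can load data with Pandas by scraping Wikipedia. The read_html function returns a list of DataFrames, one per table found on the page, so the index [2] selects the third table: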

pd.read_html('https://en.wikipedia.org/wiki/Population_growth')[2]

create datasets

We can create random or fake datasets with Pandas by:
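
A minimal sketch of one way to do this, assuming NumPy is available alongside Pandas. The column names, categories and value ranges are made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # fixed seed so the data is reproducible
n_rows = 100

fake_df = pd.DataFrame({
    # one row per day starting from an arbitrary date
    "date": pd.date_range("2023-01-01", periods=n_rows, freq="D"),
    # random label drawn from a small fixed set
    "category": rng.choice(["a", "b", "c"], size=n_rows),
    # random integers from 1 to 49 (the upper bound 50 is exclusive)
    "quantity": rng.integers(1, 50, size=n_rows),
    # normally distributed values rounded to 2 decimals
    "price": rng.normal(loc=20.0, scale=5.0, size=n_rows).round(2),
})

print(fake_df.head())
```

Changing n_rows or the seed gives a dataset of a different size or with different values, which is handy for testing code before real data is available.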

5.3 seaborn - visualization datasets

Seaborn offers free datasets which are good for visualization. With a single line of code we can get a DataFrame suitable for data wrangling and visualization:

import seaborn as sns
df = sns.load_dataset('flights')

All datasets available from seaborn library: seaborn-data.

scikit-learn - machine learning

We can get sample datasets from scikit-learn with functions like load_iris:

from sklearn.datasets import load_iris
iris = load_iris()
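
By default load_iris returns NumPy arrays wrapped in a Bunch object. As a sketch, assuming a scikit-learn version that supports the as_frame parameter (0.23 or newer) and that Pandas is installed, the same data can be obtained as a DataFrame. The extra species column is added here only for readability:

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

# iris.frame holds the four feature columns plus the numeric 'target' column
iris_df = iris.frame.copy()

# Map the numeric target (0, 1, 2) to the species names for readability
target_names = {i: str(name) for i, name in enumerate(iris.target_names)}
iris_df["species"] = iris_df["target"].map(target_names)

print(iris_df.head())
```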

To find more sample datasets from sklearn we can use the following code:

from sklearn import datasets
dir(datasets)

This will list all available options like:

'load_sample_images',
 'load_svmlight_file',
 'load_svmlight_files',
 'load_wine',
 'make_biclusters',
 'make_blobs',
 'make_checkerboard',
 'make_circles',
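
The names in this listing follow a prefix convention: load_ functions return small datasets bundled with the library, fetch_ functions download larger ones, and make_ functions generate synthetic data. A short sketch that groups the names by prefix and generates one synthetic dataset with make_blobs; the parameter values are arbitrary choices for illustration:

```python
from sklearn import datasets

# Public names only, grouped by the prefix convention
names = [n for n in dir(datasets) if not n.startswith("_")]
loaders = [n for n in names if n.startswith("load_")]     # bundled datasets
fetchers = [n for n in names if n.startswith("fetch_")]   # downloaded on demand
generators = [n for n in names if n.startswith("make_")]  # synthetic data

# Generate 100 two-dimensional points spread around 3 cluster centers
X, y = datasets.make_blobs(n_samples=100, centers=3, n_features=2, random_state=0)

print(len(loaders), len(fetchers), len(generators))
print(X.shape, y.shape)
```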

You can find more about scikit-learn datasets at this link: sklearn.datasets: Datasets.

5.4 dataprep - data analysis

To load a dataset we can use the function load_dataset:

from dataprep.datasets import load_dataset
df = load_dataset("titanic")

To list the available datasets we can use:

from dataprep.datasets import get_dataset_names
get_dataset_names()

which returns several dataset names, such as:

['waste_hauler',
 'wine-quality-red',
 'countries',
 'house_prices_train',
 'iris',
 'adult',
 'covid19',
 'titanic',
 'patient_info',
 'house_prices_test']

More information about dataprep datasets: Datasets DataPrep

6. Conclusion

In this article, we covered free dataset sources and discussed common ways to download datasets from them. Through practical examples, we learned how to download and use those datasets in Python and Pandas.

We covered different Python libraries which offer public datasets for learning. Finally, we covered how to create test datasets with fake data.

These datasets and ideas should be sufficient for practicing and learning data science.