In this post you can find free public datasets for Data Science projects. There is a large number of datasets covering different areas: machine learning, presentation, data analysis and visualization.
You can find information for:
- Data sources - large dataset collections which have curated data and advanced search
- Sample datasets - datasets good for data analysis; a starting point for beginners who would like to learn Data Science
- Datasets resources - useful resources for datasets which can be loaded easily
- Datasets from Python libraries - load datasets with a single line of code from different Python libraries like 'seaborn'
This post will be updated on a regular basis, so please suggest new ideas and datasets in the comment section below.
1. Dataset Sources
Below is a table of dataset collections. Most of them offer advanced search by:
- file type
- size
- number of rows
- tags
# | site | description |
---|---|---|
1 | https://www.kaggle.com/datasets/ | https://www.kaggle.com/docs/datasets |
2 | https://datasetsearch.research.google.com/ | Google Dataset Search |
3 | https://azure.microsoft.com/en-us/services/open-datasets/ | Azure Open Datasets |
4 | https://www.openml.org/search?type=data | 4325 datasets found (verified) |
5 | https://grouplens.org/datasets/ | several datasets |
6 | https://datahub.io/search | thousands of datasets |
Other sources:
2. Datasets samples
Several datasets used on this site to demonstrate data science basics are listed below:
# | dataset | size (rows, columns) | link | description |
---|---|---|---|---|
1 | the-movies-dataset | 45466, 24 | https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset | https://grouplens.org/datasets/movielens/latest/ |
2 | Food Recipes | 8009, 16 | https://www.kaggle.com/datasets/sarthak71/food-recipes | |
3. Datasets resources
- 93 Datasets That Load With A Single Line of Code
- Wikidata - the free knowledge base with 99,946,076 data items
- UN Data
4. Read Kaggle Datasets
To read Kaggle datasets we can use the Python library kaggle. To download a single file from a Kaggle dataset with Python code we can use the method dataset_download_file:
import kaggle
kaggle.api.authenticate()
kaggle.api.dataset_download_file('dorianlazar/medium-articles-dataset', file_name='medium_data.csv', path='data/')
For more information and examples refer to: How to Search and Download Kaggle Dataset to Pandas DataFrame
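Once the file is downloaded, it still has to be read into Pandas. The sketch below is one possible way to do that. The helper name read_downloaded_csv is hypothetical (it is not part of the kaggle library), and it assumes that the file may be saved either as a plain CSV or as a zipped copy with a ".zip" suffix, which is worth checking in your own data/ folder:

```python
from pathlib import Path

import pandas as pd


def read_downloaded_csv(path, file_name):
    """Hypothetical helper: load a downloaded CSV, plain or zipped."""
    plain = Path(path) / file_name
    zipped = Path(path) / (file_name + ".zip")
    if plain.exists():
        return pd.read_csv(plain)
    if zipped.exists():
        # pandas infers zip compression from the ".zip" extension
        # (the archive must contain a single file)
        return pd.read_csv(zipped)
    raise FileNotFoundError(f"{file_name} not found in {path}")
```

With the download example above, the call would be read_downloaded_csv('data/', 'medium_data.csv').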
5. Load Datasets by Python libraries
In this section we can find several useful datasets for different purposes like:
- machine learning
- visualization
- testing
- creating your own datasets with fake data
5.1 datasets - machine learning
The Python library datasets offers a large number of free and easy to use datasets. It can be installed by:
pip install datasets
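To list all available datasets we can use the function datasets.list_datasets(). Recent releases of the library deprecate this function in favor of huggingface_hub.list_datasets, so it may not be available in newer installations: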
from datasets import list_datasets, load_dataset
print(list_datasets())
It will return more than 7000 datasets.
To load a dataset we can use the function datasets.load_dataset(dataset_name, **kwargs):
squad_dataset = load_dataset('squad')
squad_dataset
This gives us two datasets:
- training dataset
- validation dataset
DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})
To access the dataset for training we can use: squad_dataset['train'].
Finally we can load it as a Pandas DataFrame by:
import pandas as pd
pd.DataFrame(squad_dataset['train'])
5.2 pandas - test datasets
Pandas offers multiple ways to download datasets with a single line of code. Let's cover a few of them, starting with the test data in the Pandas GitHub repository:
scrape wiki tables
Next we can load data into Pandas by scraping Wikipedia:
pd.read_html('https://en.wikipedia.org/wiki/Population_growth')[2]
create datasets
The following articles show how to create random or fake datasets with Pandas:
- How To Make a Fake Data Set in Python and Pandas
- How to Easily Create Dummy DataFrame with Test Data?
- How to Create a Pandas DataFrame of Random Integers
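As a rough illustration of the idea behind these articles, the sketch below builds a small random dataset with Pandas and NumPy. The column names, value ranges and seed are arbitrary choices made for this example and do not come from the linked posts:

```python
import numpy as np
import pandas as pd

# Seeded generator so the same "random" dataset is produced on every run
rng = np.random.default_rng(seed=42)

n_rows = 100
df_fake = pd.DataFrame({
    "id": np.arange(1, n_rows + 1),
    # integers() excludes the upper bound, so ages fall in 18..79
    "age": rng.integers(18, 80, size=n_rows),
    "score": rng.normal(loc=50, scale=10, size=n_rows).round(2),
    "group": rng.choice(["a", "b", "c"], size=n_rows),
    "date": pd.date_range("2022-01-01", periods=n_rows, freq="D"),
})

print(df_fake.head())
```

Fixing the seed keeps the fake data reproducible, which helps when the dataset is used in tests or shared examples.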
5.3 seaborn - visualization datasets
Seaborn offers free datasets which are good for visualization. With a single line of code we can get a DataFrame suitable for data wrangling and visualization:
import seaborn as sns
df = sns.load_dataset('flights')
All datasets available from the seaborn library: seaborn-data. Their names can also be listed with sns.get_dataset_names().
scikit-learn - machine learning
We can get sample datasets from scikit-learn (imported as sklearn) with functions like load_iris:
from sklearn.datasets import load_iris
iris = load_iris()
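By default load_iris returns a dictionary-like Bunch object with NumPy arrays. If a Pandas DataFrame is preferred, a minimal sketch, assuming scikit-learn 0.23 or newer where the as_frame parameter is available, could look like this:

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return Pandas objects instead of arrays
iris = load_iris(as_frame=True)

# .frame holds the four feature columns plus a 'target' column
df_iris = iris.frame

print(df_iris.shape)
print(list(iris.target_names))
```

The target column stores integer class codes. The names in iris.target_names map those codes to species.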
To find more sample datasets from sklearn we can use the following code:
from sklearn import datasets
dir(datasets)
This will list all available options like:
'load_sample_images',
'load_svmlight_file',
'load_svmlight_files',
'load_wine',
'make_biclusters',
'make_blobs',
'make_checkerboard',
'make_circles',
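Since dir() returns every public name in the module, it can help to group the results by prefix. The sketch below assumes the usual scikit-learn naming convention: load_ functions mostly return small bundled datasets, fetch_ functions download larger ones, and make_ functions generate synthetic data:

```python
from sklearn import datasets

names = dir(datasets)

# Note: a few load_ functions, such as load_svmlight_file, read a file
# from disk instead of returning a bundled dataset
loaders = [n for n in names if n.startswith("load_")]
fetchers = [n for n in names if n.startswith("fetch_")]
generators = [n for n in names if n.startswith("make_")]

print(loaders)
print(fetchers)
print(generators)
```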
You can find more about scikit-learn datasets at this link: sklearn.datasets: Datasets.
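The make_ functions listed above generate synthetic data instead of loading a stored dataset. As a short sketch, make_blobs can produce labeled clusters that fit into a DataFrame. The column names x1, x2 and cluster are chosen only for this example:

```python
import pandas as pd
from sklearn.datasets import make_blobs

# 200 two-dimensional points grouped around 3 cluster centers
X, y = make_blobs(n_samples=200, centers=3, n_features=2, random_state=0)

df_blobs = pd.DataFrame(X, columns=["x1", "x2"])
df_blobs["cluster"] = y  # integer label of the generating center

print(df_blobs.head())
```

Setting random_state makes the generated points identical between runs, which is useful for tutorials and tests.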
5.4 dataprep - data analysis
To load a dataset we can use the function load_dataset:
from dataprep.datasets import load_dataset
df = load_dataset("titanic")
To list the available datasets we can use:
from dataprep.datasets import get_dataset_names
get_dataset_names()
which returns several dataset names like:
['waste_hauler',
'wine-quality-red',
'countries',
'house_prices_train',
'iris',
'adult',
'covid19',
'titanic',
'patient_info',
'house_prices_test']
More information about dataprep datasets: Datasets DataPrep
6. Conclusion
In this article, we covered free dataset sources and discussed common ways to download datasets from them. Through practical examples, we learned how to download and use those datasets in Python and Pandas.
We covered different Python libraries which offer public datasets for learning. Finally, we covered how to create test datasets with fake data.
These datasets and ideas should be sufficient for practicing and learning data science.