How to Search and Download Kaggle Dataset to Pandas DataFrame
In this post, we'll take a brief look at Kaggle Datasets and how to download and import them with Python. By the end, we'll see how to search for datasets, download a single file or a whole dataset, and read the result into a Pandas DataFrame.
Step 1: Create Kaggle API token
First you will need to visit Kaggle and create a new account. You can sign up with your Google account.
To create a new Kaggle API token, follow these steps:
- Click your profile picture (top right)
- Account - the URL is: https://www.kaggle.com/<username>/account
- API
- Create new API Token
- This will generate the file kaggle.json
- Place the file in your home folder as: ~/.kaggle/kaggle.json
- For more security (optional): chmod 600 ~/.kaggle/kaggle.json
More info is available on this link: Kaggle API
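The placement and permission steps above can be double-checked from Python before calling the API. The sketch below is only an illustration: `check_kaggle_token` is a hypothetical helper name, and it assumes the token file is JSON with `username` and `key` entries. The permission check is meaningful on POSIX systems only.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(path=None):
    """Return a list of problems found with the Kaggle token file (empty if none)."""
    # Default location described in the steps above.
    default = os.path.expanduser("~/.kaggle/kaggle.json")
    token = Path(path) if path else Path(default)
    if not token.is_file():
        return [f"{token} does not exist"]

    try:
        data = json.loads(token.read_text())
    except json.JSONDecodeError:
        return [f"{token} is not valid JSON"]

    problems = []
    # Assumption: the token holds "username" and "key" entries.
    for field in ("username", "key"):
        if not isinstance(data, dict) or field not in data:
            problems.append(f"missing field: {field}")

    # Any group/other permission bit means the file is exposed beyond the owner.
    mode = stat.S_IMODE(token.stat().st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(f"permissions are {oct(mode)}, consider chmod 600")
    return problems


if __name__ == "__main__":
    print(check_kaggle_token() or "token file looks fine")
```

An empty list means the file was found, parsed, and is readable by the owner only.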
Step 2: Install Python's package for Kaggle
Next we are going to install the package that downloads the datasets from Kaggle. You can install the kaggle package in a virtual environment with:
pip install kaggle
or for the user:
pip install --user kaggle
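To confirm the package landed in the interpreter you are actually running, one option is to look it up without importing it, since importing `kaggle` may try to authenticate right away. This is a minimal sketch; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata, util


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    # find_spec locates a top-level package without running its import-time code.
    if util.find_spec(package) is None:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata under this exact name.
        return "unknown"


if __name__ == "__main__":
    print("kaggle:", installed_version("kaggle") or "not installed")
```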
Step 3: Download single file from Kaggle dataset
Now we are going to demonstrate how to download a single CSV file from a Kaggle dataset. This works only if the previous steps were completed successfully:
import kaggle
kaggle.api.authenticate()
kaggle.api.dataset_download_file('dorianlazar/medium-articles-dataset', file_name='medium_data.csv', path='data/')
In the example above we download the file medium_data.csv from the dataset dorianlazar/medium-articles-dataset. The file is saved in the folder data/.
The file can be read by:
import pandas as pd
pd.read_csv('data/medium_data.csv.zip')
which returns the contents of the file as a DataFrame.
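Depending on the file and on the version of the kaggle package, the download may arrive as `medium_data.csv` or as `medium_data.csv.zip`. A small sketch that accepts either name is shown below. `find_download` and `load_download` are hypothetical helper names, not part of the Kaggle API. Pandas infers zip compression from the `.zip` extension, so the same `read_csv` call covers both cases as long as the archive holds a single file.

```python
from pathlib import Path


def find_download(folder, file_name):
    """Return the path of a downloaded file, zipped or not, or None if absent."""
    folder = Path(folder)
    # Prefer the plain file, then fall back to the zipped variant.
    for candidate in (folder / file_name, folder / (file_name + ".zip")):
        if candidate.is_file():
            return candidate
    return None


def load_download(folder, file_name):
    """Read the downloaded CSV into a DataFrame (needs pandas installed)."""
    import pandas as pd  # imported here so find_download works without pandas

    path = find_download(folder, file_name)
    if path is None:
        raise FileNotFoundError(f"{file_name} not found in {folder}")
    return pd.read_csv(path)


if __name__ == "__main__":
    print(find_download("data", "medium_data.csv"))
```

With the download from this step in place, `load_download('data', 'medium_data.csv')` would return the DataFrame.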
Step 4: Download multiple files from Kaggle dataset
If we would like to get all files from a Kaggle dataset, we can download them with:
import kaggle
kaggle.api.authenticate()
kaggle.api.dataset_download_files('dorianlazar/medium-articles-dataset', path='data/')
Note that this might be pretty slow for big datasets. We are downloading all files from the dataset mentioned above.
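With the default arguments, the call above should leave a single zip archive in data/. The sketch below reads every CSV inside such an archive into its own DataFrame. The helper names `csv_members` and `read_all_csvs` are hypothetical, and the archive path in the usage note is an assumption based on the dataset name.

```python
import zipfile


def csv_members(zip_path):
    """List the names of the CSV files stored in a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return [n for n in archive.namelist() if n.lower().endswith(".csv")]


def read_all_csvs(zip_path):
    """Return {member name: DataFrame} for every CSV in the archive (needs pandas)."""
    import pandas as pd  # local import keeps csv_members usable without pandas

    frames = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in csv_members(zip_path):
            # Read straight from the archive, without extracting to disk.
            with archive.open(name) as handle:
                frames[name] = pd.read_csv(handle)
    return frames
```

For example, `read_all_csvs('data/medium-articles-dataset.zip')` would return one DataFrame per CSV file, assuming the archive is saved under that name.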
Step 5: List and search Kaggle datasets with API
Finally, let's see how to list and search for Kaggle datasets. This can be done with the following command (the leading ! runs it from a Jupyter notebook cell):
!kaggle datasets list -s article
Here we are searching for the keyword article. The output is:
ref | title | size | lastUpdated | downloadCount | voteCount | usabilityRating
---|---|---|---|---|---|---
dorianlazar/medium-articles-dataset | Medium articles dataset | 1GB | 2020-06-30 14:13:56 | 1804 | 103 | 0.9411765
hsankesara/medium-articles | Medium Articles | 1MB | 2018-06-17 08:45:49 | 1983 | 72 | 0.88235295
gspmoreira/articles-sharing-reading-from-cit-deskdrop | Articles sharing and reading from CI&T DeskDrop | 8MB | 2017-08-27 21:33:01 | 10062 | 135 | 0.8235294
jkkphys/english-wikipedia-articles-20170820-sqlite | English Wikipedia Articles 2017-08-20 SQLite | 7GB | 2018-11-27 21:54:22 | 1417 | 84 | 0.875
asad1m9a9h6mood/news-articles | News Articles | 2MB | 2017-04-30 11:02:29 | 2731 | 31 | 0.8235294
residentmario/wikipedia-article-titles | Wikipedia Article Titles | 73MB | 2017-09-22 16:42:20 | 726 | 26 | 0.75
abhishek/10k-german-news-articles | 10k German News Articles | 123MB | 2019-11-07 08:50:32 | 552 | 89 | 0.8235294
yufengdev/bbc-fulltext-and-category | BBC articles fulltext and category | 2MB | 2018-06-08 05:44:22 | 3799 | 35 | 0.64705884
danofer/dbpedia-classes | DBPedia Classes | 166MB | 2019-07-04 11:30:52 | 979 | 26 | 1.0
vetrirah/janatahack-independence-day-2020-ml-hackathon | NLP on Research Articles | 11MB | 2020-08-19 14:35:13 | 309 | 24 | 1.0
jkkphys/english-wikipedia-articles-20170820-models | English Wikipedia Articles 2017-08-20 Models | 925MB | 2018-11-28 17:09:32 | 379 | 19 | 0.8125
blessondensil294/topic-modeling-for-research-articles | Topic Modeling for Research Articles | 11MB | 2020-08-18 08:53:26 | 321 | 21 | 1.0
szymonjanowski/internet-articles-data-with-users-engagement | Internet news data with readers engagement | 3MB | 2020-11-21 17:09:57 | 4069 | 330 | 0.9411765
maxscheijen/dutch-news-articles | Dutch News Articles | 135MB | 2021-05-24 08:01:12 | 104 | 13 | 1.0
aiswaryaramachandran/medium-articles-with-content | Medium Articles (with Content) | 218MB | 2018-11-10 18:17:46 | 569 | 29 | 0.7352941
jeet2016/us-financial-news-articles | US Financial News Articles | 1GB | 2018-09-05 01:27:43 | 2530 | 54 | 0.625
urbanbricks/wikipedia-promotional-articles | Wikipedia Promotional Articles | 201MB | 2019-10-27 16:31:06 | 276 | 15 | 1.0
hkapoor/indian-financial-news-articles-20032020 | Indian financial news articles (2003-2020) | 3MB | 2020-05-26 20:41:29 | 233 | 25 | 1.0
naharrison/particle-identification-from-detector-responses | Particle Identification from Detector Responses | 83MB | 2018-10-24 21:14:33 | 298 | 28 | 0.7058824
zshujon/40k-bangla-newspaper-article | 40k Bangla Newspaper Article | 64MB | 2018-09-22 09:54:40 | 276 | 10 | 0.5625
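To work with the search results in a script rather than reading them from the terminal, the listing can be requested as CSV and parsed. This is a sketch under two assumptions: the kaggle command is on the PATH, and the installed version accepts the --csv flag for `datasets list`. `parse_listing` and `search_datasets` are hypothetical helper names.

```python
import csv
import io
import subprocess


def parse_listing(csv_text):
    """Turn CSV text from the dataset listing into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def search_datasets(keyword):
    """Run the kaggle CLI search and return the parsed rows."""
    # Assumption: `kaggle datasets list` accepts --csv for machine-readable output.
    result = subprocess.run(
        ["kaggle", "datasets", "list", "-s", keyword, "--csv"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_listing(result.stdout)
```

Calling `search_datasets('article')` would return one dictionary per dataset, with all values as strings. Passing that list to `pd.DataFrame(...)` gives a DataFrame of the search results.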