In this short guide, I'll show you how to extract a list of keywords matched in a Pandas DataFrame column and store the result in one or more new columns.
In particular, I'll show you how to return the keywords found in a given text column.
The image below shows the final outcome:
To start with a simple example, let's say that you have the following Pandas DataFrame:
url | title | keyword |
---|---|---|
https://towardsdatascience.com/training-on-batch-how-to-split-data-effectively-3234f3918b07 | Training on batch: how to split data effectively? | how, data |
https://uxdesign.cc/design-and-data-how-to-humanize-data-32a03079311f | Design & Data: how to humanize data | how, data |
https://medium.com/swlh/fyre-festival-achieved-perfect-product-market-fit-and-thats-why-we-should-question-the-lean-a6a45fcb735a | Fyre Festival achieved perfect product-market fit (and that's why we should question the Lean Startup and VC dogma) | question |
https://uxdesign.cc/using-a-sneak-attack-question-during-your-designer-interviews-b918ff600977 | Using a ‘sneak attack’ question during your Designer interviews | question |
https://medium.com/swlh/a-question-you-may-never-have-asked-culture-fit-or-culture-add-cf65b00770cf | A question you may never have asked — culture fit or culture add? | question |
Notebook with the code: Extract list of keywords from a column in Pandas
Step 1: Read test DataFrame from Kaggle
The DataFrame above is available from Kaggle.
If you would like to learn more about reading a Kaggle dataset into a Pandas DataFrame, check this article: How to Search and Download Kaggle Dataset to Pandas DataFrame
For this article, we will use the following code to download the file:
import kaggle
kaggle.api.authenticate()
kaggle.api.dataset_download_file('dorianlazar/medium-articles-dataset', file_name='medium_data.csv', path='data/')
Then read it with:
import pandas as pd
df = pd.read_csv('data/medium_data.csv.zip')
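If you don't have Kaggle credentials set up, the steps below can still be followed on a small stand-in DataFrame. This is only a sketch: the name sample_df is my own, and the three titles are copied from the table above rather than loaded from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle file: titles copied from the table above.
sample_df = pd.DataFrame({
    'title': [
        'Training on batch: how to split data effectively?',
        'Design & Data: how to humanize data',
        "Fyre Festival achieved perfect product-market fit "
        "(and that's why we should question the Lean Startup and VC dogma)",
    ]
})

print(sample_df.shape)
```

Assigning sample_df to df lets the code in the next steps run unchanged.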
Step 2: Extract list of keywords from a column to new column
Next, we will define the list of keywords that we want to extract:
keywords = ['how', 'data', 'question', 'guide']
Then we extract the matches and store them in a new column:
df['keyword'] = df['title'].str.findall('|'.join(keywords)).apply(set).str.join(', ')
At this point, the new column keyword will contain all keywords found in the title, separated by commas.
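One thing to keep in mind: the pattern above matches substrings and is case-sensitive, so how is also found inside "Show" while "Data" is not matched, and because a set is used, the order of the joined keywords is not guaranteed. A possible variant, assuming you want whole-word, case-insensitive matches in the order of the keywords list, could look like the sketch below. The helper extract_keywords and the extra test titles are my own additions for illustration.

```python
import re
import pandas as pd

keywords = ['how', 'data', 'question', 'guide']

# Small stand-in DataFrame; in the article, df comes from the Kaggle CSV.
df = pd.DataFrame({'title': [
    'Training on batch: how to split data effectively?',
    'Design & Data: how to humanize data',
    'Show me the database',   # contains "how" and "data" only as substrings
    None,                     # missing title
]})

# \b...\b restricts matches to whole words; re.escape protects special characters.
pattern = r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b'

def extract_keywords(title):
    """Return matched keywords in the order of the keywords list ('' if none)."""
    if not isinstance(title, str):
        return ''
    found = {m.lower() for m in re.findall(pattern, title, flags=re.IGNORECASE)}
    return ', '.join(k for k in keywords if k.lower() in found)

df['keyword'] = df['title'].apply(extract_keywords)
print(df)
```

Using apply with a plain Python function is slower than the vectorized string methods, but it keeps the handling of missing values and ordering explicit.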
Step 3: Extract list of keywords to multiple columns
To extract each keyword into its own column, use the following syntax:
for keyword in keywords:
    df[keyword] = df['title'].str.contains(keyword)
This will iterate over the list of keywords and create a new column for each one, containing True or False depending on whether the keyword is found in the title.
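Note that str.contains treats the keyword as a regular expression by default, matches substrings, and returns a missing value for rows where the title is missing. If that is not what you need, a hedged variant of the loop could look like this. The whole-word pattern and the sample titles are assumptions of mine, not part of the original dataset code.

```python
import re
import pandas as pd

keywords = ['how', 'data', 'question', 'guide']

# Small stand-in DataFrame; in the article, df comes from the Kaggle CSV.
df = pd.DataFrame({'title': [
    'Training on batch: how to split data effectively?',
    'A Guide to Pandas',
    None,  # missing title
]})

for keyword in keywords:
    # Whole-word pattern; re.escape keeps special characters literal.
    pattern = r'\b' + re.escape(keyword) + r'\b'
    # case=False ignores letter case; na=False turns missing titles into False.
    df[keyword] = df['title'].str.contains(pattern, case=False, regex=True, na=False)

print(df)
```

With na=False, every new column holds only True or False, which makes later filtering such as df[df['how']] straightforward.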
If you would like to style your DataFrame in the same way, please check: How to style boolean values by different colors in Pandas
The final result is shown in the image below: