How to Add New Column Based on List of Keywords in Pandas DataFrame

In this short guide, I'll show you how to find which keywords from a given list appear in a Pandas DataFrame column and store the matches in one or more new columns.

In particular, I'll show you how to return the matched keywords for each row of a given column.

The image below shows the final outcome:

[Image: new column added from a list of keywords in a Pandas DataFrame]

To start with a simple example, let's say that you have the following Pandas DataFrame:

| url | title | keyword |
| --- | --- | --- |
| https://towardsdatascience.com/training-on-batch-how-to-split-data-effectively-3234f3918b07 | Training on batch: how to split data effectively? | how, data |
| https://uxdesign.cc/design-and-data-how-to-humanize-data-32a03079311f | Design & Data: how to humanize data | how, data |
| https://medium.com/swlh/fyre-festival-achieved-perfect-product-market-fit-and-thats-why-we-should-question-the-lean-a6a45fcb735a | Fyre Festival achieved perfect product-market fit (and that's why we should question the Lean Startup and VC dogma) | question |
| https://uxdesign.cc/using-a-sneak-attack-question-during-your-designer-interviews-b918ff600977 | Using a ‘sneak attack’ question during your Designer interviews | question |
| https://medium.com/swlh/a-question-you-may-never-have-asked-culture-fit-or-culture-add-cf65b00770cf | A question you may never have asked — culture fit or culture add? | question |

Notebook with the code: Extract list of keywords from a column in Pandas

Step 1: Read test DataFrame from Kaggle

The DataFrame above is available from Kaggle.

If you would like to learn more about reading a Kaggle dataset into a Pandas DataFrame, check this article: How to Search and Download Kaggle Dataset to Pandas DataFrame

For this article we will use the following code to download the file:

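import kaggle

# Authenticate with the Kaggle API credentials configured on this machine
kaggle.api.authenticate()
# Download a single file from the dataset into the data/ folder
kaggle.api.dataset_download_file('dorianlazar/medium-articles-dataset', file_name='medium_data.csv', path='data/')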

Then read it with:

import pandas as pd
df = pd.read_csv('data/medium_data.csv.zip')
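If you prefer to follow along without Kaggle credentials, a small stand-in DataFrame is enough for the next steps. The sketch below is only an assumption for illustration: it rebuilds a `title` column from four of the titles shown in the table above, and it does not contain the other columns of the real file.

```python
import pandas as pd

# Hypothetical stand-in for medium_data.csv: only a 'title' column,
# filled with four of the titles from the example table.
df = pd.DataFrame({
    'title': [
        'Training on batch: how to split data effectively?',
        'Design & Data: how to humanize data',
        "Fyre Festival achieved perfect product-market fit (and that's why "
        "we should question the Lean Startup and VC dogma)",
        'Using a ‘sneak attack’ question during your Designer interviews',
    ]
})

print(df.shape)
```

The code in the following steps only relies on the `title` column, so it should run on this frame as well as on the full dataset.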

Step 2: Extract list of keywords from a column to new column

First, we define the list of keywords that we would like to extract:

keywords = ['how', 'data', 'question', 'guide']

Then we extract the matches from the title column and store them in a new column:

df['keyword'] = df['title'].str.findall('|'.join(keywords)).apply(set).str.join(', ')

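At this point the new column keyword contains all keywords found in the title, separated by commas. Because the matches are passed through set, each keyword appears only once per row, but the order of the keywords within a row is not guaranteed.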
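Note that the pattern above matches substrings and is case-sensitive, so `how` would also match inside a word like "show", and "Data" would not count as `data`. If that matters for your data, one possible variant (a sketch, not the only way to do it) escapes the keywords, adds word boundaries, ignores case, and keeps the keywords in first-seen order. The helper `unique_in_order` and the sample titles are hypothetical names introduced just for this illustration.

```python
import re
import pandas as pd

keywords = ['how', 'data', 'question', 'guide']

# Hypothetical titles used only to illustrate the matching rules.
df = pd.DataFrame({'title': [
    'Design & Data: how to humanize data',
    'Show me the Guide',
    'Nothing to see here',
]})

# Escape each keyword and wrap the alternation in word boundaries,
# so that 'how' does not match inside 'Show'.
pattern = r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b'

def unique_in_order(matches):
    # Lowercase the matches and drop duplicates, keeping first-seen order.
    return ', '.join(dict.fromkeys(m.lower() for m in matches))

df['keyword'] = (
    df['title']
    .str.findall(pattern, flags=re.IGNORECASE)
    .apply(unique_in_order)
)

print(df['keyword'].tolist())  # ['data, how', 'guide', '']
```

Rows without any match end up with an empty string, the same as in the original one-liner.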

Step 3: Extract list of keywords to multiple columns

To give each keyword its own column, use the following syntax:

for keyword in keywords:
    df[keyword] = df['title'].str.contains(keyword)

This iterates over the list of keywords and, for each one, creates a new column named after the keyword that holds True if the title contains it and False otherwise.
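By default `str.contains` treats the keyword as a regular expression, is case-sensitive, and returns a missing value for rows where the title itself is missing. A minimal sketch of a stricter variant, assuming you want literal, case-insensitive matching and plain False for missing titles, is shown below. The sample titles and the extra `n_keywords` column are assumptions added for illustration.

```python
import pandas as pd

keywords = ['how', 'data', 'question', 'guide']

# Hypothetical titles, including one missing value.
df = pd.DataFrame({'title': [
    'Training on batch: how to split data effectively?',
    'A Question of Culture',
    None,
]})

for keyword in keywords:
    # regex=False: match the keyword literally
    # case=False:  ignore upper/lower case
    # na=False:    treat missing titles as "no match"
    df[keyword] = df['title'].str.contains(
        keyword, case=False, regex=False, na=False
    )

# Optional: count how many of the keywords each title matched.
df['n_keywords'] = df[keywords].sum(axis=1)

print(df[keywords + ['n_keywords']])
```

With `na=False` the keyword columns hold only True and False, which makes them safe to use directly as boolean filters, for example `df[df['question']]`.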

If you would like to style your DataFrame in the same way, please check: How to style boolean values by different colors in Pandas

The final result is shown in the image below: