In this short guide, I'll show you how to extract the keywords matched in a Pandas DataFrame column and store them in new column(s).

In particular, I'll show you how to return keywords from a given column.

The image below shows the final outcome:


To start with a simple example, let's say that you have the following Pandas DataFrame:

| url | title | keyword |
|---|---|---|
| | Training on batch: how to split data effectively? | how, data |
| | Design & Data: how to humanize data | how, data |
| | Fyre Festival achieved perfect product-market fit (and that's why we should question the Lean Startup and VC dogma) | question |
| | Using a ‘sneak attack’ question during your Designer interviews | question |
| | A question you may never have asked — culture fit or culture add? | question |

Notebook with the code: Extract list of keywords from a column in Pandas

Step 1: Read test DataFrame from Kaggle

The DataFrame above is available from Kaggle.

If you would like to learn more about how to read a Kaggle dataset as a Pandas DataFrame, check this article: How to Search and Download Kaggle Dataset to Pandas DataFrame

For this article we will use the following code to download it:

import kaggle

kaggle.api.dataset_download_file('dorianlazar/medium-articles-dataset', file_name='medium_data.csv', path='data/')

Then read the downloaded file into a DataFrame (point the path at the file saved under data/):

import pandas as pd
df = pd.read_csv('data/')
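If you cannot download the dataset, the rest of the guide can still be followed on a small stand-in DataFrame. The sketch below is an assumption for illustration only: it is not the Kaggle file, and it reproduces just the title column with the five example titles from the start of this guide.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle data: only the 'title' column,
# filled with the five example titles shown at the start of the guide.
df = pd.DataFrame({
    'title': [
        'Training on batch: how to split data effectively?',
        'Design & Data: how to humanize data',
        "Fyre Festival achieved perfect product-market fit (and that's why "
        "we should question the Lean Startup and VC dogma)",
        'Using a ‘sneak attack’ question during your Designer interviews',
        'A question you may never have asked — culture fit or culture add?',
    ]
})

# Quick sanity check: number of rows and columns
print(df.shape)
```

The real dataset has more columns (for example url), so treat this only as a way to try the next steps.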

Step 2: Extract list of keywords from a column to new column

At this point we will define the list of keywords that we would like to extract:

keywords = ['how', 'data', 'question', 'guide']

Then we extract the matches and store them in a new column:

df['keyword'] = df['title'].str.findall('|'.join(keywords)).apply(set).str.join(', ')

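At this point the new column keyword contains every keyword found in the title, separated by commas. Because the matches pass through set, each keyword appears only once and the order is not guaranteed. The pattern is a plain regular expression, so it is case-sensitive and also matches inside longer words (for example, how inside show).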
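If you need whole-word, case-insensitive matching with a stable keyword order, a variant along the following lines may help. This is a minimal sketch, not part of the original notebook: the helper join_unique and the sample titles are hypothetical, and the choices to lowercase matches and to return an empty string for missing titles are assumptions.

```python
import re
import pandas as pd

keywords = ['how', 'data', 'question', 'guide']

# Small illustrative DataFrame (hypothetical rows, including a missing title)
df = pd.DataFrame({'title': [
    'Training on batch: how to split data effectively?',
    'Design & Data: how to humanize data',
    'Show me the database',
    None,
]})

# \b ... \b restricts matches to whole words; re.escape protects
# keywords that contain regex metacharacters such as '+' or '.'
pattern = r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b'

def join_unique(matches):
    # str.findall yields a missing value (not a list) for missing titles
    if not isinstance(matches, list):
        return ''
    # dict.fromkeys drops duplicates while keeping first-seen order
    return ', '.join(dict.fromkeys(m.lower() for m in matches))

df['keyword'] = (
    df['title']
    .str.findall(pattern, flags=re.IGNORECASE)
    .apply(join_unique)
)

print(df['keyword'].tolist())
```

Here 'Show me the database' produces no keywords, because how and data only occur inside longer words.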

Step 3: Extract list of keywords to multiple columns

To extract the list of keywords to different columns use the following syntax:

for keyword in keywords:
    df[keyword] = df['title'].str.contains(keyword)

This iterates over the list of keywords and creates one new column per keyword, holding True if the title contains the keyword and False otherwise.
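The loop can be tried on a few rows to see the resulting columns. The sketch below is illustrative only: the sample titles are hypothetical, and the extra arguments case=False, regex=False and na=False are optional choices (case-insensitive literal matching, with missing titles mapped to False) that the original code does not use.

```python
import pandas as pd

keywords = ['how', 'data', 'question', 'guide']

# Small illustrative DataFrame (hypothetical rows, including a missing title)
df = pd.DataFrame({'title': [
    'Training on batch: how to split data effectively?',
    'A question you may never have asked',
    None,
]})

for keyword in keywords:
    # regex=False treats the keyword as literal text,
    # case=False ignores letter case,
    # na=False returns False instead of a missing value for missing titles
    df[keyword] = df['title'].str.contains(
        keyword, case=False, regex=False, na=False
    )

print(df[keywords])
```

Each keyword becomes its own boolean column, which makes it easy to filter rows, for example df[df['question']].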

If you would like to style your DataFrame in the same way, please check: How to style boolean values by different colors in Pandas

The final result is shown in the image below: