In this quick tutorial, we'll show how to replace values with regex in Pandas DataFrame. There are several options to replace a value in a column or the whole DataFrame with regex:

  1. Regex replace string
df['applicants'].str.replace(r'\sapplicants', '')
  1. Regex replace capture group
df['applicants'].replace(to_replace=r"([0-9,\.]+)(.*)", value=r"\1", regex=True)
  1. Regex replace special characters - r'[^0-9a-zA-Z:,\s]+' - including spaces
df['internship'].str.replace(r'[^0-9a-zA-Z:,]+', '')
  1. Regex replace numbers or non-digit characters
df['applicants'].str.replace(r'\D+', '')

In the next part of the post, you'll see the steps and practical examples on how to use regex and replace in Pandas.

Step 1: Create Sample DataFrame

First, let's create a sample 'dirty' data which needs to be cleaned and replaced:

import pandas as pd

df = pd.read_csv(f'../data/internshala_dataset_raw.csv')
df
internship applicants stipend duration
Mobile App Development 49 applicants 15000-25000 /month 6 Months
React/React JS 35 applicants 5000 /month 4 Months
ReactJS Development Be an early applicant 5000 /month 6 Months
Video Online Course Development (Python) 63 applicants 1000 /month 1 Month
Internet Of Things (IoT) 140 applicants 6000 /month 6 Months

Data is real and available from Kaggle: Internship posted in computer science category.

To learn more about reading Kaggle data with Python and Pandas: How to Search and Download Kaggle Dataset to Pandas DataFrame

Step 2: Replace String Values with Regex in Column

Let's start with replacing string values in column applicants. As you can see the values in the column are mixed.

There are two options:

Replace single string value

df['applicants'].str.replace(r'\sapplicants', '', regex=True)

The result of this operation will be a Pandas Series:

['49', '35', 'Be an early applicant', '63', '140']

There are some values which we would like to replace too. This will be done in the next section

Replace multiple string value

To replace multiple values with regex in Pandas we can use the following syntax: r'(\sapplicants|Be an early applicant)' - where the values to replaced are separated by pipes - |

df_temp['applicants'].str.replace(r'(\sapplicants|Be an early applicant)', '', regex=True).to_list()

result:

['49', '35', '', '63', '140']

Replace multiple string values with different replacement

Finally for this step let's see how we can replace multiple values with different values. Let say that we would like to replace:

  • Be an early applicant with 0
  • 49 applicants with 49

in this case we can use the method replace and pass a list of values to be replaced and list of values of the replacement:

df_temp['applicants'].replace([r'(\d+) applicants', 'Be an early applicant'],[r'\1',0], regex=True)

result:

['49', '35', 0, '63', '140']

In the last example we saw using a capture group which we are going to check in the next step.

Step 3: Regex replace with capture group

In this step we will take a deeper look on regex and capture groups in Pandas.

They are powerful tool to match a pattern and extract only part of it.

Let's say that we would like to match : 63 applicants but only extract the numbers. In other words, to search for a numeric sequence followed by anything.

We are capturing two groups but will keep only the first one - the numbers.

df['applicants'].replace(to_replace=r"([0-9,\.]+)(.*)", value=r"\1", regex=True)

the result would be:

['49', '35', 'Be an early applicant', '63', '140']

As you can see the code works as expected in case of a match. Otherwise it will keep the value.

Step 4: Regex replace only special characters

What if we would like to clean or remove all special characters while keeping numbers and letters.

In that case we can use one of the next regex:

  • r'[^0-9a-zA-Z:,\s]+' - keep numbers, letters, semicolon, comma and space
  • r'[^0-9a-zA-Z:,]+' - keep numbers, letters, semicolon and comma

So the code looks like:

df['internship'].str.replace(r'[^0-9a-zA-Z:,\s]+', '', regex=True)

Will change: Internet Of Things (IoT) to Internet Of Things IoT

Step 5: Regex replace numbers or non-digit characters

Now let's check how we can** replace all non digit characters and convert the value to int or remove all numbers from a column**.

Replace all non numeric symbols and map in case of missing

In this example we are going to replace everything which is not a number with a regex. In case of a value which doesn't have a number we will map the value to 0.

df_temp['applicants'].str.replace(r'\D+', '', regex=True).replace({'':0}).astype('int')

This is done on three simple steps:

  • first replace all non numeric symbols - str.replace(r'\D+', '', regex=True)
  • second - in case of missing numbers - empty string is returned - map the empty string to 0 by .replace({'':0})
  • convert to numeric column

Replace all numbers from Pandas column

To replace all numbers from a given column you can use the next syntax:

df['applicants'].replace(to_replace=r"\d+", value=r" ", regex=True)

result:

['  applicants',
 '  applicants',
 'Be an early applicant',
 '  applicants',
 '  applicants']

Step 6: Regex replace all values in DataFrame

Finally let's find out how to replace values in the whole DataFrame - all columns and rows - with a single line of code:

To replace all numbers from a DataFrame use method replace - directly on the DataFrame:

df_temp.replace(to_replace=r"\d+", value=r" ", regex=True)

result:

internship applicants stipend duration
Mobile App Development applicants - /month Months
React/React JS applicants /month Months
ReactJS Development Be an early applicant /month Months
Video Online Course Development (Python) applicants /month Month
Internet Of Things (IoT) applicants /month Months

Resources