How to replace values with regex in Pandas

In this quick tutorial, we'll show how to replace values with regex in Pandas DataFrame. There are several options to replace a value in a column or the whole DataFrame with regex:

Regex replace string

df['applicants'].str.replace(r'\sapplicants', '')

Regex replace capture group

df['applicants'].replace(to_replace=r"([0-9,\.]+)(.*)", value=r"\1", regex=True)

Regex replace special characters - r'[^0-9a-zA-Z:,\s]+' - including spaces

df['internship'].str.replace(r'[^0-9a-zA-Z:,]+', '')

Regex replace numbers or non-digit characters

df['applicants'].str.replace(r'\D+', '')

In the next part of the post, you'll see the steps and practical examples on how to use regex and replace in Pandas.

Step 1: Create Sample DataFrame

First, let's create a sample 'dirty' data which needs to be cleaned and replaced:

import pandas as pd

df = pd.read_csv(f'../data/internshala_dataset_raw.csv')
df

internship	applicants	stipend	duration
Mobile App Development	49 applicants	15000-25000 /month	6 Months
React/React JS	35 applicants	5000 /month	4 Months
ReactJS Development	Be an early applicant	5000 /month	6 Months
Video Online Course Development (Python)	63 applicants	1000 /month	1 Month
Internet Of Things (IoT)	140 applicants	6000 /month	6 Months

Data is real and available from Kaggle: Internship posted in computer science category.

To learn more about reading Kaggle data with Python and Pandas: How to Search and Download Kaggle Dataset to Pandas DataFrame

Step 2: Replace String Values with Regex in Column

Let's start with replacing string values in column applicants. As you can see the values in the column are mixed.

There are two options:

Replace single string value

df['applicants'].str.replace(r'\sapplicants', '', regex=True)

The result of this operation will be a Pandas Series:

['49', '35', 'Be an early applicant', '63', '140']

There are some values which we would like to replace too. This will be done in the next section

Replace multiple string value

To replace multiple values with regex in Pandas we can use the following syntax: r'(\sapplicants|Be an early applicant)' - where the values to replaced are separated by pipes - |

df_temp['applicants'].str.replace(r'(\sapplicants|Be an early applicant)', '', regex=True).to_list()

result:

['49', '35', '', '63', '140']

Replace multiple string values with different replacement

Finally for this step let's see how we can replace multiple values with different values. Let say that we would like to replace:

Be an early applicant with 0
49 applicants with 49

in this case we can use the method replace and pass a list of values to be replaced and list of values of the replacement:

df_temp['applicants'].replace([r'(\d+) applicants', 'Be an early applicant'],[r'\1',0], regex=True)

result:

['49', '35', 0, '63', '140']

In the last example we saw using a capture group which we are going to check in the next step.

Step 3: Regex replace with capture group

In this step we will take a deeper look on regex and capture groups in Pandas.

They are powerful tool to match a pattern and extract only part of it.

Let's say that we would like to match : 63 applicants but only extract the numbers. In other words, to search for a numeric sequence followed by anything.

We are capturing two groups but will keep only the first one - the numbers.

df['applicants'].replace(to_replace=r"([0-9,\.]+)(.*)", value=r"\1", regex=True)

the result would be:

['49', '35', 'Be an early applicant', '63', '140']

As you can see the code works as expected in case of a match. Otherwise it will keep the value.

Step 4: Regex replace only special characters

What if we would like to clean or remove all special characters while keeping numbers and letters.

In that case we can use one of the next regex:

r'[^0-9a-zA-Z:,\s]+' - keep numbers, letters, semicolon, comma and space
r'[^0-9a-zA-Z:,]+' - keep numbers, letters, semicolon and comma

So the code looks like:

df['internship'].str.replace(r'[^0-9a-zA-Z:,\s]+', '', regex=True)

Will change: Internet Of Things (IoT) to Internet Of Things IoT

Step 5: Regex replace numbers or non-digit characters

Now let's check how we can** replace all non digit characters and convert the value to int or remove all numbers from a column**.

Replace all non numeric symbols and map in case of missing

In this example we are going to replace everything which is not a number with a regex. In case of a value which doesn't have a number we will map the value to 0.

df_temp['applicants'].str.replace(r'\D+', '', regex=True).replace({'':0}).astype('int')

This is done on three simple steps:

first replace all non numeric symbols - str.replace(r'\D+', '', regex=True)
second - in case of missing numbers - empty string is returned - map the empty string to 0 by .replace({'':0})
convert to numeric column

Replace all numbers from Pandas column

To replace all numbers from a given column you can use the next syntax:

df['applicants'].replace(to_replace=r"\d+", value=r" ", regex=True)

result:

['  applicants',
 '  applicants',
 'Be an early applicant',
 '  applicants',
 '  applicants']

Step 6: Regex replace all values in DataFrame

Finally let's find out how to replace values in the whole DataFrame - all columns and rows - with a single line of code:

To replace all numbers from a DataFrame use method replace - directly on the DataFrame:

df_temp.replace(to_replace=r"\d+", value=r" ", regex=True)

result:

internship	applicants	stipend	duration
Mobile App Development	applicants	- /month	Months
React/React JS	applicants	/month	Months
ReactJS Development	Be an early applicant	/month	Months
Video Online Course Development (Python)	applicants	/month	Month
Internet Of Things (IoT)	applicants	/month	Months

> Basic concepts

> Installations

> Series

> DataFrame

> Create

> Data Types

> Exercise

> Cheat Sheet

> Basic concepts

> Row

> Column

> Index

> MultiIndex

> Exercise

> Basic concepts

> read_csv()

> read_excel()

> Kaggle

> Exercise

> read_xml()

> read_json()

> to_csv()

> to_dict()

> to_json()

> Basic concepts

> groupby()

> Reshape

> melt()

> Exercise

> Pivot

> merge()

> Filter

> Basic concepts

> replace()

> split()

> Regex

> Search

> Exercise

> Find

> Basic concepts

> apply()

> aggfunc

> Convert

> count()

> Other

> Exercise

> map()

> Basic concepts

> Data Validation

> Data Cleaning

> Duplicate

> Time Series

> Pandas Error

> Get

> Basic concepts

> Styling

> Table

> Display

> DataIsBeautiful

> Beginners

> Data Science Projects

> Newsletter

Step 1: Create Sample DataFrame

Step 2: Replace String Values with Regex in Column

Replace single string value

Replace multiple string value

Replace multiple string values with different replacement

Step 3: Regex replace with capture group

Step 4: Regex replace only special characters

Step 5: Regex replace numbers or non-digit characters

Replace all non numeric symbols and map in case of missing

Replace all numbers from Pandas column

Step 6: Regex replace all values in DataFrame

Resources