In this quick tutorial, we'll show how to replace values with regex in Pandas DataFrame. There are several options to replace a value in a column or the whole DataFrame with regex:
- Regex replace string
df['applicants'].str.replace(r'\sapplicants', '')
- Regex replace capture group
df['applicants'].replace(to_replace=r"([0-9,\.]+)(.*)", value=r"\1", regex=True)
- Regex replace special characters -
r'[^0-9a-zA-Z:,\s]+'
- including spaces
df['internship'].str.replace(r'[^0-9a-zA-Z:,]+', '')
- Regex replace numbers or non-digit characters
df['applicants'].str.replace(r'\D+', '')
In the next part of the post, you'll see the steps and practical examples on how to use regex and replace in Pandas.
Step 1: Create Sample DataFrame
First, let's create a sample 'dirty' data which needs to be cleaned and replaced:
import pandas as pd
df = pd.read_csv(f'../data/internshala_dataset_raw.csv')
df
internship | applicants | stipend | duration |
---|---|---|---|
Mobile App Development | 49 applicants | 15000-25000 /month | 6 Months |
React/React JS | 35 applicants | 5000 /month | 4 Months |
ReactJS Development | Be an early applicant | 5000 /month | 6 Months |
Video Online Course Development (Python) | 63 applicants | 1000 /month | 1 Month |
Internet Of Things (IoT) | 140 applicants | 6000 /month | 6 Months |
Data is real and available from Kaggle: Internship posted in computer science category.
To learn more about reading Kaggle data with Python and Pandas: How to Search and Download Kaggle Dataset to Pandas DataFrame
Step 2: Replace String Values with Regex in Column
Let's start with replacing string values in column applicants
. As you can see the values in the column are mixed.
There are two options:
Replace single string value
df['applicants'].str.replace(r'\sapplicants', '', regex=True)
The result of this operation will be a Pandas Series:
['49', '35', 'Be an early applicant', '63', '140']
There are some values which we would like to replace too. This will be done in the next section
Replace multiple string value
To replace multiple values with regex in Pandas we can use the following syntax: r'(\sapplicants|Be an early applicant)'
- where the values to replaced are separated by pipes - |
df_temp['applicants'].str.replace(r'(\sapplicants|Be an early applicant)', '', regex=True).to_list()
result:
['49', '35', '', '63', '140']
Replace multiple string values with different replacement
Finally for this step let's see how we can replace multiple values with different values. Let say that we would like to replace:
Be an early applicant
with 049 applicants
with 49
in this case we can use the method replace
and pass a list of values to be replaced and list of values of the replacement:
df_temp['applicants'].replace([r'(\d+) applicants', 'Be an early applicant'],[r'\1',0], regex=True)
result:
['49', '35', 0, '63', '140']
In the last example we saw using a capture group which we are going to check in the next step.
Step 3: Regex replace with capture group
In this step we will take a deeper look on regex and capture groups in Pandas.
They are powerful tool to match a pattern and extract only part of it.
Let's say that we would like to match : 63 applicants
but only extract the numbers. In other words, to search for a numeric sequence followed by anything.
We are capturing two groups but will keep only the first one - the numbers.
df['applicants'].replace(to_replace=r"([0-9,\.]+)(.*)", value=r"\1", regex=True)
the result would be:
['49', '35', 'Be an early applicant', '63', '140']
As you can see the code works as expected in case of a match. Otherwise it will keep the value.
Step 4: Regex replace only special characters
What if we would like to clean or remove all special characters while keeping numbers and letters.
In that case we can use one of the next regex:
r'[^0-9a-zA-Z:,\s]+'
- keep numbers, letters, semicolon, comma and spacer'[^0-9a-zA-Z:,]+'
- keep numbers, letters, semicolon and comma
So the code looks like:
df['internship'].str.replace(r'[^0-9a-zA-Z:,\s]+', '', regex=True)
Will change: Internet Of Things (IoT)
to Internet Of Things IoT
Step 5: Regex replace numbers or non-digit characters
Now let's check how we can** replace all non digit characters and convert the value to int or remove all numbers from a column**.
Replace all non numeric symbols and map in case of missing
In this example we are going to replace everything which is not a number with a regex. In case of a value which doesn't have a number we will map the value to 0.
df_temp['applicants'].str.replace(r'\D+', '', regex=True).replace({'':0}).astype('int')
This is done on three simple steps:
- first replace all non numeric symbols -
str.replace(r'\D+', '', regex=True)
- second - in case of missing numbers - empty string is returned - map the empty string to 0 by
.replace({'':0})
- convert to numeric column
Replace all numbers from Pandas column
To replace all numbers from a given column you can use the next syntax:
df['applicants'].replace(to_replace=r"\d+", value=r" ", regex=True)
result:
[' applicants',
' applicants',
'Be an early applicant',
' applicants',
' applicants']
Step 6: Regex replace all values in DataFrame
Finally let's find out how to replace values in the whole DataFrame - all columns and rows - with a single line of code:
To replace all numbers from a DataFrame use method replace
- directly on the DataFrame:
df_temp.replace(to_replace=r"\d+", value=r" ", regex=True)
result:
internship | applicants | stipend | duration |
---|---|---|---|
Mobile App Development | applicants | - /month | Months |
React/React JS | applicants | /month | Months |
ReactJS Development | Be an early applicant | /month | Months |
Video Online Course Development (Python) | applicants | /month | Month |
Internet Of Things (IoT) | applicants | /month | Months |