How to Extract Everything Before or After with Regex in Pandas
To extract everything before a new line (or another delimiter) with regex in Pandas, we can use one of these two patterns:
'(.*?)\n'
'.+?(?=\n)'
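As a quick sanity check, both patterns can be tried on a plain Python string before moving to Pandas. This is only a minimal sketch using the standard `re` module; the sample address is one of the rows from this article's data, and the variable names are illustrative.

```python
import re

# One sample address taken from this article's data
address = '778 Brown Plaza\nNorth Jenniferfurt, VT 88077'

# Capturing group: lazily match any characters, then consume the new line
before_group = re.search('(.*?)\n', address).group(1)

# Lookahead: match up to the new line without consuming it
before_lookahead = re.search('.+?(?=\n)', address).group(0)

print(before_group)
print(before_lookahead)
```

Both searches should return the text in front of the new line. The difference is that the first pattern includes the new line in the overall match, while the lookahead leaves it out.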
Steps to extract everything until/after
Below are the steps I usually follow for regex extraction in Pandas:
- analyse the data from which I will extract
- clean the data
- choose pandas method - split, extract, etc
- define regex pattern
- create new column(s)
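The steps above can be sketched roughly as follows. This is an illustration, not the only way to do it: the two addresses come from this article's sample data, the column names `street` and `city_line` are made up for the example, and `str.split` is used here as one possible choice of method.

```python
import pandas as pd

# Two sample addresses from this article's data
df = pd.DataFrame({'address': [
    '778 Brown Plaza\nNorth Jenniferfurt, VT 88077',
    'PSC 4115, Box 7815\nAPO AA 41945',
]})

# 1. Analyse: check that every row contains the delimiter we rely on
has_newline = df['address'].str.contains('\n').all()

# 2. Clean: remove leading and trailing whitespace
df['address'] = df['address'].str.strip()

# 3-4. Choose a method and a pattern: split once on the first new line
parts = df['address'].str.split(pat='\n', n=1, expand=True)

# 5. Create new columns (hypothetical names chosen for this sketch)
df['street'] = parts[0]
df['city_line'] = parts[1]

print(df[['street', 'city_line']])
```

Splitting once with `n=1` keeps everything after the first new line together, even if a later line break appears in the text.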
Data
Let's create a simple sample DataFrame to be used for regex extraction:
from faker import Faker
import pandas as pd
Faker.seed(0)
fake = Faker()
addr = []
for _ in range(5):
addr.append(fake.address())
df = pd.DataFrame({'address':addr})
 | address |
---|---|
0 | 48764 Howard Forge Apt. 421\nVanessaside, VT 79393 |
1 | PSC 4115, Box 7815\nAPO AA 41945 |
2 | 778 Brown Plaza\nNorth Jenniferfurt, VT 88077 |
3 | 3513 John Divide Suite 115\nRodriguezside, LA 93111 |
4 | 398 Wallace Ranch Suite 593\nIvanburgh, AZ 80818 |
Example 1 - Capturing group and characters
Extract everything in a Pandas column up to the first new line:
df['address'].str.extract('(.*?)\n')
result:
 | 0 |
---|---|
0 | 48764 Howard Forge Apt. 421 |
1 | PSC 4115, Box 7815 |
2 | 778 Brown Plaza |
3 | 3513 John Divide Suite 115 |
4 | 398 Wallace Ranch Suite 593 |
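The result column above is labelled `0` because the capturing group has no name. A possible variation, sketched here under the assumption that a labelled column is wanted, is to use a named group; the name `street` is purely illustrative, and the addresses are taken from this article's sample data.

```python
import pandas as pd

# Two sample addresses from this article's data
df = pd.DataFrame({'address': [
    '48764 Howard Forge Apt. 421\nVanessaside, VT 79393',
    '778 Brown Plaza\nNorth Jenniferfurt, VT 88077',
]})

# Named capturing group: the group name becomes the column name
street = df['address'].str.extract(r'(?P<street>.*?)\n')

print(street)
```

With a named group, `str.extract` returns a DataFrame whose column is called `street` instead of `0`.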
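Example 2 - Non-capturing groups (lookahead)
Extract everything in a Pandas column up to the new line, using a lookahead so that the new line itself is not part of the match. Note that str.extract needs at least one capturing group, so the lazy match is wrapped in parentheses:
df['address'].str.extract('(.+?)(?=\n)')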
result:
 | 0 |
---|---|
0 | 48764 Howard Forge Apt. 421 |
1 | PSC 4115, Box 7815 |
2 | 778 Brown Plaza |
3 | 3513 John Divide Suite 115 |
4 | 398 Wallace Ranch Suite 593 |
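Both examples extract the text before the new line. For the opposite direction, everything after the new line, the same two ideas can be mirrored. The following is a minimal sketch rather than a definitive recipe: it assumes each address contains a single new line, and it uses two rows from this article's sample data.

```python
import pandas as pd

# Two sample addresses from this article's data
df = pd.DataFrame({'address': [
    '778 Brown Plaza\nNorth Jenniferfurt, VT 88077',
    'PSC 4115, Box 7815\nAPO AA 41945',
]})

# Capturing group placed after the new line
after_group = df['address'].str.extract(r'\n(.*)')

# Lookbehind: the match must be preceded by a new line,
# but the new line is not part of the match
after_lookbehind = df['address'].str.extract(r'(?<=\n)(.*)')

print(after_group)
print(after_lookbehind)
```

Because `.` does not match a new line by default, each pattern stops at the end of the line that follows the first line break.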