When working with text data in Pandas, you may need to split strings at the n-th occurrence of a delimiter and extract specific parts.
This is useful for parsing URLs, file paths, or structured data. Pandas provides efficient ways to handle such operations with str.split() and expand=True.
(1) Split on at most n occurrences of the delimiter and keep the first part
df['column_name'].str.split('-', n=2).str[0]
(2) Extract the nth part from a split string
df['column_name'].str.split('-', expand=True)[2]
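As a quick check of what the two one-liners return, here is a minimal sketch on a hypothetical hyphen-delimited Series. The sample values and variable names are made up for illustration only.

```python
import pandas as pd

# Hypothetical sample values, used only to illustrate the two one-liners
s = pd.Series(['a-b-c-d', 'x-y-z'])

# (1) at most 2 splits: 'a-b-c-d' -> ['a', 'b', 'c-d']; .str[0] keeps the first piece
first_part = s.str.split('-', n=2).str[0]

# (2) split on every '-', one column per piece; [2] selects the third piece
third_part = s.str.split('-', expand=True)[2]

print(first_part.tolist())  # ['a', 'x']
print(third_part.tolist())  # ['c', 'z']
```

Note that the index in (2) is zero-based, so [2] returns the third piece.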
1. Sample data
import pandas as pd
data = ['https://example.com/search?q=avatar',
'https://example.com/profile/avatar',
'https://example.com/map']
df = pd.DataFrame({'text': data})
df
The data looks like this:
| | text |
|---|---|
| 0 | https://example.com/search?q=avatar |
| 1 | https://example.com/profile/avatar |
| 2 | https://example.com/map |
2. Splitting a String at the n-th Occurrence
To limit splitting to the first n occurrences of a delimiter, use the n parameter of str.split(). Everything after the n-th delimiter stays together in the last piece, so we can extract the part of the URL that follows the domain with:
df['text'].str.split('/', n=3, expand=True)[3]
Output:
0 search?q=avatar
1 profile/avatar
2 map
Name: 3, dtype: object
n=3 ensures that at most 3 splits occur.
[3] selects the column with index 3, which is the fourth piece of the split.
Below is the DataFrame that results from the split:
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | https: | | example.com | search?q=avatar |
| 1 | https: | | example.com | profile/avatar |
| 2 | https: | | example.com | map |
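Note that n limits the number of splits rather than cutting the string into exactly two parts. If the goal is a single cut at the n-th delimiter, one possible approach is to glue the first n pieces back together. The following is a sketch under that assumption: split_at_nth is a hypothetical helper name, not a pandas function, and sep is assumed to be a single character (longer patterns are treated as regular expressions by str.split()).

```python
import pandas as pd

def split_at_nth(series, sep, n):
    """Cut each string once, at the n-th occurrence of sep (hypothetical helper).

    Returns a DataFrame with the text before and after that occurrence.
    """
    pieces = series.str.split(sep, n=n)    # lists with at most n + 1 pieces
    head = pieces.str[:n].str.join(sep)    # first n pieces joined back together
    tail = pieces.str.get(n)               # remainder; missing if fewer than n delimiters
    return pd.DataFrame({'head': head, 'tail': tail})

urls = pd.Series(['https://example.com/search?q=avatar',
                  'https://example.com/profile/avatar',
                  'https://example.com/map'])

result = split_at_nth(urls, '/', 3)
print(result['head'].tolist())  # ['https://example.com', 'https://example.com', 'https://example.com']
print(result['tail'].tolist())  # ['search?q=avatar', 'profile/avatar', 'map']
```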
3. Extracting the nth Element from a Split String
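If you need to extract the nth part of the split string, use expand=True to create multiple columns. The number of column names on the left must match the number of columns the split produces (five here, because the longest URL has five pieces); shorter rows are padded with None.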
df[['protocol', 'empty', 'domain', 'method', 'param']] = df['text'].str.split('/', expand=True)
df
Output:
| | text | protocol | empty | domain | method | param |
|---|---|---|---|---|---|---|
| 0 | https://example.com/search?q=avatar | https: | | example.com | search?q=avatar | None |
| 1 | https://example.com/profile/avatar | https: | | example.com | profile | avatar |
| 2 | https://example.com/map | https: | | example.com | map | None |
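Assigning to a fixed list of five column names only works when the split yields exactly five columns. Here is a minimal sketch of two alternatives, assuming the same sample URLs: let pandas number the columns and add a prefix, or pull out a single piece with str.get() without expanding. The part_ prefix and the variable names are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({'text': ['https://example.com/search?q=avatar',
                            'https://example.com/profile/avatar',
                            'https://example.com/map']})

# Let pandas number the columns (0, 1, 2, ...) and add a readable prefix,
# so the code does not depend on how many pieces the longest row has
parts = df['text'].str.split('/', expand=True).add_prefix('part_')
print(parts.columns.tolist())  # ['part_0', 'part_1', 'part_2', 'part_3', 'part_4']

# Pull out one piece by position without creating extra columns;
# rows that have no piece at that position get a missing value
fourth = df['text'].str.split('/').str.get(3)
fifth = df['text'].str.split('/').str.get(4)
print(fourth.tolist())        # ['search?q=avatar', 'profile', 'map']
print(fifth.isna().tolist())  # [True, False, True]
```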
4. Keeping Only the Last Two Parts of a Split String
For cases like domain extraction (example.com, www.example.com), you may want to keep only the last two parts. The same approach keeps the last two segments of a URL, which for the shorter URLs here are the domain and the method:
df['text'].apply(lambda x: '/'.join(x.split('/')[-2:]))
Output:
0 example.com/search?q=avatar
1 profile/avatar
2 example.com/map
Name: text, dtype: object
x.split('/')[-2:] keeps only the last two elements.
'/'.join(...) reconstructs the truncated string.
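The same result can be obtained without apply() by splitting from the right. This is a sketch of an alternative based on str.rsplit(), not the approach used above; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'text': ['https://example.com/search?q=avatar',
                            'https://example.com/profile/avatar',
                            'https://example.com/map']})

# Split from the right at most twice, keep the last two pieces, and rejoin them
last_two = df['text'].str.rsplit('/', n=2).str[-2:].str.join('/')

# Reference result from the apply() version, for comparison
via_apply = df['text'].apply(lambda x: '/'.join(x.split('/')[-2:]))

print(last_two.tolist())
# ['example.com/search?q=avatar', 'profile/avatar', 'example.com/map']
```

Limiting the split to n=2 from the right avoids splitting the whole string when only its tail is needed.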
5. Conclusion
Pandas provides multiple ways to split strings at the nth occurrence of a delimiter. Whether you need to keep a portion of the string, extract a specific element, or retain only the last few parts, str.split() and apply() are effective tools for data transformation.