How to Extract Domain from URL in Pandas
In this short guide, I'll show you how to extract the domain from a URL column in a Pandas DataFrame. You will also find how to extract the netloc, scheme, path, and params. So at the end you will turn:
['https://www.datascientyst.com/cheatsheet','https://www.softhints.com/python']
into:
0 (https, www.datascientyst.com, /cheatsheet, , , )
1 (https, www.softhints.com, /python, , , )
Name: urls, dtype: object
or, extracting only the domains:
0 www.datascientyst.com
1 www.softhints.com
Name: urls, dtype: object
Setup
Let's create a DataFrame with a URL column from which we will extract a list of domains:
import pandas as pd
data = {'urls': ['https://www.datascientyst.com/cheatsheet','https://www.softhints.com/python']}
df = pd.DataFrame(data)
DataFrame looks like:
| | urls |
|---|---|
| 0 | https://www.datascientyst.com/cheatsheet |
| 1 | https://www.softhints.com/python |
Step 1: Extract domain from URL - urlparse
The first way to extract the domain from a URL in Python is with the urlparse function from the urllib.parse library:
from urllib.parse import urlparse
df['urls'].apply(urlparse)
This will extract all URL information as a Series of tuples:
0 (https, www.datascientyst.com, /cheatsheet, , , )
1 (https, www.softhints.com, /python, , , )
Name: urls, dtype: object
To extract only the netloc, i.e. the domain, we can use:
df['urls'].apply(lambda x: urlparse(x)[1])
This gives the extracted netlocs from the URL column:
0 www.datascientyst.com
1 www.softhints.com
Name: urls, dtype: object
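If you prefer named attributes over positional indexing, the ParseResult returned by urlparse also exposes a netloc attribute; this is equivalent to indexing with [1]:
from urllib.parse import urlparse
# access the netloc by attribute name instead of by position
df['urls'].apply(lambda x: urlparse(x).netloc)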
We can also keep the full ParseResult object returned by urlparse:
df['paths'] = df['urls'].apply(lambda x: urlparse(x))
df['paths'].to_dict()
The result is the full URL information:
{0: ParseResult(scheme='https', netloc='www.datascientyst.com', path='/cheatsheet', params='', query='', fragment=''),
1: ParseResult(scheme='https', netloc='www.softhints.com', path='/python', params='', query='', fragment='')}
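If you would like each URL component as its own DataFrame column, one small sketch is to expand the ParseResult tuples into a new DataFrame (the variable name df_parts and the column names below are my own choice):
parts = df['urls'].apply(urlparse)
# each ParseResult is a 6-tuple: scheme, netloc, path, params, query, fragment
df_parts = pd.DataFrame(parts.tolist(),
                        columns=['scheme', 'netloc', 'path', 'params', 'query', 'fragment'])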
Step 2: Extract domain from URL - regex
We can use regular expressions to extract patterns from the URL column. Pandas offers the .str.extract method:
df['urls'].str.extract(r'https://(.*)/')
This would extract the domains plus the subdomains, if any:
| | 0 |
|---|---|
| 0 | www.datascientyst.com |
| 1 | www.softhints.com |
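If the URLs may use http as well as https, or have no trailing path at all, a slightly more general pattern (a variation that is not part of the original example) captures everything between :// and the first /:
# match http or https and stop at the first '/' after the domain
df['urls'].str.extract(r'https?://([^/]+)')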
Or we can match up to a symbol without capturing the symbol itself, by using a non-capturing group. Since .* is greedy, in this case we capture everything up to the last occurrence of the character t:
df['urls'].str.extract(r'(https.*)(?:t)').head()
result:
| | 0 |
|---|---|
| 0 | https://www.datascientyst.com/cheatshee |
| 1 | https://www.softhints.com/py |
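To keep the result, we can assign the extracted value to a new column and, if needed, drop a leading www. prefix (the column name domain is just an example):
# extract the domain into a new column, ignoring an optional 'www.' prefix
df['domain'] = df['urls'].str.extract(r'https?://(?:www\.)?([^/]+)')[0]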
Conclusion
We saw two different ways to parse URL information with Python and Pandas.
We can extract information from a Pandas DataFrame such as (see the short sketch after the list):
- scheme='https'
- netloc='www.datascientyst.com'
- path='/cheatsheet'
- params=''
- query=''
- fragment=''
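For reference, these fields map directly to the attributes of a single ParseResult; a quick sketch for one of the URLs above:
p = urlparse('https://www.datascientyst.com/cheatsheet')
p.scheme   # 'https'
p.netloc   # 'www.datascientyst.com'
p.path     # '/cheatsheet'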