How to Extract Domain from URL in Pandas

In this short guide, I'll show you how to extract the domain from a URL column in a Pandas DataFrame. You will also see how to extract the other URL components: scheme, netloc, path, params, query and fragment. We will go from:

['https://www.datascientyst.com/cheatsheet','https://www.softhints.com/python']

to:

0    (https, www.datascientyst.com, /cheatsheet, , , )
1            (https, www.softhints.com, /python, , , )
Name: urls, dtype: object

or extracting only domains:

0    www.datascientyst.com
1        www.softhints.com
Name: urls, dtype: object

Setup

Let's create a DataFrame with a URL column from which we will extract the domains:

import pandas as pd

data = {'urls': ['https://www.datascientyst.com/cheatsheet','https://www.softhints.com/python']}

df = pd.DataFrame(data)

The DataFrame looks like:

	urls
0	https://www.datascientyst.com/cheatsheet
1	https://www.softhints.com/python

Step 1: Extract domain from URL - urlparse

The first way to extract the domain from a URL in Python is the urlparse function from the standard library module urllib.parse:

from urllib.parse import urlparse

df['urls'].apply(urlparse)

This parses each URL and returns a Series of ParseResult named tuples:

0    (https, www.datascientyst.com, /cheatsheet, , , )
1            (https, www.softhints.com, /python, , , )
Name: urls, dtype: object

To extract only the netloc (the domain, including any subdomain) we can use:

df['urls'].apply(lambda x: urlparse(x)[1])

The extracted netlocs from the URL column:

0    www.datascientyst.com
1        www.softhints.com
Name: urls, dtype: object
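
The positional index [1] works, but the same value is also available through named attributes of the parsed result. The sketch below is a minimal variation on the step above; the names netlocs, hosts and domains are chosen here for illustration, and removing the leading "www." is an assumption about what you want the final domain to look like.

```python
import pandas as pd
from urllib.parse import urlparse

df = pd.DataFrame({'urls': ['https://www.datascientyst.com/cheatsheet',
                            'https://www.softhints.com/python']})

# Attribute access reads better than the positional index [1]
netlocs = df['urls'].apply(lambda url: urlparse(url).netloc)

# hostname is lower-cased and leaves out any port or credentials
hosts = df['urls'].apply(lambda url: urlparse(url).hostname)

# Assumption: a leading "www." is not wanted in the final domain
domains = hosts.str.replace(r'^www\.', '', regex=True)

print(domains.tolist())  # ['datascientyst.com', 'softhints.com']
```

If the URLs may contain a port, such as example.com:8080, hostname is usually the safer choice because netloc keeps the port.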

We can store the full ParseResult returned by urlparse in a new column:

df['paths'] = df['urls'].apply(lambda x: urlparse(x))
df['paths'].to_dict()

The result is the full URL information:

{0: ParseResult(scheme='https', netloc='www.datascientyst.com', path='/cheatsheet', params='', query='', fragment=''),
 1: ParseResult(scheme='https', netloc='www.softhints.com', path='/python', params='', query='', fragment='')}
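
A column of ParseResult objects is awkward to filter or group on, so it can help to spread the six components into separate columns. The following is one possible sketch, not the only way to do it; the names parsed, parts and result are illustrative.

```python
import pandas as pd
from urllib.parse import urlparse, ParseResult

df = pd.DataFrame({'urls': ['https://www.datascientyst.com/cheatsheet',
                            'https://www.softhints.com/python']})

# Parse every URL once
parsed = df['urls'].apply(urlparse)

# ParseResult is a named tuple, so its field names can serve as column names
parts = pd.DataFrame(parsed.tolist(),
                     columns=list(ParseResult._fields),
                     index=df.index)

# Put the components next to the original URLs
result = df.join(parts)
print(result[['urls', 'scheme', 'netloc', 'path']])
```

Each component is now a regular string column, so for example result['netloc'].value_counts() would count the URLs per domain.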

Step 2: Extract domain from URL - regex

We can also use a regular expression to extract patterns from the URL column. Pandas offers the .str.extract method. The pattern below captures everything between the scheme and the first slash of the path:

df['urls'].str.extract(r'https?://([^/]+)')

This extracts the domain plus the subdomain, if any:

	0
0	www.datascientyst.com
1	www.softhints.com
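
When writing such a pattern, keep in mind that .* is greedy. A pattern like https://(.*)/ gives the expected result only when the path has a single segment. The sketch below compares a greedy pattern with one that stops at the first slash; the example.com URL is hypothetical and is there only to show a multi-segment path.

```python
import pandas as pd

urls = pd.Series(['http://example.com/a/b/c',          # hypothetical URL
                  'https://www.softhints.com/python'])

# Greedy: (.*) runs up to the LAST slash, so part of the path leaks in
greedy = urls.str.extract(r'https?://(.*)/', expand=False)

# [^/]+ stops at the first slash after the scheme
strict = urls.str.extract(r'https?://([^/]+)', expand=False)

print(greedy.tolist())  # ['example.com/a/b', 'www.softhints.com']
print(strict.tolist())  # ['example.com', 'www.softhints.com']
```

With expand=False and a single capture group, .str.extract returns a Series instead of a one-column DataFrame.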

Or we can match up to a given character without capturing it. Because .* is greedy, the pattern below captures everything up to the last t in each URL:

df['urls'].str.extract(r'(https.*)(?:t)').head()

result:

	0
0	https://www.datascientyst.com/cheatshee
1	https://www.softhints.com/py
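
To stop at the first t after the https prefix instead of the last one, the quantifier can be made lazy with .*?. This is a small sketch comparing the two behaviours on the same data; the names greedy and lazy are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'urls': ['https://www.datascientyst.com/cheatsheet',
                            'https://www.softhints.com/python']})

# Greedy .* : capture runs up to the last "t" in the string
greedy = df['urls'].str.extract(r'(https.*)(?:t)', expand=False)

# Lazy .*? : capture stops before the first "t" that follows "https"
lazy = df['urls'].str.extract(r'(https.*?)(?:t)', expand=False)

print(greedy.tolist())
# ['https://www.datascientyst.com/cheatshee', 'https://www.softhints.com/py']
print(lazy.tolist())
# ['https://www.da', 'https://www.sof']
```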

Conclusion

We saw two different ways to parse URL information with Python and Pandas: urlparse and regular expressions.

From a URL column in a Pandas DataFrame we can extract components such as:

scheme='https'
netloc='www.datascientyst.com'
path='/cheatsheet'
params=''
query=''
fragment=''
