How to Compare Titles and URLs in Pandas
In this tutorial, I'll show you how to compare titles and URLs with Pandas and Python. We are going to use several Pandas functions and techniques to find problematic data.
To start, here is the initial data and our final goal:
title | url |
---|---|
A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model | https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92 |
Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric | https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8 |
How to Use ggplot2 in Python | https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129 |
Databricks: How to Save Files in CSV on Your Local Computer | https://towardsdatascience.com/databricks-how-to-save-files-in-csv-on-your-local-computer-3d0c70e6a9ab |
A Step-by-Step Implementation of Gradient Descent and Backpropagation | https://towardsdatascience.com/a-step-by-step-implementation-of-gradient-descent-and-backpropagation-d58bda486110 |
Usually the slug of the URL is derived from the title after several transformations.
Below is an example pair of URL and title:
- https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92
- A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model
Our goal is to find all articles which have a mismatch between the title and the slug, like:
- Faster Training for Efficient CNNs
- faster-training-for-efficient-cnns
- https://towardsdatascience.com/faster-training-of-efficient-cnns-657953aa080
The difference above is "for" in the title vs "of" in the URL.
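This mismatch is visible with a plain substring test; here is a minimal sketch using the example pair above:

```python
# Slugified title vs. the actual URL from the example above
slug = 'faster-training-for-efficient-cnns'
url = 'https://towardsdatascience.com/faster-training-of-efficient-cnns-657953aa080'

# The title says "for" but the URL says "of", so the substring test fails
print(slug in url)  # False
```

This simple check is the core idea behind the comparison we build in the steps below.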
In the next section, I’ll review the steps to compare and validate the title against the URL of a given article.
Step 1: Read the data and install slugify
In this tutorial we are going to use a dataset available from Kaggle. You can also find it in the repository related to the notebook (in Resources):
import pandas as pd
df = pd.read_csv('../data/medium_data.csv.zip')
df
If you'd like to learn more about how to read a Kaggle dataset as a Pandas DataFrame, check this article: How to Search and Download Kaggle Dataset to Pandas DataFrame
We also need to install the Python library python-slugify (which provides the slugify module):
pip install python-slugify
Step 2: Convert title to slug with slugify
The next step is to prepare the titles so that we can compare them with the URLs. For this purpose we are going to use the library slugify:
from slugify import slugify
df['title_to_url'] = df['title'].fillna('').apply(lambda x: slugify(x))
After this operation we will have the following data:
title | url | title_to_url |
---|---|---|
A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model | https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92 | a-beginners-guide-to-word-embedding-with-gensim-word2vec-model |
Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric | https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8 | hands-on-graph-neural-networks-with-pytorch-pytorch-geometric |
How to Use ggplot2 in Python | https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129 | how-to-use-ggplot2-in-python |
Databricks: How to Save Files in CSV on Your Local Computer | https://towardsdatascience.com/databricks-how-to-save-files-in-csv-on-your-local-computer-3d0c70e6a9ab | databricks-how-to-save-files-in-csv-on-your-local-computer |
A Step-by-Step Implementation of Gradient Descent and Backpropagation | https://towardsdatascience.com/a-step-by-step-implementation-of-gradient-descent-and-backpropagation-d58bda486110 | a-step-by-step-implementation-of-gradient-descent-and-backpropagation |
Step 3: Find all rows where the slugified title is not in the URL
In this step we are going to compare the slugified titles and the URLs. This is a fairly generic comparison, so it may miss some mismatches.
We are going to create a new column, bad_title, where all values are initially set to False. Only if the slugified title is not part of the URL do we set it to True:
df['bad_title'] = False
for idx, row in df[df.title.notna()].iterrows():
    # compare the lower-cased slugified title against the URL
    if row.title_to_url.lower() not in row.url:
        df.loc[idx, 'bad_title'] = True
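The same check can also be written without an explicit loop. This is a sketch of a vectorized alternative, shown on a small sample DataFrame with the same columns as in the tutorial:

```python
import pandas as pd

# Small sample with the same columns as in the tutorial
df = pd.DataFrame({
    'title_to_url': ['faster-training-for-efficient-cnns',
                     'how-to-use-ggplot2-in-python'],
    'url': ['https://towardsdatascience.com/faster-training-of-efficient-cnns-657953aa080',
            'https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129'],
})

# Row-wise substring test: True marks a suspicious title/URL pair
df['bad_title'] = [slug.lower() not in url
                   for slug, url in zip(df['title_to_url'], df['url'])]
print(df['bad_title'].tolist())  # [True, False]
```

Avoiding iterrows tends to be faster on larger frames, and the logic stays identical to the loop above.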
In total there are 1,051 results like:
title | url | title_to_url |
---|---|---|
<em class="markup--em markup--h3-em">What I Learned from (Two-time) Kaggle Grandmaster Abhishek Thakur</em> | https://towardsdatascience.com/what-i-learned-from-abhishek-thakur-4b905ac0fd55 | em-class-markup-em-markup-h3-em-what-i-learned-from-two-time-kaggle-grandmaster-abhishek-thakur-em |
Faster Training for Efficient CNNs | https://towardsdatascience.com/faster-training-of-efficient-cnns-657953aa080 | faster-training-for-efficient-cnns |
Buyers beware, fake product reviews are plaguing the internet. How Machine Learning can help to spot them. | https://towardsdatascience.com/buyers-beware-fake-product-reviews-are-plaguing-the-internet-cfc599c42b6b | buyers-beware-fake-product-reviews-are-plaguing-the-internet-how-machine-learning-can-help-to-spot-them |
You can use different techniques to get better results, like:
- stripping hyphens from both sides
- converting to lower case
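For example, a small hypothetical helper could apply both normalizations before the substring comparison:

```python
def normalize_slug(slug):
    # Hypothetical helper: lower-case and strip leading/trailing hyphens
    # before comparing against the URL
    return slug.lower().strip('-')

print(normalize_slug('-Faster-Training-For-Efficient-CNNs-'))
# faster-training-for-efficient-cnns
```

The function name and exact normalizations are just one possible choice; adjust them to whatever noise appears in your data.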
Step 4: Extract the slug from the URL and compare to title
In this step we are going to extract the slug from the URL. This is not a simple operation and might depend on many factors.
We can't reliably strip the domain or extract the slug with a simple split:
df['url'].str.split('/', expand=True)
Result:
0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|
https: | medium.com | swlh | how-to-go-from-zero-to-hero-through-personal-branding-e5edab472dfa | |
https: | medium.com | swlh | five-augmented-reality-uses-that-solve-real-life-problems-9e7ff77be858 | |
https: | towardsdatascience.com | everything-you-need-to-know-about-autoencoders-in-tensorflow-b6a63e8255f0 | None | |
https: | medium.com | datadriveninvestor | electric-company-cars-get-a-tax-incentive-in-germany-a0c207da0bb9 | |
https: | uxdesign.cc | ux-is-not-a-role-it-is-a-title-b072aff1c98c | None |
As you can see there are multiple domains and formats.
What we are going to do is extract the slug by chaining several Pandas operations:
df['url'].str.rsplit('/').str[-1].str.rsplit('-', n=1, expand=True)[0]
**How does it work?**
Starting with the URL:
https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129
we use the method rsplit to split on '/':
['https:', '', 'towardsdatascience.com', 'how-to-use-ggplot2-in-python-74ab8adec129']
Then we add str[-1] to get the part after the last slash:
how-to-use-ggplot2-in-python-74ab8adec129
To get rid of the final identifier, -74ab8adec129, we use rsplit again: .str.rsplit('-', n=1, expand=True)[0]
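Put together, the chain behaves like this on a single example URL (a one-row Series for illustration):

```python
import pandas as pd

urls = pd.Series(['https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129'])

# Split on '/', keep the last part, then cut off the trailing identifier
slugs = urls.str.rsplit('/').str[-1].str.rsplit('-', n=1, expand=True)[0]
print(slugs[0])  # how-to-use-ggplot2-in-python
```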
Now we can compare the slug from the URL with the modified title:
df[df['url'].str.rsplit('/').str[-1].str.rsplit('-', n=1, expand=True)[0] != df['title_to_url']]
The result is pretty similar to the one from Step 3, but with more rows: 1,119.
There is a difference between the two methods, though. This step is more accurate and will also report cases like:
- Title: Living as an Empath — w
- URL: https://medium.com/swlh/living-as-an-empath-when-you-feel-everything-and-nobody-seems-to-understand-65fecce2aa03
- title_to_url: living-as-an-empath-w
while Step 3 misses this result, since the title_to_url is a substring of the URL.
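If the split-based extraction proves too fragile, str.extract with a regular expression is another option. This sketch assumes the trailing identifier is always a lowercase hex string, which holds for the examples above but is an assumption worth checking on the full dataset:

```python
import pandas as pd

urls = pd.Series([
    'https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129',
    'https://medium.com/swlh/living-as-an-empath-when-you-feel-everything-and-nobody-seems-to-understand-65fecce2aa03',
])

# Capture everything between the last '/' and a trailing hex identifier
slugs = urls.str.extract(r'/([^/]+)-[0-9a-f]+$')[0]
print(slugs.tolist())
```

URLs that do not end in a hex identifier would come back as NaN here, which is itself a useful signal for odd rows.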