In this tutorial, I'll show you how to compare titles and URLs with Pandas and Python. We are going to use several Pandas functions and techniques in order to find problematic data.

To start, here is the initial data, followed by our final goal:

| title | url |
| --- | --- |
| A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model | https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92 |
| Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric | https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8 |
| How to Use ggplot2 in Python | https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129 |
| Databricks: How to Save Files in CSV on Your Local Computer | https://towardsdatascience.com/databricks-how-to-save-files-in-csv-on-your-local-computer-3d0c70e6a9ab |
| A Step-by-Step Implementation of Gradient Descent and Backpropagation | https://towardsdatascience.com/a-step-by-step-implementation-of-gradient-descent-and-backpropagation-d58bda486110 |

Usually the slug of the URL is composed from the title after several transformations.
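To make "several transformations" concrete, here is a minimal sketch of how a title could be turned into a slug using only the standard library. The function name simple_slug is hypothetical and this is a simplification, not the implementation used by the slugify library or by any publishing platform:

```python
import re
import unicodedata


def simple_slug(title: str) -> str:
    """Hypothetical, simplified slug builder (illustration only)."""
    # Drop straight and curly apostrophes so "Beginner’s" becomes "Beginners"
    text = re.sub(r"['\u2019]", "", title)
    # Fold accented characters to their closest ASCII form
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen
    text = re.sub(r"[^a-z0-9]+", "-", text.lower())
    # Remove hyphens left over at the start or the end
    return text.strip("-")


print(simple_slug("How to Use ggplot2 in Python"))
# how-to-use-ggplot2-in-python
```

Real slug generators handle many more edge cases, which is why the rest of the tutorial relies on a dedicated library.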

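Below is an example of a title and URL pair that match:

  • Title: How to Use ggplot2 in Python
  • URL: https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129

Our goal is to find all articles that have a mismatch between the title and the slug, like:

  • Title: Faster Training for Efficient CNNs
  • URL: https://towardsdatascience.com/faster-training-of-efficient-cnns-657953aa080

The difference above is "for" vs "of".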

In the next section, I’ll review the steps to compare and validate the title against the URL of a given article.

Step 1: Read the data and install slugify

In this tutorial we are going to use data available from Kaggle. You can also find it in the repository related to the notebook (in Resources):

import pandas as pd
df = pd.read_csv('../data/medium_data.csv.zip')
df

If you would like to learn more about how to read a Kaggle dataset as a Pandas DataFrame, check this article: How to Search and Download Kaggle Dataset to Pandas DataFrame

We need to install the python-slugify library, which provides the slugify function imported in the next step:

pip install python-slugify

Step 2: Convert title to slug with slugify

The next step is to prepare the titles so that we can compare them with the URLs. For this purpose we are going to use the slugify library:

from slugify import slugify

# Missing titles become empty strings so that slugify always receives a str
df['title_to_url'] = df['title'].fillna('').apply(slugify)

After this operation we have the following data:

| title | url | title_to_url |
| --- | --- | --- |
| A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model | https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92 | a-beginners-guide-to-word-embedding-with-gensim-word2vec-model |
| Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric | https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8 | hands-on-graph-neural-networks-with-pytorch-pytorch-geometric |
| How to Use ggplot2 in Python | https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129 | how-to-use-ggplot2-in-python |
| Databricks: How to Save Files in CSV on Your Local Computer | https://towardsdatascience.com/databricks-how-to-save-files-in-csv-on-your-local-computer-3d0c70e6a9ab | databricks-how-to-save-files-in-csv-on-your-local-computer |
| A Step-by-Step Implementation of Gradient Descent and Backpropagation | https://towardsdatascience.com/a-step-by-step-implementation-of-gradient-descent-and-backpropagation-d58bda486110 | a-step-by-step-implementation-of-gradient-descent-and-backpropagation |

Step 3: Find all rows where the slugified title is not in the URL

In this step we are going to compare the slugified titles and the URLs. This is a fairly coarse comparison, so we might miss some real mismatches.

We are going to create a new column, bad_title, where all values are initially set to False.

Only if the slugified title is not part of the URL do we set it to True:

df['bad_title'] = False

# Flag every row whose slugified title does not appear anywhere in the URL
for idx, row in df[df['title'].notna()].iterrows():
    # slugify already lowercases its output, so no extra normalization is needed
    if row['title_to_url'] not in row['url']:
        df.loc[idx, 'bad_title'] = True
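The same check can also be written without iterrows. The following is a minimal sketch under the assumption that the url and title_to_url columns contain only strings; the small DataFrame is a stand-in built from rows shown in this article, not the full dataset:

```python
import pandas as pd

# Tiny stand-in for the tutorial's DataFrame (rows taken from this article)
df = pd.DataFrame({
    "url": [
        "https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129",
        "https://towardsdatascience.com/faster-training-of-efficient-cnns-657953aa080",
    ],
    "title_to_url": [
        "how-to-use-ggplot2-in-python",
        "faster-training-for-efficient-cnns",
    ],
})

# Row-wise substring test, built in one pass instead of repeated df.loc writes
df["bad_title"] = [
    slug not in url for slug, url in zip(df["title_to_url"], df["url"])
]

print(df["bad_title"].tolist())
# [False, True]
```

On a dataset of this size the loop is fine too; the list comprehension mainly avoids the per-row overhead of iterrows.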

In total there are 1051 results, such as:

| title | url | title_to_url |
| --- | --- | --- |
| <em class="markup--em markup--h3-em">What I Learned from (Two-time) Kaggle Grandmaster Abhishek Thakur</em> | https://towardsdatascience.com/what-i-learned-from-abhishek-thakur-4b905ac0fd55 | em-class-markup-em-markup-h3-em-what-i-learned-from-two-time-kaggle-grandmaster-abhishek-thakur-em |
| Faster Training for Efficient CNNs | https://towardsdatascience.com/faster-training-of-efficient-cnns-657953aa080 | faster-training-for-efficient-cnns |
| Buyers beware, fake product reviews are plaguing the internet. How Machine Learning can help to spot them. | https://towardsdatascience.com/buyers-beware-fake-product-reviews-are-plaguing-the-internet-cfc599c42b6b | buyers-beware-fake-product-reviews-are-plaguing-the-internet-how-machine-learning-can-help-to-spot-them |

You can use different techniques to make the comparison more tolerant, for example:

  • remove hyphens from both the slugified title and the URL
  • convert both to lower case
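The two techniques above can be sketched as a small helper. The names normalize and slug_in_url are hypothetical, and the sample slug and URL are made up purely for illustration:

```python
def normalize(text: str) -> str:
    """Hypothetical helper: lowercase the text and drop all hyphens."""
    return text.lower().replace("-", "")


def slug_in_url(slug: str, url: str) -> bool:
    """Return True if the slug appears in the URL after normalizing both sides."""
    return normalize(slug) in normalize(url)


# Made-up pair that differs only in letter case and hyphen placement
slug = "e-mail-marketing-tips"
url = "https://example.com/Email-Marketing-Tips-abc123"

print(slug in url)             # False: the raw substring test fails
print(slug_in_url(slug, url))  # True: the normalized test succeeds
```

Keep in mind that removing hyphens also removes word boundaries, so this looser comparison can hide some genuine differences.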

Step 4: Extract the slug from the URL and compare to title

In this step we are going to extract the slug from the URL. This is not a simple operation and might depend on many factors.

We can't replace the domain or extract the slug with a simple split, because the number of path segments differs between URLs:

df['url'].str.split('/', expand = True)

Result:

| 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- |
| https: | | medium.com | swlh | how-to-go-from-zero-to-hero-through-personal-branding-e5edab472dfa |
| https: | | medium.com | swlh | five-augmented-reality-uses-that-solve-real-life-problems-9e7ff77be858 |
| https: | | towardsdatascience.com | everything-you-need-to-know-about-autoencoders-in-tensorflow-b6a63e8255f0 | None |
| https: | | medium.com | datadriveninvestor | electric-company-cars-get-a-tax-incentive-in-germany-a0c207da0bb9 |
| https: | | uxdesign.cc | ux-is-not-a-role-it-is-a-title-b072aff1c98c | None |

As you can see there are multiple domains and formats.
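Whatever the domain or the number of path segments, the part we care about is always the last segment of the path. As a standalone illustration, here is a minimal sketch with the standard library's urllib.parse; the helper name last_path_segment is hypothetical and not part of the tutorial's notebook:

```python
from urllib.parse import urlparse


def last_path_segment(url: str) -> str:
    """Hypothetical helper: return the final path segment of a URL."""
    # Ignore a trailing slash so ".../slug/" and ".../slug" behave the same
    path = urlparse(url).path.rstrip("/")
    # Everything after the last slash is the slug plus its identifier
    return path.rsplit("/", 1)[-1]


urls = [
    "https://medium.com/swlh/how-to-go-from-zero-to-hero-through-personal-branding-e5edab472dfa",
    "https://uxdesign.cc/ux-is-not-a-role-it-is-a-title-b072aff1c98c",
]

for url in urls:
    print(last_path_segment(url))
# how-to-go-from-zero-to-hero-through-personal-branding-e5edab472dfa
# ux-is-not-a-role-it-is-a-title-b072aff1c98c
```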

We are going to extract the slug by chaining several Pandas operations:

df['url'].str.rsplit('/').str[-1].str.rsplit('-', n=1, expand=True)[0]

**How does it work?**

Starting with:
https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129

We use the rsplit method to split on /:

[https:, , towardsdatascience.com, how-to-use-ggplot2-in-python-74ab8adec129]

Then we add str[-1] to get the part after the last slash:

how-to-use-ggplot2-in-python-74ab8adec129

To get rid of the final identifier, -74ab8adec129, we use rsplit again with a limit of one split: .rsplit('-', n=1, expand=True)[0]
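The same chain can be followed step by step on a single string with Python's built-in str.rsplit, which behaves like the Pandas string method for one value. The variable names below are chosen for illustration:

```python
url = "https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129"

# 1. Split on every slash
parts = url.rsplit("/")
print(parts)
# ['https:', '', 'towardsdatascience.com', 'how-to-use-ggplot2-in-python-74ab8adec129']

# 2. Keep only the last element
last = parts[-1]

# 3. Split once from the right to separate the slug from the identifier
slug, post_id = last.rsplit("-", 1)
print(slug)     # how-to-use-ggplot2-in-python
print(post_id)  # 74ab8adec129
```

Splitting from the right with a limit of one is what keeps the hyphens inside the slug untouched.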

Now we can compare the slug from the URL with the modified title:

df[df['url'].str.rsplit('/').str[-1].str.rsplit('-', n=1, expand=True)[0] != df['title_to_url']]

The result is similar to the one from Step 3, but we get more results: 1119.

There is an important difference between the two methods. This step compares for exact equality, so it is stricter and will report cases like:

  • Title: Living as an Empath — w
  • URL: https://medium.com/swlh/living-as-an-empath-when-you-feel-everything-and-nobody-seems-to-understand-65fecce2aa03
  • title_to_url: living-as-an-empath-w

while Step 3 will miss this result since the title_to_url is part of the URL.
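To see both checks side by side, here is a minimal sketch on three rows taken from this article. The column names step3_flag, url_slug and step4_flag are introduced only for this illustration:

```python
import pandas as pd

# Three sample rows from this article: a match, a real mismatch, a truncated title
df = pd.DataFrame({
    "url": [
        "https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129",
        "https://towardsdatascience.com/faster-training-of-efficient-cnns-657953aa080",
        "https://medium.com/swlh/living-as-an-empath-when-you-feel-everything-and-nobody-seems-to-understand-65fecce2aa03",
    ],
    "title_to_url": [
        "how-to-use-ggplot2-in-python",
        "faster-training-for-efficient-cnns",
        "living-as-an-empath-w",
    ],
})

# Step 3 style: flag rows where the slugified title is not a substring of the URL
df["step3_flag"] = [s not in u for s, u in zip(df["title_to_url"], df["url"])]

# Step 4 style: extract the slug and require an exact match
df["url_slug"] = df["url"].str.rsplit("/", n=1).str[-1].str.rsplit("-", n=1).str[0]
df["step4_flag"] = df["url_slug"] != df["title_to_url"]

print(df[["step3_flag", "step4_flag"]])
```

Only the exact comparison flags the truncated title in the third row, because "living-as-an-empath-w" is still a substring of the URL.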

Resources