How to Compare Titles and URLs in Pandas
In this tutorial, I'll show you how to compare titles and URLs with Pandas and Python. We are going to use several Pandas functions and techniques to find problematic data.
To start, here is the initial data and our final goal:
title | url |
---|---|
A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model | https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92 |
Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric | https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8 |
How to Use ggplot2 in Python | https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129 |
Databricks: How to Save Files in CSV on Your Local Computer | https://towardsdatascience.com/databricks-how-to-save-files-in-csv-on-your-local-computer-3d0c70e6a9ab |
A Step-by-Step Implementation of Gradient Descent and Backpropagation | https://towardsdatascience.com/a-step-by-step-implementation-of-gradient-descent-and-backpropagation-d58bda486110 |
Usually the slug of the URL is derived from the title after several transformations.
Below is an example pair of URL and title:
- https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92
- A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model
Our goal is to find all articles which have a mismatch between the title and the slug, like:
- Faster Training for Efficient CNNs
- faster-training-for-efficient-cnns
- https://towardsdatascience.com/faster-training-of-efficient-cnns-657953aa080
The difference above is "for" in the title vs "of" in the URL.
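This mismatch is visible with a plain substring test; here is a minimal sketch using the example pair above:

```python
# Slugified title vs. the actual URL from the example above
slug = 'faster-training-for-efficient-cnns'
url = 'https://towardsdatascience.com/faster-training-of-efficient-cnns-657953aa080'

# The title says "for" but the URL says "of", so the substring test fails
print(slug in url)  # False
```

This simple check is the core idea behind the comparison we build in the steps below.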
In the next section, I’ll review the steps to compare and validate the title against the URL of a given article.
Step 1: Read the data and install slugify
In this tutorial we are going to use a dataset available from Kaggle. You can also find it in the repository related to the notebook (in Resources):
import pandas as pd
df = pd.read_csv('../data/medium_data.csv.zip')
df
If you'd like to learn more about how to read a Kaggle dataset as a Pandas DataFrame, check this article: How to Search and Download Kaggle Dataset to Pandas DataFrame
We also need to install the Python library python-slugify (which provides the slugify module):
pip install python-slugify
Step 2: Convert title to slug with slugify
The next step is to prepare the titles so that we can compare them with the URLs. For this purpose we are going to use the library slugify:
from slugify import slugify
df['title_to_url'] = df['title'].fillna('').apply(lambda x: slugify(x))
After this operation we will have the following data:
title | url | title_to_url |
---|---|---|
A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model | https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92 | a-beginners-guide-to-word-embedding-with-gensim-word2vec-model |
Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric | https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8 | hands-on-graph-neural-networks-with-pytorch-pytorch-geometric |
How to Use ggplot2 in Python | https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129 | how-to-use-ggplot2-in-python |
Databricks: How to Save Files in CSV on Your Local Computer | https://towardsdatascience.com/databricks-how-to-save-files-in-csv-on-your-local-computer-3d0c70e6a9ab | databricks-how-to-save-files-in-csv-on-your-local-computer |
A Step-by-Step Implementation of Gradient Descent and Backpropagation | https://towardsdatascience.com/a-step-by-step-implementation-of-gradient-descent-and-backpropagation-d58bda486110 | a-step-by-step-implementation-of-gradient-descent-and-backpropagation |
Step 3: Find all rows where the slugified title is not in the URL
In this step we are going to compare the slugified titles and the URLs. This is a fairly generic comparison, so it may miss some mismatches.
We are going to create a new column, bad_title, where all values are initially set to False. Only if the slugified title is not part of the URL do we set it to True:
df['bad_title'] = False
for idx, row in df[df.title.notna()].iterrows():
    # compare the lower-cased slugified title against the URL
    if row.title_to_url.lower() not in row.url:
        df.loc[idx, 'bad_title'] = True
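The same check can also be written without an explicit loop. This is a sketch of a vectorized alternative, shown on a small sample DataFrame with the same columns as in the tutorial:

```python
import pandas as pd

# Small sample with the same columns as in the tutorial
df = pd.DataFrame({
    'title_to_url': ['faster-training-for-efficient-cnns',
                     'how-to-use-ggplot2-in-python'],
    'url': ['https://towardsdatascience.com/faster-training-of-efficient-cnns-657953aa080',
            'https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129'],
})

# Row-wise substring test: True marks a suspicious title/URL pair
df['bad_title'] = [slug.lower() not in url
                   for slug, url in zip(df['title_to_url'], df['url'])]
print(df['bad_title'].tolist())  # [True, False]
```

Avoiding iterrows tends to be faster on larger frames, and the logic stays identical to the loop above.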
In total there are 1,051 results like:
title | url | title_to_url |
---|---|---|
<em class="markup--em markup--h3-em">What I Learned from (Two-time) Kaggle Grandmaster Abhishek Thakur</em> | https://towardsdatascience.com/what-i-learned-from-abhishek-thakur-4b905ac0fd55 | em-class-markup-em-markup-h3-em-what-i-learned-from-two-time-kaggle-grandmaster-abhishek-thakur-em |
Faster Training for Efficient CNNs | https://towardsdatascience.com/faster-training-of-efficient-cnns-657953aa080 | faster-training-for-efficient-cnns |
Buyers beware, fake product reviews are plaguing the internet. How Machine Learning can help to spot them. | https://towardsdatascience.com/buyers-beware-fake-product-reviews-are-plaguing-the-internet-cfc599c42b6b | buyers-beware-fake-product-reviews-are-plaguing-the-internet-how-machine-learning-can-help-to-spot-them |
You can use different techniques to get better results, like:
- stripping hyphens from both sides
- converting to lower case
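For example, a small hypothetical helper could apply both normalizations before the substring comparison:

```python
def normalize_slug(slug):
    # Hypothetical helper: lower-case and strip leading/trailing hyphens
    # before comparing against the URL
    return slug.lower().strip('-')

print(normalize_slug('-Faster-Training-For-Efficient-CNNs-'))
# faster-training-for-efficient-cnns
```

The function name and exact normalizations are just one possible choice; adjust them to whatever noise appears in your data.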
Step 4: Extract the slug from the URL and compare to title
In this step we are going to extract the slug from the URL. This is not a simple operation and might depend on many factors.
We can't reliably strip the domain or extract the slug with a simple split:
df['url'].str.split('/', expand=True)
Result:
0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|
https: | medium.com | swlh | how-to-go-from-zero-to-hero-through-personal-branding-e5edab472dfa | |
https: | medium.com | swlh | five-augmented-reality-uses-that-solve-real-life-problems-9e7ff77be858 | |
https: | towardsdatascience.com | everything-you-need-to-know-about-autoencoders-in-tensorflow-b6a63e8255f0 | None | |
https: | medium.com | datadriveninvestor | electric-company-cars-get-a-tax-incentive-in-germany-a0c207da0bb9 | |
https: | uxdesign.cc | ux-is-not-a-role-it-is-a-title-b072aff1c98c | None |
As you can see there are multiple domains and formats.
What we are going to do is extract the slug by chaining several Pandas operations:
df['url'].str.rsplit('/').str[-1].str.rsplit('-', n=1, expand=True)[0]
**How does it work?**
Starting with the URL:
https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129
we use the method rsplit to split on '/':
['https:', '', 'towardsdatascience.com', 'how-to-use-ggplot2-in-python-74ab8adec129']
Then we add str[-1] to get the part after the last slash:
how-to-use-ggplot2-in-python-74ab8adec129
To get rid of the final identifier, -74ab8adec129, we use rsplit again: .str.rsplit('-', n=1, expand=True)[0]
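Put together, the chain behaves like this on a single example URL (a one-row Series for illustration):

```python
import pandas as pd

urls = pd.Series(['https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129'])

# Split on '/', keep the last part, then cut off the trailing identifier
slugs = urls.str.rsplit('/').str[-1].str.rsplit('-', n=1, expand=True)[0]
print(slugs[0])  # how-to-use-ggplot2-in-python
```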
Now we can compare the slug from the URL with the modified title:
df[df['url'].str.rsplit('/').str[-1].str.rsplit('-', n=1, expand=True)[0] != df['title_to_url']]
The result is pretty similar to the one from Step 3, but with more rows: 1,119.
There is a difference between the two methods, though. This step is more accurate and will also report cases like:
- Title: Living as an Empath — w
- URL: https://medium.com/swlh/living-as-an-empath-when-you-feel-everything-and-nobody-seems-to-understand-65fecce2aa03
- title_to_url: living-as-an-empath-w
while Step 3 misses this result, since the title_to_url is a substring of the URL.
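If the split-based extraction proves too fragile, str.extract with a regular expression is another option. This sketch assumes the trailing identifier is always a lowercase hex string, which holds for the examples above but is an assumption worth checking on the full dataset:

```python
import pandas as pd

urls = pd.Series([
    'https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129',
    'https://medium.com/swlh/living-as-an-empath-when-you-feel-everything-and-nobody-seems-to-understand-65fecce2aa03',
])

# Capture everything between the last '/' and a trailing hex identifier
slugs = urls.str.extract(r'/([^/]+)-[0-9a-f]+$')[0]
print(slugs.tolist())
```

URLs that do not end in a hex identifier would come back as NaN here, which is itself a useful signal for odd rows.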