In this post, we will Pandas and Python to collect football data and analyse it. We will try to predict probability for the outcome and the result of the fooball game between: Barcelona vs Real Madrid.

Today is a great day for football fans - Barcelona vs Real Madrid game will be held tomorrow. Fans try to predict:

  • Who is better: Barca or Real?
  • Who won the most El Clasico?
  • Who will win the game?
  • Barcelona vs Real Madrid - Prediction, Odds and Betting Tips

Can we use data science to find answers to those questions? Let's give it a try.

From this post you can learn:

  • Scraping tables with Pandas
  • Pivot/unpivot tables
  • Pandas chaining
  • Fuzzy string matching in Python
  • Basic implementation in Python of:
    • Markov chain
    • ML model

At the end there is a link to Python playbook in Kaggle.

1. Collect stats

Often things start with data collection. Nowadays it is much easier to collect data. Below you can find few ways to scrape football data with Python:

Wikipedia - Historical data

Wikipedia is a great source of information for El Clasico. We will use Pandas method pd.read_html to collect first table on the page:

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_El_Cl%C3%A1sico_matches'
df = pd.read_html(url)[0]
df

data looks like:

No. Date Matchweek Home team Away team Score (FT/HT) Goals (Home) Goals (Away)
0 1 17 February 1929 2 Barcelona Real Madrid 1–2 (0–1) Parera (70) Morera (10, 55)
1 2 9 May 1929 11 Real Madrid Barcelona 0–1 (0–0) NaN Sastre (83)
2 3 26 January 1930 9 Barcelona Real Madrid 1–4 (0–3) Bestit (63) Rubio (10, 37), F. López (17), Lazcano (71)
3 4 30 March 1930 18 Real Madrid Barcelona 5–1 (3–0) Rubio (5, 23), Lazcano (42, 68, 72) Goiburu (84)
4 5 1 February 1931 9 Real Madrid Barcelona 0–0 NaN NaN

Wikipedia - Current season

To get the current results standings we can use wikipedia again with the following code:

import pandas as pd

url = 'https://en.wikipedia.org/wiki/2022%E2%80%9323_La_Liga'
df = pd.read_html(url)[6]
df

You can find the table which we are scraping

football-data-science-projects-real-vs-barcelona

Google

We can use google to search for the latests games and win probability by: barcelona vs real madrid.

Which will give us:

I'm using browser extension to extract the table: Table Capture

2. Transform Data

Table below is stored in pivot or aggregated mode. In this step we will transform the data to a normal form.

football-data-science-projects-real-vs-barcelona

melt column names to values

What we like to achieve is the following format:

Home \ Away variable value corr
0 Almería ALM VAL
1 Athletic Bilbao ALM 4–0 NaN
2 Atlético Madrid ALM NaN ATM
3 Barcelona ALM 2–0 NaN
4 Cádiz ALM 1–1 ELC

Pandas offer method melt which can turn columns into rows/values:

df_melt = pd.melt(df, id_vars=['Home \ Away'])
df_melt

3. Match abbreviations and full names

We have two different names of the teams:

  • full name
  • abbreviation

We will use basic fuzzy string matching to map them - we will cover two ways for mapping:

difflib

Module difflib is very powerful and offers a quick way to match similar strings. In our case we can try to map abbreviations and full names by method get_close_matches:

teams = ['Real Madrid', 'Real Sociedad', 'Rayo Vallecano']
difflib.get_close_matches('RSO', teams, n=3, cutoff=0.2)

First we can collect all names and run test for similarity by:

full_names = df_melt['Home \ Away'].unique()
abbr_names = df_melt['variable'].unique()

for keyword in abbr_names:
	matches = difflib.get_close_matches(keyword, full_names, n=20, cutoff=0.2)
	print(keyword, matches)

Results are far for perfect:

ALM ['Atlético Madrid', 'Almería']
ATH ['Almería']
ATM ['Atlético Madrid', 'Almería']
BAR ['Almería']
CAD ['Cádiz', 'Almería']
CEL ['Elche', 'Cádiz']
ELC ['Elche', 'Cádiz']
ESP ['Elche', 'Sevilla']
GET ['Elche', 'Girona', 'Getafe']

So the full code for Pandas to match the teams would be:

import difflib

correct_values = {}
words = full_names

for keyword in abbr_names:
	similar = difflib.get_close_matches(keyword, words, n=3, cutoff=0.2)
	for x in similar:
    	correct_values[x] = keyword
        	 
df_melt["corr"] = df_melt["Home \ Away"].map(correct_values)

This method is very generic and produces results with errors.

Regex to match name and abbreviation

Let's try one more way to match abbreviations to possible names by using regex. This way is adjusted to the logic of abbreviating football teams.

The code is:

import re    
def is_abbrev(abbrev, words):
	matches = []
	for word in words:
    	pattern = "(|.*\s)".join(abbrev.lower())
    	if re.match("^" + pattern, word.lower()) is not None:
        	matches.append(word)
	return matches

for keyword in abbr_names:
	matches = is_abbrev(keyword, full_names)
	print(keyword, matches)

and the results are much better:

ALM ['Almería']
ATH ['Athletic Bilbao']
ATM ['Atlético Madrid']
BAR ['Barcelona']
CAD []
CEL ['Celta Vigo']
ELC ['Elche']
ESP ['Espanyol']
GET ['Getafe']
GIR ['Girona']
MLL []

Now we can export all games between Barcelona and Real Madrid. We can also find their latest games in Spain.

4. Python: odds in sport

This is a definition from wikipedia about odds:

In probability theory, odds provide a measure of the likelihood of a particular outcome. They are calculated as the ratio of the number of events that produce that outcome to the number that do not.

You can find more about this in the Resource section.

soccerapi

Below you can find simple way to collect odds in Python by using library -soccerapi - pip install soccerapi:

from soccerapi.api import Api888Sport

api = Api888Sport()
url = 'https://www.888sport.com/#/filter/football/spain/'
odds = api.odds(url)

pd.DataFrame(odds)

This will collect all odds for Spain:

time home_team away_team full_time_result under_over both_teams_to_score double_chance
0 2023-04-04T19:00:00Z Athletic Bilbao Osasuna {'1': 1460, 'X': 4000, '2': 7500} {'O2.5': 2140, 'U2.5': 1730} {'yes': 2250, 'no': 1610} {'1X': 1110, '12': 1260, '2X': 2500}
1 2023-04-05T19:00:00Z FC Barcelona Real Madrid {'1': 2250, 'X': 3400, '2': 3100} {'O2.5': 1830, 'U2.5': 2000} {'yes': 1650, 'no': 2180} {'1X': 1380, '12': 1330, '2X': 1610}
2 2023-04-07T16:30:00Z Lugo Tenerife {'1': 4100, 'X': 2750, '2': 2020} {'O2.5': 1630, 'U2.5': 2200} {'yes': 2430, 'no': 1520} {'1X': 1740, '12': 1410, '2X': 1220}
3 2023-04-07T19:00:00Z Villarreal B Málaga {'1': 1930, 'X': 3250, '2': 3700} {'O2.5': 2200, 'U2.5': 1630} {'yes': 1980, 'no': 1780} {'1X': 1260, '12': 1320, '2X': 1810}
4 2023-04-07T19:00:00Z Sevilla Celta Vigo {'1': 2150, 'X': 3300, '2': 3400} {'O2.5': 2250, 'U2.5': 1640} {'yes': 1980, 'no': 1780} {'1X': 1330, '12': 1340, '2X': 1670}

We can find the odds for: FC Barcelona and Real Madrid: {'1': 2250, 'X': 3400, '2': 3100}.

sports-betting

Another useful Python package for betting is: sports-betting. It can be installed by: pip install sports-betting

This package require Python 3.9+ and offer simple API:

from sportsbet.datasets import SoccerDataLoader
dataloader = SoccerDataLoader(param_grid={'league': ['Italy'], 'year': [2020]})
X_train, Y_train, O_train = dataloader.extract_train_data(odds_type='market_maximum', drop_na_thres=1.0)
X_fix, Y_fix, O_fix = dataloader.extract_fixtures_data()

Calculate odds

We can follow the next steps in order to calculate the odds in a custom code:

  • Barcelona has a home advantage.
  • Barcelona has (out of their last 10 games):
    • won 7
    • drawn 1
    • lost 2
  • Real Madrid:
    • won 6
    • drawn 2
    • lost 2

To calculate the odds for this match, we can use simple Python code:

barca_win_rate = 7 / 10
real_win_rate = 6 / 10
draw_rate = 1 - barca_win_rate - real_win_rate

barca_odds = 1 / barca_win_rate
real_odds = 1 / real_win_rate
draw_odds = 1 / draw_rate

print("Barcelona odds:", barca_odds)
print("Real Madrid odds:", real_odds)
print("Draw odds:", draw_odds)

which give us:

Barcelona odds: 1.4285714285714286
Real Madrid odds: 1.6666666666666667
Draw odds: -3.333333333333334

5. Python Markov Chain

Finally we can use Markov Chains to calculate probability for win, draw and lose.

Collect data

We will collect all previous el clasico games by:

import pandas as pd

url = 'http://eurorivals.net/head-to-head/barcelona-vs-real-madrid'

df_el_cl = pd.read_html(url)[1]
df_el_cl.head()

result:

Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0 2023 19 March NaN Primera Barcelona 2 - 1 Real Madrid
1 2023 2 March NaN Cup Real Madrid 0 - 1 Barcelona
2 2022 16 October NaN Primera Real Madrid 3 - 1 Barcelona
3 2022 20 March NaN Primera Real Madrid 0 - 4 Barcelona
4 2021 24 October NaN Primera Barcelona 1 - 2 Real Madrid

Data cleaning with Panda chaining

In this step we will demonstrate how to use Pandas chaining for data cleaning and preprocessing:

import numpy as np

df_trans = (
df
  .set_axis(['date', 'flag', 'tournament', 'home', 'score', 'away'], axis=1) # rename columns
  .drop('flag', axis=1) # drop columns
  .assign(score_home=lambda x: x.score.str.split('-', expand=True)[0].astype(int)) # split and add new column
  .assign(score_away=lambda x: x.score.str.split('-', expand=True)[1].astype(int)) # split and add new column
  .assign(date_n=lambda x: pd.to_datetime(x.date)) # convert to datetime
  .assign(goal_diff=lambda x: x.score_home - x.score_away) # compare two columns
  .assign(result= lambda x: x['goal_diff'].apply(lambda y: "L" if y < 0 else ("W" if y > 0 else "D"))) # conditional in chaining
  .set_index('date_n') # set new index
  .sort_index()  # sort by index
  .loc[:,['home', 'away', 'score_home', 'score_away', 'result']] # return column from chaining
)

After the preprocessing we have this data:

home away score_home score_away result
date_n
2004-04-24 Real Madrid Barcelona 1 2 L
2004-11-19 Barcelona Real Madrid 3 0 W
2005-04-09 Real Madrid Barcelona 4 2 W
2005-11-18 Real Madrid Barcelona 0 3 L
2006-03-31 Barcelona Real Madrid 1 1 D
2006-10-22 Real Madrid Barcelona 2 0 W
2007-03-10 Barcelona Real Madrid 3 3 D
2007-12-23 Barcelona Real Madrid 0 1 L
2008-05-07 Real Madrid Barcelona 4 1 W
2008-12-13 Barcelona Real Madrid 2 0 W
2009-05-02 Real Madrid Barcelona 2 6 L
2009-11-29 Barcelona Real Madrid 1 0 W
2010-04-10 Real Madrid Barcelona 0 2 L
2010-11-29 Barcelona Real Madrid 5 0 W
2011-04-16 Real Madrid Barcelona 1 1 D
2011-04-20 Barcelona Real Madrid 0 0 D
2011-04-27 Real Madrid Barcelona 0 2 L
2011-05-03 Barcelona Real Madrid 1 1 D
2011-12-10 Real Madrid Barcelona 1 3 L
2012-01-18 Real Madrid Barcelona 1 2 L
2012-01-25 Barcelona Real Madrid 1 2 L
2012-04-21 Barcelona Real Madrid 1 2 L
2012-10-07 Barcelona Real Madrid 2 2 D
2013-01-30 Real Madrid Barcelona 1 1 D
2013-02-26 Barcelona Real Madrid 1 3 L
2013-03-02 Real Madrid Barcelona 2 1 W
2013-10-26 Barcelona Real Madrid 2 1 W
2014-03-23 Real Madrid Barcelona 3 4 L
2014-04-16 Barcelona Real Madrid 1 2 L
2014-10-25 Real Madrid Barcelona 3 1 W
2015-03-22 Barcelona Real Madrid 2 1 W
2015-11-21 Real Madrid Barcelona 0 4 L
2016-04-02 Barcelona Real Madrid 1 2 L
2016-12-03 Barcelona Real Madrid 1 1 D
2017-04-23 Real Madrid Barcelona 2 3 L
2017-12-23 Real Madrid Barcelona 0 3 L
2018-05-06 Barcelona Real Madrid 2 2 D
2018-10-28 Barcelona Real Madrid 5 1 W
2019-02-06 Barcelona Real Madrid 1 1 D
2019-02-27 Real Madrid Barcelona 0 3 L
2019-03-02 Real Madrid Barcelona 0 1 L
2019-12-18 Barcelona Real Madrid 0 0 D
2020-03-01 Real Madrid Barcelona 2 0 W
2020-10-24 Barcelona Real Madrid 1 3 L
2021-04-10 Real Madrid Barcelona 2 1 W
2021-10-24 Barcelona Real Madrid 1 2 L
2022-03-20 Real Madrid Barcelona 0 4 L
2022-10-16 Real Madrid Barcelona 3 1 W
2023-03-02 Real Madrid Barcelona 0 1 L
2023-03-19 Barcelona Real Madrid 2 1 W

Markov chains probability

We will use package mchmm which can be installed by: pip install mchmm

In order to find the transition matrix and plot graph of probability changes:

import mchmm as mc
a = mc.MarkovChain().from_data(df_trans['result'])

So we get probability matrix by:

a.observed_p_matrix

results into:

array([[0.18181818, 0.54545455, 0.27272727],
[0.26086957, 0.34782609, 0.39130435],
[0.2 , 0.53333333, 0.26666667]])

or as a table:

0 1 2
0 0.181818 0.545455 0.272727
1 0.260870 0.347826 0.391304
2 0.200000 0.533333 0.266667

And the following transition graph by:

graph = a.graph_make(
  	format="png",
  	graph_attr=[("rankdir", "LR")],
  	node_attr=[("fontname", "Roboto bold"), ("fontsize", "20")],
  	edge_attr=[("fontname", "Iosevka"), ("fontsize", "12")]
	)
graph.render()

graph:

markov_chain_graph_football_prediction

So the draw state seems to be less favorable. After winning, often there is a loss.

Note: We don't take into account home and away teams.

6. Predict score with simple ML model

Finally we can try to predict the score based on the scores so far. We can use simple ML model with:

  • inputs - home and away teams
  • outputs - home and away scores

The code is:

from sklearn.linear_model import LinearRegression

mapping = {'Barcelona':1, 'Real Madrid':2}
df_trans = df_trans.replace(mapping).head()

TRAIN_INPUT = df_trans[['home', 'away']].values
TRAIN_OUTPUT = df_trans[['score_home', 'score_away']].values

X_TEST = [[2,  1]]
outcome = predictor.predict(X=X_TEST)
coefficients = predictor.coef_

print(outcome)
print('Outcome : Coefficients : {}'.format(coefficients))

which give us:

[[1.66666667 2.33333333]]
Outcome : Coefficients : [[-0.16666667 0.16666667]
[ 0.91666667 -0.91666667]]

Note: this is a pretty naive way to try to predict score in Football.

7. Resources

Helpful resources for aspiring data scientists who would like to research football by data science deeper: