1. Introduction

Did you know that the Ancient Olympics date back to 776 BC? Even the philosopher Plato is believed to have competed in wrestling. Plato, Socrates, and Aristotle frequented gymnasia, where the rigorous training for games like the Olympics was seen as a metaphor for intellectual competition and the pursuit of truth. Today, the Games look very different — but age is still a fascinating factor in athletic success.

The Olympic Games bring together athletes from hundreds of countries. They compete for gold, silver, and bronze across dozens of sports. But at what age do athletes actually peak? A gymnast and a marathon runner follow very different paths. I challenged myself to answer this question using data from the official Milano Cortina 2026 Olympics website (milano-cortina-2026/medals), recreating a visualization inspired by a popular r/dataisbeautiful post, "[OC] Olympic medal winners average age per sport".

This is a beginner-friendly data science project. You will learn how to:

  • Scrape data from a real website
  • Work with JSON data from the web
  • Clean and organize data
  • Create a clear and informative visualization

Whether you are a student or a self-learner, this project covers real-world skills from start to finish.

You can find the final result below:

2. Objectives

The goal of this project is simple:

  • We collect medal data from the official Milano Cortina 2026 Olympics website.
  • We collect each medalist's age from the athlete pages.
  • Then we clean and organize the data.
  • Finally, we build a visualization that shows the average age of medal winners per sport.

The data comes directly from the website's JSON API — no manual downloading needed. The result is a chart inspired by a popular r/dataisbeautiful post, making it easy to compare athlete ages across different sports.

We collect the following variables for each medal winner:

  • Name — full name of the athlete
  • Age — age at the time of the Games
  • Sport — the sport in which the medal was won
  • Discipline — the specific event within the sport
  • Medal Type — gold, silver, or bronze
  • Country — the athlete's representing nation
  • Gender — male, female, or mixed event

Note: Some of the variables come from the official Milano Cortina 2026 medals page. Athlete age is taken from the JSON data behind each athlete's page; if age is missing, it is calculated from the athlete's date of birth.

3. Language and Tools

| Task | Technique | Tools/Packages |
|---|---|---|
| Data Collection | Web scraping, JSON API requests | requests, json |
| Data Pre-processing | Merging data, handling missing values, calculating age from date of birth | pandas |
| Data Visualization | Scatter plots | plotly |
| Language & Environment | | Python; Jupyter Notebook |

4. Data Source

The Milano Cortina 2026 medal data is published by the International Olympic Committee (IOC) through the official Olympics website. It is publicly accessible via a JSON API endpoint — no account or API key is required.

This project collects data directly from the official medals page, including the following:

  • Athlete Data: Name, Age, Gender, Country
  • Event Data: Sport, Discipline, Event, Medal Type (Gold, Silver, Bronze)

Note: Two alternative datasets are available on Kaggle for those who prefer to skip scraping: a Milano Cortina 2026 dataset and a historical Olympics dataset (1896–2024). This project uses the unofficial API directly, as it provides the most up-to-date results.

⚠️ Before scraping any website, always read its Terms of Use. Use the data for personal and educational purposes only.

⚠️ Websites can change their data structure, pages, and protections, so this guide may become outdated. This already happened with the Paris 2024 site.

5. Data Collection

Before writing any code, we first explore how the website loads its data. This helps us find the raw JSON API endpoint directly.

Step 1 — Inspect the Network Tab

  1. Open the medals page in your browser
  2. Right-click anywhere on the page → click Inspect
  3. Go to the Network tab
  4. Reload the page (F5)
  5. In the filter bar, select Fetch/XHR to show only API requests
  6. Look for a request that contains medals or results in the URL
  7. Click on it → go to the Response tab to preview the raw JSON

Step 2 — Copy the Request as cURL

  • Right-click on the request in the Network tab
  • Click Copy → Copy as cURL
  • This copies the full request — including headers and cookies — to your clipboard

Step 3 — Inspect the JSON in Insomnia

  • Download and open Insomnia
  • Click New Request → select Import from cURL
  • Paste the copied cURL command
  • Click Send
  • Explore the JSON structure — identify fields like name, age, sport, medal, country
  • Save the result as
    • 'data/paris_olympiad_2024.json'
    • 'data/milano_olympiad_2026.json'

💡 Tip: Importing as cURL is better than copying just the URL. It automatically includes all required headers, which helps avoid blocked or empty responses from the server.

💡 Tip: Insomnia is a free API client for small projects. It makes it easy to explore and test API responses before writing any code. Alternatives include Postman or simply the browser's Network tab itself.
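
Once the request works in Insomnia, the same call can be made from Python. The sketch below is one possible way to do it with requests, not the exact code used for this article: fetch_json and save_json are hypothetical helper names, and the URL and headers are placeholders that you need to copy from your own cURL export.

```python
import json
from pathlib import Path

import requests


def fetch_json(url, headers=None, timeout=30, session=None):
    """GET `url` and return the decoded JSON body, raising on HTTP errors."""
    http = session or requests  # a session object can be injected for reuse or testing
    response = http.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.json()


def save_json(payload, path):
    """Write `payload` to `path` as UTF-8 JSON, creating parent folders if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
    return target


# Example usage (placeholders, not real values):
# medals = fetch_json("<endpoint URL from the Network tab>", headers={"User-Agent": "<from cURL>"})
# save_json(medals, "data/milano_olympiad_2026.json")
```

Passing the copied headers explicitly keeps the Python request close to what the browser sent, which is the same reason the cURL import is recommended above.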

Step 4 — Collect Athlete Ages

Athlete ages are not included in the medals JSON. We need to fetch them separately from each athlete's profile page.

How to find the age endpoint:

  1. Open an athlete profile page, e.g.:
    • https://www.olympics.com/en/milano-cortina-2026/results/athlete-details/41614
    • https://olympics.com/en/paris-2024/athlete/sabrina-maneca-voinea_1550186
  2. Open the Network tab → reload the page
  3. Look for a request containing CIS_Bio_Athlete or api/v2/athletes?competitionCode
  4. Copy as cURL → import into Insomnia
  5. The endpoint follows this pattern:
    • https://www.olympics.com/wmr-api/api/v2/athletes?competitionCode=OWG2026&code=41614&languageCode=ENG
    • https://olympics.com/OG2024/data/CIS_Bio_Athlete~comp=OG2024~code={athlete_id}~lang=ENG.json

Fetch ages for all medalists:

  • Collect the IDs of all medalists (df_m is the flattened medals DataFrame built in section 6.5):
medalists = (
    df_m['extraData'].apply(pd.Series).drop_duplicates()['detailUrl']
    .str.split('_', expand=True)[1].dropna().values
)
medalists
  • Fetch the profile JSON for medalists only:
import json
import random
import time

import pandas as pd
import requests

# Placeholder: paste the headers from your own "Copy as cURL" export here.
headers = {}

data = []
for athlete_id in medalists:
    try:
        time.sleep(random.randint(9, 53))  # polite delay between requests
        url = f"https://olympics.com/OG2024/data/CIS_Bio_Athlete~comp=OG2024~code={athlete_id}~lang=ENG.json"
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
        profile = response.json()
        print(profile['person'].get('birthDate'))  # progress check
        data.append(profile)
    except Exception as exc:  # keep going if a single profile fails
        print(f"Failed for id {athlete_id}: {exc}")

# Keep only the nested 'person' record and save this batch to disk
df_age = pd.DataFrame(data)['person'].apply(pd.Series)
df_age.to_json('data/age_600_800.json')
df_age

⚠️ Important: Always add a delay between requests (time.sleep). Sending too many requests too fast may get your IP blocked. The random.randint(9, 53) adds a random wait of 9–53 seconds between each request — this mimics human browsing behavior and is more polite to the server.
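
With delays this long, a full run takes hours, so it helps to fetch the IDs in slices and save each slice to its own file, as the file name age_600_800.json suggests. The sketch below shows one way to slice the ID list; the slice size of 200 and the helper names batches and batch_filename are assumptions for illustration.

```python
def batches(items, size):
    """Yield (start, end, chunk) tuples covering `items` in slices of `size`."""
    for start in range(0, len(items), size):
        chunk = items[start:start + size]
        yield start, start + len(chunk), chunk


def batch_filename(start, end, folder="data"):
    """Name a batch file after the slice it covers, e.g. data/age_600_800.json."""
    return f"{folder}/age_{start}_{end}.json"


# Example with five dummy IDs and a slice size of two
for start, end, chunk in batches(["a", "b", "c", "d", "e"], 2):
    print(batch_filename(start, end), chunk)
```

If a run is interrupted, only the current slice has to be fetched again.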

6. Data Processing

6.1 Read the JSON File

We start by loading the raw JSON file saved from the API response.

import pandas as pd
import json
from IPython.display import JSON, display

with open('data/milano_olympiad_2026.json') as f:
    d = json.load(f)

JSON(d)

6.2 Explore the JSON Structure

The JSON has a nested structure. We first explore the top-level keys to understand what data is available.

List of countries (NOCs):

pd.DataFrame(d.get('props').get('pageProps').get('nocList'))

Result:

   longName        nameOrder  longNameOrder  continent  id   name            forgeSlug
0  Afghanistan     10         10             ASI        AFG  Afghanistan
1  Albania         20         20             EUR        ALB  Albania         noc-al
2  Algeria         30         30             AFR        ALG  Algeria         noc-dz
3  American Samoa  40         40             OCE        ASA  American Samoa  noc-as
4  Andorra         50         50             EUR        AND  Andorra         noc-ad

Medal standings table:

df = pd.DataFrame(
    d.get('props')
     .get('pageProps')
     .get('initialMedals')
     .get('medalStandings')
     .get('medalsTable')
)
df

6.3 Inspect a Single Country

We can drill down into a single country to understand the nested structure of disciplines and medal winners.

df[df.organisation == 'BUL']['disciplines'] \
    .apply(pd.Series).T[25] \
    .apply(pd.Series)['medalWinners'] \
    .apply(pd.Series).stack() \
    .apply(pd.Series)

Result (shown at the discipline level, before the medalWinners column is expanded; cells are truncated):

   code  name                 gold  silver  bronze  total  medalWinners
0  BOX   Boxing               0     0       1       1      [{'disciplineCode': 'BOX', 'eventCode': 'BOXM57KG--------------', 'eventCategory': 'Men', 'eventDescription': 'Men's 57kg', ...
1  GRY   Rhythmic Gymnastics  0     1       0       1      [{'disciplineCode': 'GRY', 'eventCode': 'GRYW1AA---------------', 'eventCategory': 'Women', 'eventDescription': 'Individual All-Around', ...
2  TKW   Taekwondo            0     0       1       1      [{'disciplineCode': 'TKW', 'eventCode': 'TKWW57KG--------------', 'eventCategory': 'Women', 'eventDescription': 'Women -57kg', ...
3  WLF   Weightlifting        1     0       1       2      [{'disciplineCode': 'WLF', 'eventCode': 'WLFM89KG--------------', 'eventCategory': 'Men', 'eventDescription': 'Men's 89kg', ...
4  WRE   Wrestling            2     0       0       2      [{'disciplineCode': 'WRE', 'eventCode': 'WREMGR87KG------------', 'eventCategory': 'Men', 'eventDescription': 'Men's Greco-Roman 87kg', ...

6.4 Map Discipline Codes to Names

The data uses short codes for each sport. We create a mapping dictionary to convert them to readable names.

disciplineCode_mapping = {
    "ARC": "Archery",
    "ATH": "Athletics",
    "BDM": "Badminton",
    # ... add all remaining codes
}
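
Instead of typing every code by hand, the mapping could also be derived from the data, since each discipline entry in the output of section 6.3 carries both a code and a name. The sketch below assumes that structure holds for every country; discipline_names is a hypothetical helper and the small table is toy data for illustration.

```python
import pandas as pd


def discipline_names(medals_table):
    """Build a {code: name} dict from the nested 'disciplines' column."""
    # One dict per discipline entry, across all countries
    entries = medals_table["disciplines"].explode().dropna().tolist()
    per_discipline = pd.json_normalize(entries)
    return (
        per_discipline.drop_duplicates("code")
        .set_index("code")["name"]
        .to_dict()
    )


# Toy stand-in for the medals table (organisation "XXX" is made up)
toy_table = pd.DataFrame({
    "organisation": ["BUL", "XXX"],
    "disciplines": [
        [
            {"code": "BOX", "name": "Boxing", "medalWinners": []},
            {"code": "GRY", "name": "Rhythmic Gymnastics", "medalWinners": []},
        ],
        [{"code": "BOX", "name": "Boxing", "medalWinners": []}],
    ],
})

toy_mapping = discipline_names(toy_table)
print(toy_mapping)
```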

6.5 Flatten the Full Dataset

Finally, we flatten the nested JSON into a clean, analysis-ready DataFrame and apply the discipline name mapping.

df_m = (
    df['disciplines']
    .apply(pd.Series).stack()
    .apply(pd.Series)['medalWinners']
    .apply(pd.Series).stack()
    .apply(pd.Series)
)

df_m['disciplineName'] = df_m['disciplineCode'].map(disciplineCode_mapping)
df_m

💡 Tip: The combination of .apply(pd.Series) and .stack() is key here. The first call expands each nested list into columns, and .stack() then pivots those columns into rows, which makes deeply nested JSON data much easier to work with.
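
To see what the chain does, it can help to run it on a tiny hand-made table first. The sketch below uses toy data with made-up names and adds .dropna() after each .stack(), an addition that is not in the code above, so that the padding cells created for shorter lists are removed regardless of the pandas version.

```python
import pandas as pd

# Toy table: two countries, each with a list of disciplines,
# each discipline with a list of medal winners
toy = pd.DataFrame({
    "organisation": ["AAA", "BBB"],
    "disciplines": [
        [{"code": "X1", "medalWinners": [{"name": "A"}, {"name": "B"}]}],
        [
            {"code": "X2", "medalWinners": [{"name": "C"}]},
            {"code": "X3", "medalWinners": [{"name": "D"}]},
        ],
    ],
})

flat = (
    toy["disciplines"]
    .apply(pd.Series).stack().dropna()   # one row per (country, discipline)
    .apply(pd.Series)["medalWinners"]    # dict keys become columns
    .apply(pd.Series).stack().dropna()   # one row per medal winner
    .apply(pd.Series)                    # winner fields become columns
)
print(flat)
```

The resulting index has three levels (country row, discipline position, winner position), which is why the outputs later in this article show several index columns.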

6.6 Load JSON Age Files

We saved the age data in multiple JSON files (e.g. age_0_200.json, age_200_400.json, etc.). We load and combine them all into a single DataFrame.

import glob
import os

import pandas as pd

json_dir = 'data/'

# Find every saved batch, e.g. data/age_0_200.json
json_pattern = os.path.join(json_dir, 'age*.json')
file_list = sorted(glob.glob(json_pattern))

dfs = []
for file in file_list:
    df_temp = pd.read_json(file)  # read_json opens the file itself
    df_temp['file'] = os.path.basename(file)  # track which file the row came from
    dfs.append(df_temp)

df_age = pd.concat(dfs)

We only need two columns — the athlete's code and their birth date:

df_age[['birthDate', 'code']]

6.7 Merge Age Data with Medal Data

Now we merge the age data with our medals DataFrame. We match athletes using their unique competitor code.

df_m = pd.merge(
    df_m,
    df_age,
    right_on='code',
    left_on='competitorCode',
    how='inner'
)
df_m

💡 Note: We use an inner join — this keeps only athletes that appear in both DataFrames. Athletes with missing age data will be excluded from the final visualization.
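
The merged DataFrame contains birthDate but not yet an age column. Below is a minimal sketch of how the age could be derived, assuming it is measured on a single reference date (here 26 July 2024, the Paris 2024 opening ceremony, chosen for illustration) and that birthDate parses as a date string. The helper name age_on is hypothetical.

```python
import pandas as pd


def age_on(birth_dates, reference_date):
    """Return completed years of age on `reference_date` (NaN if the date is missing)."""
    born = pd.to_datetime(pd.Series(birth_dates), errors="coerce")
    ref = pd.Timestamp(reference_date)
    # Has the birthday already happened in the reference year?
    had_birthday = (born.dt.month < ref.month) | (
        (born.dt.month == ref.month) & (born.dt.day <= ref.day)
    )
    return ref.year - born.dt.year - (~had_birthday).astype(int)


ages = age_on(["2000-07-26", "2000-07-27", None], "2024-07-26")
print(ages)

# Possible usage on the merged frame:
# df_m["age"] = age_on(df_m["birthDate"], "2024-07-26")
```

Counting completed years this way avoids the off-by-one errors that come from dividing a number of days by 365.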

7. Exploratory Data Analysis (EDA)

7.1 Inspect a Single Sport

We start by filtering the data for a single discipline. Here we look at Rowing as an example.

df_p = df_m[df_m['disciplineCode'] == 'ROW']
df_p

The result below is truncated for readability:

        disciplineCode  eventCode               eventCategory  eventDescription  eventOrder  medalType
0 19 0  ROW             ROWMNOCOX4------------  Men            Men's Four        6           ME_GOLD
     1  ROW             ROWMCOXED8------------  Men            Men's Eight       14          ME_BRONZE
3 12 0  ROW             ROWWNOCOX2------------  Women          Women's Pair      1           ME_BRONZE
5 9  0  ROW             ROWWNOCOX2------------  Women          Women's Pair      1           ME_GOLD
     1  ROW             ROWWNOCOX4------------  Women          Women's Four      5           ME_GOLD

7.2 Pivot by Event and Medal Type

We create a pivot table to get a cleaner view — one row per event, one column per medal type.

pd.pivot(
    df_p.drop((21, 1, 2)),  # drop one row by its multi-level index label (see 7.4 on duplicates)
    index='eventDescription',
    columns='medalType',
    values='competitorDisplayName'
)

Output:

medalType                          ME_BRONZE              ME_GOLD              ME_SILVER
eventDescription
Lightweight Men's Double Sculls    PAPAKONSTA/GKAIDATZIS  Mc CARTHY/O DONOVAN  OPPO/SOARES
Lightweight Women's Double Sculls  KONTOU/FITSIOU         CRAIG/GRANT          van GRONINGEN/COZMIUC
Men's Double Sculls                LYNCH/DOYLE            CORNEA/ENACHE        TWELLAAR/BROENINK
Men's Eight                        United States          Great Britain        Netherlands
Men's Four                         Great Britain          United States        New Zealand

7.3 Inspect a Specific Event

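We can also filter down to a single event to inspect the medal winners directly. Because the High Jump belongs to Athletics rather than Rowing, we filter the full df_m DataFrame here instead of df_p.

df_m[df_m['eventDescription'] == "Women's High Jump"]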

7.4 Handle Duplicate Entries

The pivot in section 7.2 may raise an error:

ValueError: Index contains duplicate entries, cannot reshape

This happens because some events have more than one row for the same event and medal type. We can identify them like this:

df_p[df_p.duplicated(subset=['eventDescription', 'medalType'])].T

⚠️ Note: Duplicate entries can occur in team events or relay races — where multiple athletes share the same event and medal type. These need to be handled before pivoting. Options include dropping duplicates, aggregating athlete names into a list, or filtering to individual events only.
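
As an illustration of the aggregation option, pivot_table accepts an aggregation function, so several names that share an event and medal type can be joined into one cell. This is a sketch on toy data with made-up event and athlete names, not the article's actual output.

```python
import pandas as pd

# Toy data: "Event X" has two bronze medalists, which would break pd.pivot
toy = pd.DataFrame({
    "eventDescription": ["Event X", "Event X", "Event X", "Event X", "Event Y"],
    "medalType": ["ME_GOLD", "ME_SILVER", "ME_BRONZE", "ME_BRONZE", "ME_GOLD"],
    "competitorDisplayName": [
        "Athlete A", "Athlete B", "Athlete C", "Athlete D", "Athlete E",
    ],
})

wide = toy.pivot_table(
    index="eventDescription",
    columns="medalType",
    values="competitorDisplayName",
    aggfunc=lambda names: " / ".join(names),  # merge duplicates into one cell
)
print(wide)
```

Unlike dropping rows by index, this keeps every medalist in the table.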

8. Visualization

We use Plotly Express to create an interactive scatter plot. Each point represents a medal winner. The y-axis shows the sport, the x-axis shows the athlete's age, and the color indicates the medal type.

import plotly.express as px

fig = px.scatter(
    df_plotly,
    y="disciplineName",
    x="age",
    color="medalType",
    symbol="medalType",
    symbol_sequence=['diamond', 'circle', 'circle', 'circle'],
    color_discrete_sequence=['green', 'silver', 'gold', 'brown']
)

fig.update_layout(
    title="Olympic medal winners average age per sport",
    xaxis=dict(
        showgrid=False,
        showline=True,
        linecolor='black',
        tickfont_color='black',
        showticklabels=True,
        dtick=10,
        ticks='outside',
        tickcolor='rgb(102, 102, 102)',
    ),
    margin=dict(l=140, r=40, b=50, t=80),
    legend=dict(
        font_size=10,
        yanchor='middle',
        xanchor='right',
    ),
    autosize=False,
    width=800,
    height=1200,
    paper_bgcolor='white',
    plot_bgcolor='white',
    hovermode='closest',
)

fig.update_traces(
    mode='markers',
    marker=dict(
        line_width=1,
        opacity=0.4,
        size=14,
        line=dict(
            color='MediumPurple',
            width=2
        )
    )
)

fig.show()

Key design choices:

  • symbol_sequence: the diamond marks the first category (the per-sport average described in the Results), and circles mark the individual gold, silver, and bronze medal winners
  • color_discrete_sequence: green is used for the average marker; silver, gold, and brown are used for the medal types
  • opacity=0.4: transparent markers help visualize overlapping data points
  • hovermode='closest': hovering shows the values of the nearest point
  • paper_bgcolor and plot_bgcolor: a white background keeps the chart clean and minimal

💡 Tip: Plotly charts are interactive by default. You can zoom, pan, and hover over individual points to explore the data. To save the chart as a static image, use fig.write_image("chart.png") — requires the kaleido package.
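
The df_plotly frame passed to px.scatter is not built in the steps above. One way it could be assembled is sketched below, assuming the merged data has disciplineName, age, and medalType columns and that the per-sport average is added as extra rows under its own label. The helper build_plot_frame and the label "AVERAGE" are hypothetical, and the rows are toy values.

```python
import pandas as pd


def build_plot_frame(medals, average_label="AVERAGE"):
    """Stack per-sport average rows on top of the individual medal winners."""
    points = medals[["disciplineName", "age", "medalType"]].dropna(subset=["age"])
    averages = (
        points.groupby("disciplineName", as_index=False)["age"]
        .mean()
        .assign(medalType=average_label)
    )
    # Averages come first so they are the first category Plotly encounters
    return pd.concat([averages, points], ignore_index=True)


toy_medals = pd.DataFrame({
    "disciplineName": ["Sport A", "Sport A", "Sport B"],
    "age": [20, 30, 40],
    "medalType": ["ME_GOLD", "ME_SILVER", "ME_BRONZE"],
})
toy_plot_frame = build_plot_frame(toy_medals)
print(toy_plot_frame)
```

Plotly Express assigns colors and symbols in the order the categories appear, so the order of the medal rows decides which color each medal type receives. Passing category_orders to px.scatter would pin that order explicitly.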

9. Results

The scatter plot shows the age distribution of Olympic medal winners across all sports at the Paris 2024 Games. Each point represents one athlete. The diamond marker shows the average age per sport.

Key observations:

  • Skateboarding and Rhythmic Gymnastics have the youngest medal winners — most athletes cluster between ages 15–20
  • Equestrian has the oldest and widest age distribution — with medal winners ranging from their 30s all the way to their 60s
  • Artistic Gymnastics and Diving also skew young — with most winners in their late teens and early 20s
  • Wrestling, Athletics, and Judo show a wide age spread — suggesting these sports are competitive across a broader age range
  • Tennis shows one clear outlier on the right — Novak Djokovic won the gold medal in men's singles tennis at the 2024 Paris Olympics at the age of 37
  • Shooting has one of the widest distributions — with winners spanning from around age 15 to nearly 60
  • Rowing and Equestrian have the oldest average ages — their diamonds sit furthest to the right
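
Observations like these can be checked numerically rather than read off the chart. The sketch below assumes a frame with disciplineName and age columns; the rows are toy values with made-up sport names, not results from the dataset.

```python
import pandas as pd

toy_ages = pd.DataFrame({
    "disciplineName": ["Sport A", "Sport A", "Sport B", "Sport B"],
    "age": [16, 18, 30, 50],
})

summary = (
    toy_ages.groupby("disciplineName")["age"]
    .agg(["mean", "min", "max", "count"])
    .sort_values("mean")  # youngest sports first
)
summary["spread"] = summary["max"] - summary["min"]  # width of the age range
print(summary)
```

Sorting by mean gives the youngest and oldest sports, while the spread column points to the widest age distributions.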

10. Conclusion

This project shows that age plays a very different role depending on the sport. Technical and artistic sports like Skateboarding and Gymnastics peak early. Sports requiring experience, strength, and precision, like Equestrian and Shooting, allow athletes to compete and win well into their 40s and 50s. The data is consistent with what many sports scientists suggest: peak athletic age is not universal and depends heavily on what the sport demands.

11. Next Steps

This project is a starting point. There is a lot more you can explore and build on top of it.

Coming soon:

  • A video walkthrough of this project will be published in the coming days, both in the article and on YouTube
  • The full Jupyter Notebook with all code steps will be made available on GitHub
  • Note: parts of this article are based on Paris 2024 data — the Milano Cortina 2026 dataset is not yet complete as the Games are still ongoing

Homework ideas — try it yourself:

  • Update the visualization by gender — do men and women peak at different ages in the same sport?
  • Filter or color by country or continent — which regions dominate in younger vs older age groups?
  • Find the outliers — who is the youngest and the oldest medal winner in the dataset?
  • Improve the visualization to make it more visually appealing

Have ideas or suggestions? Join the discussion in the Discord channel or leave a comment on the article. All feedback is welcome!