1. Introduction
Did you know that the Ancient Olympics date back to 776 BC? Even the philosopher Plato is believed to have competed in wrestling. Plato, Socrates, and Aristotle frequented gymnasia, where the rigorous training for games like the Olympics was seen as a metaphor for intellectual competition and the pursuit of truth. Today, the Games look very different — but age is still a fascinating factor in athletic success.
The Olympic Games bring together athletes from hundreds of countries. They compete for gold, silver, and bronze across dozens of sports. But at what age do athletes actually peak? A gymnast and a marathon runner follow very different paths. I challenged myself to answer this question using data from the medals page of the official Milano Cortina 2026 Olympics website (milano-cortina-2026/medals), recreating a visualization inspired by a popular r/dataisbeautiful post, "[OC] Olympic medal winners average age per sport".
This is a beginner-friendly data science project. You will learn how to:
- Scrape data from a real website
- Work with JSON data from the web
- Clean and organize data
- Create a clear and informative visualization
Whether you are a student or a self-learner, this project covers real-world skills from start to finish.
You can find the final result below:

2. Objectives
The goal of this project is simple:
- We collect medal data from the official Milano Cortina 2026 Olympics website.
- We collect each medal winner's age from their athlete profile.
- Then we clean and organize the data.
- Finally, we build a visualization that shows the average age of medal winners per sport.
The data comes directly from the website's JSON API — no manual downloading needed. The result is a chart inspired by a popular r/dataisbeautiful post, making it easy to compare athlete ages across different sports.
We collect the following variables for each medal winner:
- Name — full name of the athlete
- Age — age at the time of the Games
- Sport — the sport in which the medal was won
- Discipline — the specific event within the sport
- Medal Type — gold, silver, or bronze
- Country — the athlete's representing nation
- Gender — male, female, or mixed event
Note: Most of the variables come from the official Milano Cortina 2026 medals page. Athlete age is taken from the JSON data behind each athlete's profile page. If age is missing, it is calculated from the athlete's date of birth.
3. Language and Tools
| Task | Technique | Tools/Packages |
|---|---|---|
| Data Collection | Web scraping, JSON API requests | requests, json |
| Data Pre-processing | Merging data, handling missing values, calculating age from date of birth | pandas |
| Data Visualization | Scatter plots | plotly |
| Language & Environment | Python | Jupyter Notebook |
4. Data Source
The Milano Cortina 2026 medal data is published by the International Olympic Committee (IOC) through the official Olympics website. It is publicly accessible via a JSON API endpoint — no account or API key is required.
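This project collects the following data from the official website. Medal and event data come from the medals page, and age comes from each athlete's profile page (see Step 4 in Data Collection):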
- Athlete Data: Name, Age, Gender, Country
- Event Data: Sport, Discipline, Event, Medal Type (Gold, Silver, Bronze)
Note: Two alternative datasets are available on Kaggle for those who prefer to skip scraping: a Milano Cortina 2026 dataset and a historical Olympics dataset (1896–2024). This project uses the unofficial API directly, as it provides the most up-to-date results.
⚠️ Before scraping any website, always read its Terms of Use. Use the data for personal and educational purposes only.
⚠️ Websites can change their data structure, pages, and protection at any time, so this guide may become outdated. This already happened with the Paris 2024 site.
5. Data Collection
Before writing any code, we first explore how the website loads its data. This helps us find the raw JSON API endpoint directly.
Step 1 — Inspect the Network Tab
- Open the medals page in your browser
- Right-click anywhere on the page → click Inspect
- Go to the Network tab
- Reload the page (F5)
- In the filter bar, select Fetch/XHR to show only API requests
- Look for a request that contains medals or results in the URL
- Click on it → go to the Response tab to preview the raw JSON
Step 2 — Copy the Request as cURL
- Right-click on the request in the Network tab
- Click Copy → Copy as cURL
- This copies the full request — including headers and cookies — to your clipboard
Step 3 — Inspect the JSON in Insomnia
- Download and open Insomnia
- Click New Request → select Import from cURL
- Paste the copied cURL command
- Click Send
- Explore the JSON structure and identify fields like name, age, sport, medal, country
- Save the result as 'data/milano_olympiad_2026.json' (Milano Cortina 2026) or 'data/paris_olympiad_2024.json' (Paris 2024)
💡 Tip: Importing as cURL is better than copying just the URL. It automatically includes all required headers, which helps avoid blocked or empty responses from the server.
💡 Tip: Insomnia is a free API client for small projects. It makes it easy to explore and test API responses before writing any code. Alternatives include Postman or simply the browser's Network tab itself.
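Once the request works in Insomnia, the same call can be repeated from Python. The sketch below is a minimal example under stated assumptions: `fetch_json` and `save_json` are hypothetical helper names, and the URL and headers must be the ones copied from your own Network tab, since the real endpoint is not reproduced here.

```python
import json
from pathlib import Path


def fetch_json(url, headers=None, timeout=30, get=None):
    """Download `url` and return the decoded JSON body."""
    if get is None:
        import requests  # imported lazily so the helper can be tried offline
        get = requests.get
    response = get(url, headers=headers or {}, timeout=timeout)
    response.raise_for_status()  # fail loudly on 403/404 instead of saving an error page
    return response.json()


def save_json(payload, path):
    """Write `payload` to `path` as UTF-8 JSON, creating the folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
    return path


# Usage (placeholders, replace with the values from your copied cURL command):
# d = fetch_json("<endpoint URL from the Network tab>", headers={"User-Agent": "..."})
# save_json(d, "data/milano_olympiad_2026.json")
```

Calling `raise_for_status()` before parsing means a blocked request stops the script instead of producing an empty or misleading file.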
Step 4 — Collect Athlete Ages
Athlete ages are not included in the medals JSON. We need to fetch them separately from each athlete's profile page.
How to find the age endpoint:
- Open an athlete profile page, e.g.:
  Milano Cortina 2026: https://www.olympics.com/en/milano-cortina-2026/results/athlete-details/41614
  Paris 2024: https://olympics.com/en/paris-2024/athlete/sabrina-maneca-voinea_1550186
- Open the Network tab → reload the page
- Look for a request containing CIS_Bio_Athlete (Paris 2024) or api/v2/athletes?competitionCode (Milano Cortina 2026)
- The endpoint follows this pattern:
  Milano Cortina 2026: https://www.olympics.com/wmr-api/api/v2/athletes?competitionCode=OWG2026&code=41614&languageCode=ENG
  Paris 2024: https://olympics.com/OG2024/data/CIS_Bio_Athlete~comp=OG2024~code={athlete_id}~lang=ENG.json
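The steps above boil down to two small operations: pull the numeric athlete ID out of a profile URL, then drop it into the endpoint pattern. Here is a minimal sketch; the helper names are hypothetical, and the templates are copied from the patterns shown in this step, so they may stop working if the site changes.

```python
import re

# URL templates taken from the endpoint patterns listed in this step.
PARIS_2024 = "https://olympics.com/OG2024/data/CIS_Bio_Athlete~comp=OG2024~code={athlete_id}~lang=ENG.json"
MILANO_2026 = (
    "https://www.olympics.com/wmr-api/api/v2/athletes"
    "?competitionCode=OWG2026&code={athlete_id}&languageCode=ENG"
)


def athlete_id_from_profile(url):
    """Return the trailing numeric ID of a profile URL, or None if there is none."""
    match = re.search(r"(\d+)/?$", url)
    return match.group(1) if match else None


def bio_endpoint(athlete_id, template=MILANO_2026):
    """Fill the athlete ID into one of the endpoint templates."""
    return template.format(athlete_id=athlete_id)


profile = "https://olympics.com/en/paris-2024/athlete/sabrina-maneca-voinea_1550186"
print(bio_endpoint(athlete_id_from_profile(profile), template=PARIS_2024))
```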
Fetch ages for all medalists:
- collect all medalists by:
medalists = df_m['extraData'].apply(pd.Series).drop_duplicates()['detailUrl'].str.split('_', expand=True)[1].dropna().values
medalists
- collect ages for medalists only by:
import json
import random
import time

import pandas as pd
import requests

# Assumption: reuse the request headers shown in Insomnia for the copied cURL command.
headers = {"User-Agent": "Mozilla/5.0"}

data = []
for athlete_id in medalists:
    try:
        time.sleep(random.randint(9, 53))  # polite delay between requests
        url = f"https://olympics.com/OG2024/data/CIS_Bio_Athlete~comp=OG2024~code={athlete_id}~lang=ENG.json"
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
        bio = response.json()
        print(bio.get('person', {}).get('birthDate'))
        data.append(bio)
    except (requests.RequestException, ValueError):
        print(f"Failed for id: {athlete_id}")

df_age = pd.DataFrame(data)['person'].apply(pd.Series)
df_age.to_json('data/age_600_800.json')  # one file per batch of athletes
df_age
⚠️ Important: Always add a delay between requests (time.sleep). Sending too many requests too fast may get your IP blocked. random.randint(9, 53) adds a random wait of 9 to 53 seconds between requests, which mimics human browsing behavior and is more polite to the server.
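With a delay of 9 to 53 seconds per request, a few hundred athletes take hours, so it helps to work in batches and save each batch to its own file. The file name age_600_800.json used above suggests batches of 200 IDs. The sketch below shows one way to produce such batches; `batches` and `batch_filename` are hypothetical helpers, and the batch size is an assumption.

```python
def batches(ids, size=200):
    """Yield (start, stop, chunk) tuples that cover `ids` in order."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        chunk = ids[start:start + size]
        yield start, start + len(chunk), chunk


def batch_filename(start, stop, folder="data"):
    """Build a file name in the age_<start>_<stop>.json style."""
    return f"{folder}/age_{start}_{stop}.json"


# Example with made-up IDs: 450 athletes split into three batches.
for start, stop, chunk in batches(range(450)):
    print(batch_filename(start, stop), len(chunk))
```

If a run is interrupted, only the current batch has to be fetched again.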
6. Data Processing
6.1 Read the JSON File
We start by loading the raw JSON file saved from the API response.
import pandas as pd
import json
from IPython.display import JSON, display

with open('data/milano_olympiad_2026.json') as f:
    d = json.load(f)

JSON(d)
6.2 Explore the JSON Structure
The JSON has a nested structure. We first explore the top-level keys to understand what data is available.
List of countries (NOCs):
pd.DataFrame(d.get('props').get('pageProps').get('nocList'))
result:
| | longName | nameOrder | longNameOrder | continent | id | name | forgeSlug |
|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 10 | 10 | ASI | AFG | Afghanistan | |
| 1 | Albania | 20 | 20 | EUR | ALB | Albania | noc-al |
| 2 | Algeria | 30 | 30 | AFR | ALG | Algeria | noc-dz |
| 3 | American Samoa | 40 | 40 | OCE | ASA | American Samoa | noc-as |
| 4 | Andorra | 50 | 50 | EUR | AND | Andorra | noc-ad |
Medal standings table:
df = pd.DataFrame(
d.get('props')
.get('pageProps')
.get('initialMedals')
.get('medalStandings')
.get('medalsTable')
)
df
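Chained `.get()` calls work as long as every key exists, but they raise an AttributeError as soon as one level is missing, because `None` has no `.get`. A small helper makes the lookup more forgiving. This is a sketch only: `dig` is a hypothetical name, and the sample dictionary mimics the key path used above rather than real API output.

```python
def dig(obj, *keys, default=None):
    """Follow `keys` through nested dicts; return `default` if any key is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


# Toy structure that mirrors the key path used in this section.
sample = {
    "props": {
        "pageProps": {
            "initialMedals": {"medalStandings": {"medalsTable": [{"organisation": "BUL"}]}}
        }
    }
}

table = dig(sample, "props", "pageProps", "initialMedals", "medalStandings", "medalsTable", default=[])
print(table)
```

Passing `default=[]` means `pd.DataFrame(table)` still works, and produces an empty frame, when the site changes its structure.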
6.3 Inspect a Single Country
We can drill down into a single country to understand the nested structure of disciplines and medal winners.
df[df.organisation == 'BUL']['disciplines'] \
.apply(pd.Series).T[25] \
.apply(pd.Series)['medalWinners'] \
.apply(pd.Series).stack() \
.apply(pd.Series)
result:
| | code | name | gold | silver | bronze | total | medalWinners |
|---|---|---|---|---|---|---|---|
| 0 | BOX | Boxing | 0 | 0 | 1 | 1 | [{'disciplineCode': 'BOX', 'eventCode': 'BOXM57KG--------------', 'eventCategory': 'Men', 'eventDescription': 'Men's 57kg', |
| 1 | GRY | Rhythmic Gymnastics | 0 | 1 | 0 | 1 | [{'disciplineCode': 'GRY', 'eventCode': 'GRYW1AA---------------', 'eventCategory': 'Women', 'eventDescription': 'Individual All-Around', |
| 2 | TKW | Taekwondo | 0 | 0 | 1 | 1 | [{'disciplineCode': 'TKW', 'eventCode': 'TKWW57KG--------------', 'eventCategory': 'Women', 'eventDescription': 'Women -57kg', |
| 3 | WLF | Weightlifting | 1 | 0 | 1 | 2 | [{'disciplineCode': 'WLF', 'eventCode': 'WLFM89KG--------------', 'eventCategory': 'Men', 'eventDescription': 'Men's 89kg', |
| 4 | WRE | Wrestling | 2 | 0 | 0 | 2 | [{'disciplineCode': 'WRE', 'eventCode': 'WREMGR87KG------------', 'eventCategory': 'Men', 'eventDescription': 'Men's Greco-Roman 87kg', |
6.4 Map Discipline Codes to Names
The data uses short codes for each sport. We create a mapping dictionary to convert them to readable names.
disciplineCode_mapping = {
    "ARC": "Archery",
    "ATH": "Athletics",
    "BDM": "Badminton",
    # ... add all remaining codes
}
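Typing every code by hand is error-prone. The table in section 6.3 suggests that each discipline record already carries both a code and a name, so the mapping could also be derived from the data itself. The sketch below assumes that structure; `mapping_from_disciplines` is a hypothetical helper, and the sample rows are a small excerpt in the same shape.

```python
def mapping_from_disciplines(medals_table):
    """Collect {code: name} pairs from the 'disciplines' list of every country."""
    mapping = {}
    for country in medals_table:
        for discipline in country.get("disciplines", []):
            code, name = discipline.get("code"), discipline.get("name")
            if code and name:
                mapping.setdefault(code, name)  # keep the first name seen for a code
    return mapping


# Toy input shaped like the medals table (values taken from the example in 6.3).
sample_table = [
    {"organisation": "BUL", "disciplines": [
        {"code": "BOX", "name": "Boxing"},
        {"code": "WRE", "name": "Wrestling"},
    ]},
    {"organisation": "XXX", "disciplines": [{"code": "BOX", "name": "Boxing"}]},
]
print(mapping_from_disciplines(sample_table))
```

With the real data, the input would be the same list that feeds the medal standings DataFrame.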
6.5 Flatten the Full Dataset
Finally, we flatten the nested JSON into a clean, analysis-ready DataFrame and apply the discipline name mapping.
df_m = (
    df['disciplines']
    .apply(pd.Series).stack()
    .apply(pd.Series)['medalWinners']
    .apply(pd.Series).stack()
    .apply(pd.Series)
)
df_m['disciplineName'] = df_m['disciplineCode'].map(disciplineCode_mapping)
df_m
💡 Tip: The combination of .apply(pd.Series) and .stack() is key here. .apply(pd.Series) expands each nested list into columns, and .stack() turns those columns into rows, which makes deeply nested JSON data much easier to work with.
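To see what the flattening does, it helps to run it on a tiny made-up example. The sketch below uses invented names and two countries; it is not real medal data. The extra `.dropna()` is an assumption for safety: depending on the pandas version, `.stack()` may keep the empty cells that appear when lists have different lengths.

```python
import pandas as pd

# Toy data: each cell holds a list of dicts, like 'medalWinners' in the real JSON.
toy = pd.DataFrame({
    "organisation": ["AAA", "BBB"],
    "medalWinners": [
        [{"name": "A. One", "medalType": "ME_GOLD"}, {"name": "A. Two", "medalType": "ME_BRONZE"}],
        [{"name": "B. One", "medalType": "ME_SILVER"}],
    ],
})

wide = toy["medalWinners"].apply(pd.Series)  # one column per list position
long = wide.stack().dropna()                 # one row per dict, empty cells removed
flat = long.apply(pd.Series)                 # one column per dict key

# Alternative with the same result shape: explode the lists, then normalize the dicts.
alt = pd.json_normalize(toy["medalWinners"].explode().tolist())

print(flat)
```

The index of `flat` keeps the original row number and the position inside the list, which is why the real DataFrame ends up with a multi-level index.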
6.6 Load JSON Age Files
We saved the age data in multiple JSON files (e.g. age_0_200.json, age_200_400.json, etc.). We load and combine them all into a single DataFrame.
import glob
import os

import pandas as pd

json_dir = 'data/'
json_pattern = os.path.join(json_dir, 'age*.json')
file_list = sorted(glob.glob(json_pattern))  # sorted for a reproducible order

dfs = []
for file in file_list:
    df_temp = pd.read_json(file)  # read_json opens the file itself
    df_temp['file'] = os.path.basename(file)  # track which file the row came from
    dfs.append(df_temp)

df_age = pd.concat(dfs)
We only need two columns — the athlete's code and their birth date:
df_age[['birthDate', 'code']]
6.7 Merge Age Data with Medal Data
Now we merge the age data with our medals DataFrame. We match athletes using their unique competitor code.
df_m = pd.merge(
df_m,
df_age,
right_on='code',
left_on='competitorCode',
how='inner'
)
df_m
💡 Note: We use an inner join, which keeps only athletes that appear in both DataFrames. Athletes with missing age data will be excluded from the final visualization.
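The visualization in section 8 reads an age column from a DataFrame called df_plotly, and neither is built in the steps shown so far. The sketch below is one possible way to get there, not necessarily the author's method. It assumes birthDate is an ISO date string, uses 6 February 2026 (the opening day of Milano Cortina 2026) as the reference date, and labels the per-sport mean rows "Average", which is a made-up category name. The toy rows are invented.

```python
import pandas as pd


def age_on(birth_dates, reference_date):
    """Completed years between each birth date and `reference_date`."""
    born = pd.to_datetime(pd.Series(birth_dates), errors="coerce")
    ref = pd.Timestamp(reference_date)
    # Subtract one year when the birthday has not yet occurred in the reference year.
    birthday_pending = (born.dt.month > ref.month) | (
        (born.dt.month == ref.month) & (born.dt.day > ref.day)
    )
    return ref.year - born.dt.year - birthday_pending.astype(int)


# Toy stand-in for the merged medals DataFrame.
toy = pd.DataFrame({
    "disciplineName": ["Sport A", "Sport A", "Sport B"],
    "medalType": ["ME_GOLD", "ME_SILVER", "ME_GOLD"],
    "birthDate": ["2000-02-10", "1990-01-15", None],
})
toy["age"] = age_on(toy["birthDate"], "2026-02-06")

known = toy.dropna(subset=["age"])
averages = (
    known.groupby("disciplineName", as_index=False)["age"].mean()
    .assign(medalType="Average")  # assumed label for the per-sport mean
)
# Average rows go first so they become the first category in the chart.
df_plotly = pd.concat([averages, known], ignore_index=True)
print(df_plotly)
```

For Paris 2024 data, the reference date would change accordingly, for example to the opening day on 26 July 2024.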
7. Exploratory Data Analysis (EDA)
7.1 Inspect a Single Sport
We start by filtering the data for a single discipline. Here we look at Rowing as an example.
df_p = df_m[df_m['disciplineCode'] == 'ROW']
df_p
The result below is truncated for readability:
| | | | disciplineCode | eventCode | eventCategory | eventDescription | eventOrder | medalType |
|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 0 | ROW | ROWMNOCOX4------------ | Men | Men's Four | 6 | ME_GOLD |
| | | 1 | ROW | ROWMCOXED8------------ | Men | Men's Eight | 14 | ME_BRONZE |
| 3 | 12 | 0 | ROW | ROWWNOCOX2------------ | Women | Women's Pair | 1 | ME_BRONZE |
| 5 | 9 | 0 | ROW | ROWWNOCOX2------------ | Women | Women's Pair | 1 | ME_GOLD |
| | | 1 | ROW | ROWWNOCOX4------------ | Women | Women's Four | 5 | ME_GOLD |
7.2 Pivot by Event and Medal Type
We create a pivot table to get a cleaner view — one row per event, one column per medal type.
pd.pivot(
df_p.drop((21, 1, 2)),
index='eventDescription',
columns='medalType',
values='competitorDisplayName'
)
Output:
| eventDescription | ME_BRONZE | ME_GOLD | ME_SILVER |
|---|---|---|---|
| Lightweight Men's Double Sculls | PAPAKONSTA/GKAIDATZIS | Mc CARTHY/O DONOVAN | OPPO/SOARES |
| Lightweight Women's Double Sculls | KONTOU/FITSIOU | CRAIG/GRANT | van GRONINGEN/COZMIUC |
| Men's Double Sculls | LYNCH/DOYLE | CORNEA/ENACHE | TWELLAAR/BROENINK |
| Men's Eight | United States | Great Britain | Netherlands |
| Men's Four | Great Britain | United States | New Zealand |
7.3 Inspect a Specific Event
We can also filter the full dataset down to a single event to inspect the medal winners directly. High Jump is an Athletics event, so we filter df_m rather than the Rowing subset:
df_m[df_m['eventDescription'] == "Women's High Jump"]
7.4 Handle Duplicate Entries
The pivot above may throw an error:
ValueError: Index contains duplicate entries, cannot reshape
This happens because some events have duplicate rows. We can identify them like this:
df_p[df_p.duplicated(subset=['eventDescription', 'medalType'])].T
⚠️ Note: Duplicate entries can occur in team events or relay races — where multiple athletes share the same event and medal type. These need to be handled before pivoting. Options include dropping duplicates, aggregating athlete names into a list, or filtering to individual events only.
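Of the options listed, aggregating the names is the one that loses no rows. A minimal sketch with invented data: `pd.pivot_table` accepts an aggregation function, so duplicate event and medal combinations can be joined into one cell instead of raising an error. The column names match the ones used above; the rows are made up.

```python
import pandas as pd

# Toy data: one event awards two bronze medals, which would break pd.pivot.
toy = pd.DataFrame({
    "eventDescription": ["Men's Eight", "Men's Eight", "Men's 57kg", "Men's 57kg", "Men's 57kg"],
    "medalType": ["ME_GOLD", "ME_SILVER", "ME_GOLD", "ME_BRONZE", "ME_BRONZE"],
    "competitorDisplayName": ["Team A", "Team B", "Athlete C", "Athlete D", "Athlete E"],
})

wide = pd.pivot_table(
    toy,
    index="eventDescription",
    columns="medalType",
    values="competitorDisplayName",
    aggfunc=lambda names: " / ".join(names),  # merge duplicates into one cell
)
print(wide)
```

Combinations without a winner in the toy data, such as bronze in Men's Eight, show up as NaN.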
8. Visualization
We use Plotly Express to create an interactive scatter plot. Each point represents a medal winner. The y-axis shows the sport, the x-axis shows the athlete's age, and the color indicates the medal type.
import plotly.express as px

fig = px.scatter(
    df_plotly,
    y="disciplineName",
    x="age",
    color="medalType",
    symbol="medalType",
    symbol_sequence=['diamond', 'circle', 'circle', 'circle'],
    color_discrete_sequence=['green', 'silver', 'gold', 'brown']
)

fig.update_layout(
    title="Olympic medal winners average age per sport",
    xaxis=dict(
        showgrid=False,
        showline=True,
        linecolor='black',
        tickfont_color='black',
        showticklabels=True,
        dtick=10,
        ticks='outside',
        tickcolor='rgb(102, 102, 102)',
    ),
    margin=dict(l=140, r=40, b=50, t=80),
    legend=dict(
        font_size=10,
        yanchor='middle',
        xanchor='right',
    ),
    autosize=False,
    width=800,
    height=1200,
    paper_bgcolor='white',
    plot_bgcolor='white',
    hovermode='closest',
)

fig.update_traces(
    mode='markers',
    marker=dict(
        line_width=1,
        opacity=0.4,
        size=14,
        line=dict(
            color='MediumPurple',
            width=2
        )
    )
)

fig.show()
Key design choices:
- symbol_sequence: diamonds mark the average age per sport, circles mark the individual gold, silver, and bronze medal winners
- color_discrete_sequence: green, silver, gold, and brown map to the four categories (the average plus the three medal types)
- opacity=0.4: transparent markers help visualize overlapping data points
- hovermode='closest': hovering over a point shows the athlete's details
- paper_bgcolor and plot_bgcolor: a white background keeps the chart clean and minimal
💡 Tip: Plotly charts are interactive by default. You can zoom, pan, and hover over individual points to explore the data. To save the chart as a static image, use fig.write_image("chart.png"), which requires the kaleido package.
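One caveat with symbol_sequence and color_discrete_sequence is that Plotly Express assigns them in the order in which the categories first appear in the data, so gold could end up silver-colored if the rows are ordered differently. A possible safeguard is to state the mapping explicitly. The sketch below only builds the keyword arguments; the "Average" label is an assumed category name for the per-sport mean rows, and the medal codes follow the ME_GOLD style seen in the data.

```python
# Explicit style per category ("Average" is an assumed label, adjust to your data).
STYLE = {
    "Average": {"color": "green", "symbol": "diamond"},
    "ME_SILVER": {"color": "silver", "symbol": "circle"},
    "ME_GOLD": {"color": "gold", "symbol": "circle"},
    "ME_BRONZE": {"color": "brown", "symbol": "circle"},
}

scatter_kwargs = dict(
    color="medalType",
    symbol="medalType",
    category_orders={"medalType": list(STYLE)},  # fixes legend and drawing order
    color_discrete_map={name: s["color"] for name, s in STYLE.items()},
    symbol_map={name: s["symbol"] for name, s in STYLE.items()},
)

# Usage: px.scatter(df_plotly, y="disciplineName", x="age", **scatter_kwargs)
print(scatter_kwargs["color_discrete_map"])
```

These arguments would replace symbol_sequence and color_discrete_sequence in the px.scatter call.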
9. Results
The scatter plot shows the age distribution of Olympic medal winners across all sports at the Paris 2024 Games. Each point represents one athlete. The diamond marker shows the average age per sport.
Key observations:
- Skateboarding and Rhythmic Gymnastics have the youngest medal winners — most athletes cluster between ages 15–20
- Equestrian has the oldest and widest age distribution — with medal winners ranging from their 30s all the way to their 60s
- Artistic Gymnastics and Diving also skew young — with most winners in their late teens and early 20s
- Wrestling, Athletics, and Judo show a wide age spread — suggesting these sports are competitive across a broader age range
- Tennis shows one clear outlier on the right — Novak Djokovic won the gold medal in men's singles tennis at the 2024 Paris Olympics at the age of 37
- Shooting has one of the widest distributions — with winners spanning from around age 15 to nearly 60
- Rowing and Equestrian have the oldest average ages — their diamonds sit furthest to the right
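Observations like these can be checked against numbers instead of being read off the chart. The sketch below computes the mean, youngest, oldest, and spread per sport; `age_summary` is a hypothetical helper, and the toy values are invented purely to show the output shape, not taken from the Olympic data.

```python
import pandas as pd


def age_summary(df, sport_col="disciplineName", age_col="age"):
    """Per-sport age statistics, sorted from youngest to oldest average."""
    summary = df.groupby(sport_col)[age_col].agg(
        mean="mean", youngest="min", oldest="max", n="count"
    )
    summary["spread"] = summary["oldest"] - summary["youngest"]
    return summary.sort_values("mean")


# Toy data with made-up sports and ages.
toy = pd.DataFrame({
    "disciplineName": ["Sport B", "Sport A", "Sport A", "Sport B", "Sport A", "Sport B"],
    "age": [30, 16, 18, 45, 20, 60],
})
summary = age_summary(toy)
print(summary)
```

On the real data, this would be applied to the athlete rows only, without the per-sport average rows used for the diamonds.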
10. Conclusion
This project shows that age plays a very different role depending on the sport. Technical and artistic sports like Skateboarding and Gymnastics peak early. Sports requiring experience, strength, and precision, like Equestrian and Shooting, allow athletes to compete and win well into their 40s and 50s. The data is consistent with what many sports scientists suggest: peak athletic age is not universal and depends largely on what the sport demands.
11. Next Steps
This project is a starting point. There is a lot more you can explore and build on top of it.
Coming soon:
- A video walkthrough of this project will be published in the coming days, both in the article and on YouTube
- The full Jupyter Notebook with all code steps will be made available on GitHub
- Note: parts of this article are based on Paris 2024 data — the Milano Cortina 2026 dataset is not yet complete as the Games are still ongoing
Homework ideas — try it yourself:
- Update the visualization by gender — do men and women peak at different ages in the same sport?
- Filter or color by country or continent — which regions dominate in younger vs older age groups?
- Find the outliers — who is the youngest and the oldest medal winner in the dataset?
- Improve the visualization to make it more visually appealing
Have ideas or suggestions? Join the discussion in the Discord channel or leave a comment on the article. All feedback is welcome!