Can Python Predict the World Cup Favorites? A Data‑Driven Analysis of All 32 Teams
This article uses a Kaggle dataset of roughly 40,000 matches from 1872 to the present and applies pandas and matplotlib in Python to derive a winner column for each match, compare the historical performance of the top five and top nine World Cup nations, explore the impact of friendly matches, and generate visualisations that help assess each team's championship likelihood.
The analysis begins with a Kaggle dataset that aggregates World Cup finals, qualifiers, continental championships, and friendlies, providing about 40,000 match records with fields such as date, home team, away team, home goals, away goals, tournament type, city, and country.
Environment Setup
Windows 7
Python 3.6
Jupyter Notebook
pandas 0.22.0
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
# Use a font that can render Chinese characters in plot labels
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
df = pd.read_csv('results.csv')
print(df.head())
The DataFrame columns include date, home_team, away_team, home_score, away_score, tournament, city, country, and neutral.
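As a quick sanity check on the loading step, the sketch below reads a two-row CSV string with the column layout listed above and confirms that every expected column is present. The rows, team names, and city are invented for illustration, and parsing the date column at load time via parse_dates is an optional variation rather than what the article does.

```python
import io

import pandas as pd

# Two invented rows in the column layout described above (not real results)
csv_text = """date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
2018-06-14,A,B,5,0,FIFA World Cup,X City,Russia,False
2018-06-15,C,D,0,1,FIFA World Cup,X City,Russia,True
"""

# parse_dates converts the date column to datetime64 while reading
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

expected = ['date', 'home_team', 'away_team', 'home_score', 'away_score',
            'tournament', 'city', 'country', 'neutral']
missing = [col for col in expected if col not in sample.columns]

print(sample.dtypes)
print('missing columns:', missing)
```

With the real file, the same check would use pd.read_csv('results.csv', parse_dates=['date']).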
Creating a Winner Column
mask = df['home_score'] - df['away_score']
df.loc[mask > 0, 'win_team'] = df.loc[mask > 0, 'home_team']
df.loc[mask < 0, 'win_team'] = df.loc[mask < 0, 'away_team']
df.loc[mask == 0, 'win_team'] = 'Draw'
All matches containing "FIFA" in the tournament field are extracted as the full World Cup dataset (including qualifiers).
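As a small-scale check of the winner column and the "FIFA" substring filter, the sketch below runs both on a hand-made frame. The rows and scores are invented, the tournament labels are assumed to resemble those in the dataset, and numpy.select is used here as a one-step alternative to the three .loc assignments, not as the article's own method.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only; not real results
toy = pd.DataFrame({
    'home_team': ['A', 'B', 'C', 'A'],
    'away_team': ['B', 'C', 'A', 'C'],
    'home_score': [2, 0, 1, 0],
    'away_score': [1, 0, 2, 3],
    'tournament': ['FIFA World Cup', 'Friendly',
                   'FIFA World Cup qualification', 'Copa América'],
})

# Positive difference: home side wins; negative: away side wins; zero: draw
diff = toy['home_score'] - toy['away_score']
toy['win_team'] = np.select(
    [diff > 0, diff < 0],
    [toy['home_team'], toy['away_team']],
    default='Draw',
)

# The substring match keeps both the finals and the qualifiers
toy_fifa = toy[toy['tournament'].str.contains('FIFA')]

print(toy[['home_team', 'away_team', 'win_team']])
print(toy_fifa['tournament'].tolist())
```

Because the filter is a plain substring match, any tournament whose name contains "FIFA" is kept, which is why qualifiers end up in the same subset as the finals.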
df_FIFA_all = df[df['tournament'].str.contains('FIFA', regex=True)]
Top‑5 Historical Powerhouses
Based on overall win counts, the five strongest teams are Germany, Argentina, Brazil, France, and Spain. Their head‑to‑head results across 43 matches are computed and visualised.
team_top5 = ['Germany', 'Argentina', 'Brazil', 'France', 'Spain']
df_FIFA_top5 = df_FIFA_all[(df_FIFA_all['home_team'].isin(team_top5)) & (df_FIFA_all['away_team'].isin(team_top5))]
s_FIFA_top5 = df_FIFA_top5.groupby('win_team')['win_team'].count()
s_FIFA_top5.drop('Draw', inplace=True)
s_FIFA_top5.sort_values(ascending=False, inplace=True)
s_FIFA_top5.plot(kind='bar', figsize=(10,6), title='Top Five in World Cup')
The bar chart shows Germany leading the win count, followed by Brazil and Argentina.
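One detail of the counting step is that drop('Draw') raises a KeyError when a subset happens to contain no draws. The sketch below shows a more defensive variant on invented outcomes; it swaps groupby/count for value_counts, which is a substitution made here and not the article's code.

```python
import pandas as pd

# Hypothetical head-to-head outcomes, invented for illustration
wins = pd.Series(['Germany', 'Brazil', 'Draw', 'Germany', 'Spain'], name='win_team')

# value_counts() sorts in descending order by default;
# errors='ignore' avoids a KeyError if the subset contains no draws
counts = wins.value_counts().drop('Draw', errors='ignore')

print(counts)
```

The resulting Series can be passed to plot(kind='bar') exactly as before.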
Pairwise Comparisons of the Top‑5
Custom functions team_vs and team_vs_plot retrieve win counts for any two teams and plot them.
def team_vs(df, team_A, team_B):
    # Keep only the matches in which the two teams faced each other
    df_pair = df[(df['home_team'].isin([team_A, team_B])) & (df['away_team'].isin([team_A, team_B]))]
    return df_pair.groupby('win_team')['win_team'].count()

def team_vs_plot(df, team_A, team_B, ax):
    # Draw the head-to-head outcome counts (draws included) on the given axes
    s = team_vs(df, team_A, team_B)
    s.plot(kind='bar', ax=ax)
    ax.set_xlabel('')
    ax.set_title(f'{team_A} vs {team_B}', fontsize=10)
    ax.set_xticklabels(s.index, rotation=20)
Four sub‑plots compare Brazil against Germany, Argentina, France, and Spain, revealing mixed results (e.g., Brazil 1‑1 Germany, Brazil 6‑3 Argentina, Brazil 1‑2 France, Brazil 3‑1 Spain).
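The code for the four sub-plots is not shown in the article, so the following is only one plausible way to build them: a 2x2 grid from plt.subplots, with one call to team_vs_plot per opponent. The helpers are restated so the block runs on its own, the toy frame and its results are invented, and the Agg backend is selected only so the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

def team_vs(df, team_A, team_B):
    pair = [team_A, team_B]
    df_pair = df[df['home_team'].isin(pair) & df['away_team'].isin(pair)]
    return df_pair.groupby('win_team')['win_team'].count()

def team_vs_plot(df, team_A, team_B, ax):
    s = team_vs(df, team_A, team_B)
    s.plot(kind='bar', ax=ax)
    ax.set_xlabel('')
    ax.set_title(f'{team_A} vs {team_B}', fontsize=10)
    ax.set_xticklabels(s.index, rotation=20)

# Hypothetical stand-in for df_FIFA_all; these outcomes are invented
toy = pd.DataFrame({
    'home_team': ['Brazil', 'Germany', 'Brazil', 'France', 'Spain', 'Brazil'],
    'away_team': ['Germany', 'Brazil', 'Argentina', 'Brazil', 'Brazil', 'Spain'],
    'win_team': ['Brazil', 'Germany', 'Draw', 'France', 'Brazil', 'Spain'],
})

opponents = ['Germany', 'Argentina', 'France', 'Spain']
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# axes.flat walks the 2x2 grid row by row, one panel per opponent
for ax, opponent in zip(axes.flat, opponents):
    team_vs_plot(toy, 'Brazil', opponent, ax)

fig.tight_layout()
```

With the real data, toy would be replaced by df_FIFA_all. A pair that never met would yield an empty Series, so a guard before plotting may be needed in that case.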
Extending the Scope to 2014‑Onward
To focus on recent performance, matches from 2014 onward are filtered, yielding over 3,600 games. Win counts per nation are aggregated, and the top 50 are displayed.
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df_since_2014 = df[df['year'] >= 2014]
s_all = df_since_2014.groupby('win_team')['win_team'].count()
s_all.drop('Draw', inplace=True)
s_all.sort_values(ascending=True, inplace=True)
s_all.tail(50).plot(kind='barh', figsize=(8,16), title='Top 50 in all tournament since 2014')
The chart highlights Mexico, France, Germany, Portugal, Brazil, Belgium, South Korea, and Spain as strong recent performers.
Top‑9 Nations Since 2014
From the previous ranking, the nine most successful teams are Brazil, France, Portugal, Argentina, Mexico, Belgium, Germany, Spain, and England. Their mutual matches (44 games) are analysed.
team_top9 = ['Brazil','France','Portugal','Argentina','Mexico','Belgium','Germany','Spain','England']
df_top9 = df_since_2014[(df_since_2014['home_team'].isin(team_top9)) & (df_since_2014['away_team'].isin(team_top9))]
s_top9 = df_top9.groupby('win_team')['win_team'].count()
s_top9.drop('Draw', inplace=True)
s_top9.sort_values(ascending=False, inplace=True)
s_top9.plot(kind='bar', figsize=(10,6), title='Top 9 in all tournament since 2014')
Friendly matches constitute a large share of the data; a separate analysis removes them, showing a shift in win counts and a reduced number of matches (13 competitive games among the nine).
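The friendly-free analysis is described but not shown. A minimal sketch of how such a filter might look is given below, assuming that friendlies carry the label 'Friendly' in the tournament column; the rows and outcomes are invented.

```python
import pandas as pd

# Hypothetical rows; 'Friendly' is an assumed label for friendly matches
toy = pd.DataFrame({
    'home_team': ['Brazil', 'France', 'Germany', 'Spain'],
    'away_team': ['France', 'Germany', 'Spain', 'Brazil'],
    'tournament': ['Friendly', 'UEFA Euro', 'Friendly', 'FIFA World Cup'],
    'win_team': ['Brazil', 'France', 'Draw', 'Spain'],
})

# Fraction of matches that are friendlies (mean of a boolean mask)
share_friendly = (toy['tournament'] == 'Friendly').mean()

# Keep competitive matches only, then count wins as before
competitive = toy[toy['tournament'] != 'Friendly']
wins = competitive['win_team'].value_counts().drop('Draw', errors='ignore')

print(f'{share_friendly:.0%} of the toy matches are friendlies')
print(wins)
```

Applied to df_top9, the same two lines would give the competitive-only subset and its win counts.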
Custom Functions for Historical Win‑Rate
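Two helper functions compute win and draw shares: probability uses every match between two teams played in or after a given year, and his_team_data repeats that calculation for each starting year in a range and returns the results as a DataFrame ready for plotting.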
def probability(df, year, team_A, team_B):
    # Consider every match played in or after the given year
    df_year = df[df['year'] >= year]
    s = team_vs(df_year, team_A, team_B)
    total = s.sum()
    a_win = s.get(team_A, 0) / total if total else 0
    b_win = s.get(team_B, 0) / total if total else 0
    # Read draws directly so an empty window does not report a 100% draw rate
    draw = s.get('Draw', 0) / total if total else 0
    return [year, a_win, b_win, draw]

def his_team_data(df, start, end, team_A, team_B):
    # One row of win and draw shares per starting year
    rows = []
    for yr in range(start, end+1):
        rows.append(probability(df, yr, team_A, team_B))
    return pd.DataFrame(rows, columns=['year', f'{team_A}_win_percentage', f'{team_B}_win_percentage', 'draw_percentage'])
Using these, the article visualises Brazil vs Germany win‑rate trends from 1930‑2016 and 2000‑2016, and demonstrates how to loop over multiple opponents (e.g., Germany vs France, Portugal, …) to generate a series of plots.
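To make the shape of the output concrete, the sketch below runs compact restatements of the helpers on four invented Brazil vs Germany results. The data are hypothetical, and returning zeros when a window contains no matches is an assumption of this sketch.

```python
import pandas as pd

def team_vs(df, team_A, team_B):
    pair = [team_A, team_B]
    df_pair = df[df['home_team'].isin(pair) & df['away_team'].isin(pair)]
    return df_pair.groupby('win_team')['win_team'].count()

def probability(df, year, team_A, team_B):
    # Shares over all matches played in or after `year`
    s = team_vs(df[df['year'] >= year], team_A, team_B)
    total = s.sum()
    if not total:
        return [year, 0, 0, 0]  # assumption: no matches means all shares are zero
    return [year, s.get(team_A, 0) / total, s.get(team_B, 0) / total,
            s.get('Draw', 0) / total]

def his_team_data(df, start, end, team_A, team_B):
    rows = [probability(df, yr, team_A, team_B) for yr in range(start, end + 1)]
    cols = ['year', f'{team_A}_win_percentage', f'{team_B}_win_percentage',
            'draw_percentage']
    return pd.DataFrame(rows, columns=cols)

# Hypothetical results, invented for illustration
toy = pd.DataFrame({
    'year': [2000, 2001, 2002, 2003],
    'home_team': ['Brazil', 'Germany', 'Brazil', 'Germany'],
    'away_team': ['Germany', 'Brazil', 'Germany', 'Brazil'],
    'win_team': ['Brazil', 'Germany', 'Draw', 'Germany'],
})

trend = his_team_data(toy, 2000, 2003, 'Brazil', 'Germany')
print(trend)
```

Each row covers a shrinking window (2000 onward, 2001 onward, and so on), so the later rows rest on fewer matches and swing more sharply. Calling trend.plot(x='year') would draw the three shares as lines, and wrapping the his_team_data call in a loop over a list of opponents gives one such table per pairing.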
Prediction Commentary
After the quantitative analysis, the author offers a qualitative prediction, noting recent upsets, the uncertain form of Argentina, and Germany’s partial recovery. The author stresses that the results are purely exploratory and should not be taken as definitive forecasts.
Conclusion
The notebook showcases a complete end‑to‑end workflow: data acquisition, cleaning, feature engineering (win_team), exploratory analysis of historical and recent performance, handling of friendly matches, and custom visualisation functions. It demonstrates how Python’s pandas and matplotlib can be leveraged to extract actionable insights from large sports datasets.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
