Can Python Predict the 2018 World Cup Champion? A Data‑Driven Analysis
Using a Kaggle dataset of over 40,000 international matches played from 1872 to 2018, this notebook shows how to clean, transform, and visualize World Cup data with Python, pandas, and Matplotlib. The goal is to identify the teams with the most wins, compare total goals scored, and forecast the most likely 2018 champion.
Data Source and Environment
The analysis uses a Kaggle dataset containing roughly 40,000 matches, including World Cups, qualifiers, Asian Cups, European Cups, and friendlies from 1872 to the present. The environment consists of Windows 7, Python 3.6, Jupyter Notebook, and pandas 0.22.0.
Initial Data Exploration
Key columns in the CSV file are date, home_team, away_team, home_score, away_score, tournament, city, country, and neutral.
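As a minimal sketch of the column layout described above, the snippet below reads a two-row toy CSV and checks that the expected columns are present. The sample rows, the `EXPECTED_COLUMNS` list, and the `load_results` helper are assumptions made for illustration; they are not part of the original notebook or the Kaggle file.

```python
import io
import pandas as pd

# Toy rows that only mimic the column layout; these are not real results.
SAMPLE_CSV = """date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
1930-07-13,Team A,Team B,4,1,FIFA World Cup,City X,Country X,False
1930-07-14,Team C,Team D,0,0,Friendly,City Y,Country Y,True
"""

EXPECTED_COLUMNS = [
    "date", "home_team", "away_team", "home_score", "away_score",
    "tournament", "city", "country", "neutral",
]

def load_results(source):
    """Read a results CSV, parse the date column, and verify the schema."""
    frame = pd.read_csv(source, parse_dates=["date"])
    missing = [col for col in EXPECTED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return frame

sample = load_results(io.StringIO(SAMPLE_CSV))
print(sample.dtypes)
```

Parsing `date` at read time is one option; the notebook itself converts the column later with `pd.to_datetime`.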
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
df = pd.read_csv('results.csv')
df.head()
Filtering World Cup Matches
df_FIFA_all = df[df['tournament'].str.contains('FIFA', regex=True)]
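# .copy() makes df_FIFA an independent frame, so adding columns to it later does not raise SettingWithCopyWarning
df_FIFA = df_FIFA_all[df_FIFA_all['tournament'] == 'FIFA World Cup'].copy()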
df_FIFA.head()
Data Cleaning and Feature Engineering
# Assign the column directly so it takes the datetime dtype that the .dt accessor needs
df_FIFA['date'] = pd.to_datetime(df_FIFA['date'])
df_FIFA['year'] = df_FIFA['date'].dt.year
# Positive: home win; negative: away win; zero: draw
df_FIFA['diff_score'] = df_FIFA['home_score'] - df_FIFA['away_score']
df_FIFA['win_team'] = ''
# Determine winner for each match
df_FIFA.loc[df_FIFA['diff_score'] > 0, 'win_team'] = df_FIFA['home_team']
df_FIFA.loc[df_FIFA['diff_score'] < 0, 'win_team'] = df_FIFA['away_team']
df_FIFA.loc[df_FIFA['diff_score'] == 0, 'win_team'] = 'Draw'
df_FIFA.head()
Top 20 Teams by Number of Wins
s = df_FIFA.groupby('win_team')['win_team'].count()
s.sort_values(ascending=False, inplace=True)
s.drop(labels=['Draw'], inplace=True)
s.head(20).plot(kind='bar', figsize=(10,6), title='Top 20 Winners of World Cup')
Total Goals per Team
# Combine home and away scores
df_score_home = df_FIFA[['home_team', 'home_score']].rename(columns={'home_team':'team','home_score':'score'})
df_score_away = df_FIFA[['away_team', 'away_score']].rename(columns={'away_team':'team','away_score':'score'})
df_score = pd.concat([df_score_home, df_score_away], ignore_index=True)
s_score = df_score.groupby('team')['score'].sum()
s_score.sort_values(ascending=False, inplace=True)
# s_score is sorted in descending order, so head(20) selects the 20 highest-scoring teams
s_score.head(20).plot(kind='barh', figsize=(10,6), title='Top 20 in Total Scores of World Cup')
Key Findings from Historical Data
By win count, Brazil, Germany, Italy, and Argentina have the strongest records.
By total goals, Germany, Brazil, Argentina, and Italy lead.
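The two rankings above can also be placed side by side in a single table. The sketch below does this on a toy set of matches; the team names, the scores, and the helpers `label_winner` and `summarize` are hypothetical and only illustrate the mechanics, not the notebook's actual results.

```python
import numpy as np
import pandas as pd

# Toy matches (hypothetical teams and scores), used only to show the logic.
matches = pd.DataFrame({
    "home_team":  ["A", "B", "C", "A"],
    "away_team":  ["B", "C", "A", "C"],
    "home_score": [2, 1, 0, 3],
    "away_score": [0, 1, 1, 3],
})

def label_winner(frame):
    """Return a Series naming the winner of each match, or 'Draw'."""
    diff = frame["home_score"] - frame["away_score"]
    # Home team when diff > 0, away team when diff < 0, otherwise 'Draw'.
    winner = np.where(diff > 0, frame["home_team"],
                      np.where(diff < 0, frame["away_team"], "Draw"))
    return pd.Series(winner, index=frame.index, name="win_team")

def summarize(frame):
    """Build one table with win counts and total goals per team."""
    winners = label_winner(frame)
    wins = winners[winners != "Draw"].value_counts()
    home = frame[["home_team", "home_score"]].rename(
        columns={"home_team": "team", "home_score": "score"})
    away = frame[["away_team", "away_score"]].rename(
        columns={"away_team": "team", "away_score": "score"})
    goals = pd.concat([home, away], ignore_index=True).groupby("team")["score"].sum()
    # Teams without a win get NaN from the alignment; treat that as zero wins.
    table = pd.DataFrame({"wins": wins, "goals": goals}).fillna(0).astype(int)
    return table.sort_values(["wins", "goals"], ascending=False)

table = summarize(matches)
print(table)
```

Keeping both measures in one frame makes it easier to see where the win ranking and the goal ranking disagree.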
2018 World Cup – 32‑Team Analysis
The 32 qualified teams are listed below, eight per row (Group 1 to Group 4). The analysis first checks which of them are appearing at a World Cup for the first time.
team_list = ['Russia','Germany','Brazil','Portugal','Argentina','Belgium','Poland','France',
'Spain','Peru','Switzerland','England','Colombia','Mexico','Uruguay','Croatia',
'Denmark','Iceland','Costa Rica','Sweden','Tunisia','Egypt','Senegal','Iran',
'Serbia','Nigeria','Australia','Japan','Morocco','Panama','Korea Republic','Saudi Arabia']
# Identify debut teams
for item in team_list:
    if item not in s_score.index:
        print(item)
The output shows that Iceland and Panama are debutants.
Since they lack historical data, they are excluded from the subsequent win/goal calculations for the 32‑team set.
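A minimal sketch of the debut check and the subset filter, run on toy data. The team names and the helpers `find_debutants` and `both_qualified` are hypothetical; the notebook itself uses a loop over `s_score.index` and an inline `isin` mask.

```python
import pandas as pd

# Hypothetical team names and fixtures, used only to show the mechanics.
qualified = ["A", "B", "D"]
matches = pd.DataFrame({
    "home_team": ["A", "B", "C"],
    "away_team": ["B", "C", "A"],
})

def find_debutants(team_list, frame):
    """Teams in team_list that never appear as home or away side in frame."""
    seen = set(frame["home_team"]) | set(frame["away_team"])
    return [team for team in team_list if team not in seen]

def both_qualified(frame, team_list):
    """Keep only matches in which both sides are in team_list."""
    mask = frame["home_team"].isin(team_list) & frame["away_team"].isin(team_list)
    return frame[mask].copy()

debutants = find_debutants(qualified, matches)
subset = both_qualified(matches, qualified)
print(debutants)
print(subset)
```

Because the mask requires both sides to be in the list, matches against teams outside the list are dropped as well, so the resulting counts reflect only head-to-head results among the qualified teams.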
df_top32 = df_FIFA[(df_FIFA['home_team'].isin(team_list)) & (df_FIFA['away_team'].isin(team_list))]
Win Counts for the 32 Teams (1872‑present)
s_32 = df_top32.groupby('win_team')['win_team'].count()
s_32.sort_values(ascending=False, inplace=True)
s_32.drop(labels=['Draw'], inplace=True)
s_32.plot(kind='barh', figsize=(8,12), title='Top 32 of World Cup since year 1872')
Total Goals for the 32 Teams (1872‑present)
# Rebuild the combined score dataframe, this time from the 32-team subset
df_score_home_32 = df_top32[['home_team','home_score']].rename(columns={'home_team':'team','home_score':'score'})
df_score_away_32 = df_top32[['away_team','away_score']].rename(columns={'away_team':'team','away_score':'score'})
df_score_32 = pd.concat([df_score_home_32, df_score_away_32], ignore_index=True)
s_score_32 = df_score_32.groupby('team')['score'].sum()
s_score_32.sort_values(ascending=False, inplace=True)
s_score_32.plot(kind='barh', figsize=(8,12), title='Top 32 in Total Scores of World Cup since year 1872')
Insights from Different Time Windows
Since 1978: Argentina, Germany, and Brazil dominate both win counts and goal totals, with Germany showing a clearer edge in scoring.
Since 2002: Germany, Argentina, and Brazil remain the top three, again with Germany's statistical advantage most pronounced.
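The article does not show how the time windows were computed. One plausible approach, assuming the `year` and `win_team` columns built during data cleaning, is to filter by year before counting wins. The toy data and the `wins_since` helper below are hypothetical.

```python
import pandas as pd

# Toy rows with hypothetical teams; only the column names follow the notebook.
matches = pd.DataFrame({
    "year": [1974, 1978, 2002, 2014],
    "win_team": ["A", "B", "Draw", "A"],
})

def wins_since(frame, start_year):
    """Count wins per team for matches played in start_year or later."""
    recent = frame[(frame["year"] >= start_year) & (frame["win_team"] != "Draw")]
    return recent["win_team"].value_counts()

wins_1978 = wins_since(matches, 1978)
wins_2002 = wins_since(matches, 2002)
print(wins_1978)
print(wins_2002)
```

The same year filter could be applied to the combined score frame to get goal totals per window.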
Overall Prediction for the 2018 World Cup
Based on the historical performance of the 32 qualified teams, the analysis predicts Germany, Argentina, and Brazil as the most likely top three finishers, with Germany being the strongest candidate for the championship.
Note: This analysis is for personal learning purposes only; predictions may differ from actual outcomes and should not be used for any commercial or decision‑making purposes.
Source code is available upon request (e.g., reply “PyDataRoad” to the author’s public account).
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.