
Can Python Predict the 2018 World Cup Champion? A Data‑Driven Analysis

This article demonstrates how to use Python, pandas, and Jupyter Notebook to explore a comprehensive World Cup dataset, clean and enrich the data, visualize win and goal statistics for all teams, and finally predict the top three contenders for the 2018 tournament.

Efficient Ops

1. Retrieve all World Cup match data (excluding qualifiers)

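<ol><li><code>import pandas as pd</code></li><li><code>import matplotlib.pyplot as plt</code></li><li><code>%matplotlib inline</code></li><li><code>plt.style.use('ggplot')</code></li><li><code>df = pd.read_csv('results.csv')</code></li><li><code># Keep World Cup finals matches only (assumption: the match-type column is named 'tournament' and finals rows are labelled 'FIFA World Cup')</code></li><li><code>df_FIFA = df[df['tournament'] == 'FIFA World Cup'].copy()</code></li><li><code>df_FIFA.head()</code></li></ol>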

The dataset contains columns such as date, home team, away team, home goals (excluding penalties), away goals, match type, city, country, and whether the match was on neutral ground.
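
As a minimal sketch of the "excluding qualifiers" step, the snippet below filters a tiny in-memory table on the match-type column. The rows are made up for illustration, and the column name `tournament` and the label `FIFA World Cup` are assumptions about the file layout rather than something the article states.

```python
import io
import pandas as pd

# Toy rows in the column layout described above.
# Teams, dates and scores are made up; `tournament` is an assumed column name.
csv_text = """date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
2001-06-01,Team A,Team B,2,1,FIFA World Cup,City X,Country X,True
2001-06-02,Team C,Team A,0,0,FIFA World Cup qualification,City Y,Country Y,False
2001-06-03,Team B,Team C,1,3,Friendly,City Z,Country Z,False
"""

df = pd.read_csv(io.StringIO(csv_text))

# An exact match keeps finals matches and drops qualifiers and friendlies.
# .copy() makes df_FIFA an independent frame, so adding columns to it later
# does not raise SettingWithCopyWarning.
df_FIFA = df[df['tournament'] == 'FIFA World Cup'].copy()

print(df_FIFA[['date', 'home_team', 'away_team', 'tournament']])
```

An exact comparison is used here instead of a substring match because a substring such as "FIFA World Cup" would also match the qualification rows.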

Environment

Windows 7

Python 3.6

Jupyter Notebook

pandas 0.22.0

First, convert the date column to datetime and create additional columns:

<ol><li><code>df_FIFA['date'] = pd.to_datetime(df_FIFA['date'])</code></li><li><code>df_FIFA['year'] = df_FIFA['date'].dt.year</code></li><li><code>df_FIFA['diff_score'] = df_FIFA['home_score'] - df_FIFA['away_score']</code></li><li><code>df_FIFA['diff_score'] = pd.to_numeric(df_FIFA['diff_score'])</code></li><li><code>df_FIFA['win_team'] = ''</code></li></ol>

Determine the winning team:

<ol><li><code># Method 1</code></li><li><code>df_FIFA.loc[df_FIFA['diff_score']>0, 'win_team'] = df_FIFA['home_team']</code></li><li><code>df_FIFA.loc[df_FIFA['diff_score']<0, 'win_team'] = df_FIFA['away_team']</code></li><li><code>df_FIFA.loc[df_FIFA['diff_score']==0, 'win_team'] = 'Draw'</code></li></ol>
<ol><li><code># Method 2: row-by-row loop; same result as Method 1, but slower on large frames</code></li><li><code>def find_win_team(df):</code></li><li><code>    winners = []</code></li><li><code>    for i, row in df.iterrows():</code></li><li><code>        if row['home_score'] > row['away_score']:</code></li><li><code>            winners.append(row['home_team'])</code></li><li><code>        elif row['home_score'] < row['away_score']:</code></li><li><code>            winners.append(row['away_team'])</code></li><li><code>        else:</code></li><li><code>            winners.append('Draw')</code></li><li><code>    return winners</code></li><li><code>df_FIFA['win_team'] = find_win_team(df_FIFA)</code></li></ol>
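
A third option, not used in the article, is `numpy.select`, which evaluates both conditions in one vectorized call. The sketch below runs on made-up matches; only the column names are taken from the code above.

```python
import numpy as np
import pandas as pd

# Made-up matches, for illustration only.
toy = pd.DataFrame({
    'home_team': ['Team A', 'Team B', 'Team C'],
    'away_team': ['Team B', 'Team C', 'Team A'],
    'home_score': [2, 1, 0],
    'away_score': [1, 1, 3],
})

# Conditions are checked in order; the first one that holds picks the value.
conditions = [
    toy['home_score'] > toy['away_score'],   # home side won
    toy['home_score'] < toy['away_score'],   # away side won
]
choices = [toy['home_team'], toy['away_team']]

# Rows matching neither condition (equal scores) fall back to 'Draw'.
toy['win_team'] = np.select(conditions, choices, default='Draw')

print(toy['win_team'].tolist())
```

This should give the same labels as Method 1 while keeping the three outcomes in a single expression.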

2. Top‑20 winners of all World Cup matches

Group by winning team and count victories:

<code>s = df_FIFA.groupby('win_team')['win_team'].count()
s.sort_values(ascending=False, inplace=True)
s.drop(labels=['Draw'], inplace=True)
s.head(20).plot(kind='bar', figsize=(10,6), title='Top 20 Winners of World Cup')
</code>

Horizontal bar chart:

<code>s.sort_values(ascending=True, inplace=True)
s.tail(20).plot(kind='barh', figsize=(10,6), title='Top 20 Winners of World Cup')
</code>

Pie chart of the same data:

<code>s_percentage = s / s.sum()
s_percentage.tail(20).plot(kind='pie', figsize=(10,10), autopct='%.1f%%', startangle=173, title='Top 20 Winners of World Cup')
</code>
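
Note that `s / s.sum()` expresses each team's wins as a share of all wins, so the 20 plotted values sum to less than 1. The article does not say which denominator is intended; the sketch below, on made-up counts, contrasts the two choices so the difference is visible.

```python
import pandas as pd

# Made-up win counts, sorted ascending like `s` after the bar-chart step.
s = pd.Series({'Team D': 1, 'Team C': 2, 'Team B': 3, 'Team A': 4})

top = s.tail(2)                  # the two largest values
share_of_all = top / s.sum()     # share of every team's wins combined
share_of_top = top / top.sum()   # share among the selected teams only

print(share_of_all.round(3).to_dict())
print(share_of_top.round(3).to_dict())
```

If the slices are meant to be read as "share among the top 20", dividing by the sum of the selected teams is the version that adds up to 100%.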
Analysis conclusion 1: Based on win counts, Brazil, Germany, Italy and Argentina are the strongest teams.

3. Goal totals by country

<code>df_score_home = df_FIFA[['home_team','home_score']]
df_score_home.columns = ['team','score']
df_score_away = df_FIFA[['away_team','away_score']]
df_score_away.columns = ['team','score']
df_score = pd.concat([df_score_home, df_score_away], ignore_index=True)
s_score = df_score.groupby('team')['score'].sum()
# Sort ascending so that tail(20) selects the 20 highest totals
s_score.sort_values(ascending=True, inplace=True)
s_score.tail(20).plot(kind='barh', figsize=(10,6), title='Top 20 in Total Scores of World Cup')
</code>
Analysis conclusion 2: Based on total goals, Germany, Brazil, Argentina and Italy lead.
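
The same totals can also be obtained without renaming and concatenating, by summing home and away goals separately and adding the two series. This is an alternative sketch on made-up matches, not the article's method.

```python
import pandas as pd

# Made-up matches, for illustration only.
toy = pd.DataFrame({
    'home_team': ['Team A', 'Team B', 'Team A'],
    'away_team': ['Team B', 'Team C', 'Team C'],
    'home_score': [2, 1, 0],
    'away_score': [1, 1, 3],
})

home_goals = toy.groupby('home_team')['home_score'].sum()
away_goals = toy.groupby('away_team')['away_score'].sum()

# fill_value=0 keeps teams that appear only at home or only away;
# without it their totals would become NaN.
total_goals = home_goals.add(away_goals, fill_value=0).astype(int)
total_goals = total_goals.sort_values(ascending=False)

print(total_goals.to_dict())
```

In this toy table Team A never plays away and Team C never plays at home, which is the case `fill_value=0` is there to handle.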

4. 2018 World Cup – 32‑team analysis

Identify first‑time participants:

<code>team_list = ['Russia','Germany','Brazil','Portugal','Argentina','Belgium','Poland','France','Spain','Peru','Switzerland','England','Colombia','Mexico','Uruguay','Croatia','Denmark','Iceland','Costa Rica','Sweden','Tunisia','Egypt','Senegal','Iran','Serbia','Nigeria','Australia','Japan','Morocco','Panama','Korea Republic','Saudi Arabia']
for item in team_list:
    if item not in s_score.index:
        print(item)
# Output: Iceland, Panama
</code>

Thus Iceland and Panama are debutants.
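
The check relies on exact string matches, so a team spelled differently in the list and in the dataset would be reported as a debutant by mistake. A minimal sketch of the same membership test, using hypothetical team names:

```python
# Hypothetical names, for illustration only.
qualified = ['Team A', 'Team B', 'Team E']
teams_with_history = {'Team A', 'Team B', 'Team C'}

# A set makes each membership test cheap; the list comprehension
# preserves the order of `qualified`.
debutants = [team for team in qualified if team not in teams_with_history]

print(debutants)  # ['Team E']
```

Before trusting the result, it may be worth comparing any reported name against the dataset's own spelling.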

<code>df_top32 = df_FIFA[(df_FIFA['home_team'].isin(team_list)) & (df_FIFA['away_team'].isin(team_list))]
</code>

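4.1 Win counts for the 32 teams (since 1872)

Here 1872 is the first year covered by the source file; the World Cup itself was first played in 1930, so "since 1872" effectively means "across all editions".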

<code>s_32 = df_top32.groupby('win_team')['win_team'].count()
s_32.drop(labels=['Draw'], inplace=True)
# Ascending order, so the largest bar is drawn at the top of the horizontal chart
s_32.sort_values(ascending=True, inplace=True)
s_32.plot(kind='barh', figsize=(8,12), title='Top 32 of World Cup since year 1872')
</code>

4.2 Goal totals for the 32 teams (since 1872)

<code>df_score_home_32 = df_top32[['home_team','home_score']]
df_score_home_32.columns = ['team','score']
df_score_away_32 = df_top32[['away_team','away_score']]
df_score_away_32.columns = ['team','score']
df_score_32 = pd.concat([df_score_home_32, df_score_away_32], ignore_index=True)
s_score_32 = df_score_32.groupby('team')['score'].sum()
# Ascending order, so the highest total is drawn at the top, matching the win-count chart
s_score_32.sort_values(ascending=True, inplace=True)
s_score_32.plot(kind='barh', figsize=(8,12), title='Top 32 in Total Scores of World Cup since year 1872')
</code>
Analysis conclusion 3: Since 1872, Germany, Brazil and Argentina are the strongest among the 32‑team pool, both in wins and goals.

4.3 Since 1978 (last 10 editions)

Win‑count chart:

Goal‑total chart:

Analysis conclusion 4: Since 1978, Argentina, Germany and Brazil lead in wins; the same three lead in goals, with Germany showing a clearer advantage.
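
The code behind the period-specific charts is not shown. One plausible reading is that the earlier groupby steps are repeated on matches from a given year onward. The sketch below assumes that, using the `year` and `win_team` columns created earlier; the helper name `wins_and_goals_since` is hypothetical and the rows are made up.

```python
import pandas as pd

def wins_and_goals_since(df, start_year):
    """Return (win counts, goal totals) for matches from start_year onward."""
    recent = df[df['year'] >= start_year]

    # Count wins per team, leaving out drawn matches.
    wins = recent.loc[recent['win_team'] != 'Draw', 'win_team'].value_counts()

    # Stack home and away goals into one (team, score) table, then sum per team.
    home = recent[['home_team', 'home_score']].rename(
        columns={'home_team': 'team', 'home_score': 'score'})
    away = recent[['away_team', 'away_score']].rename(
        columns={'away_team': 'team', 'away_score': 'score'})
    goals = pd.concat([home, away], ignore_index=True).groupby('team')['score'].sum()

    return wins, goals.sort_values(ascending=False)

# Made-up rows standing in for df_top32.
toy = pd.DataFrame({
    'year': [1974, 1978, 1982, 2002],
    'home_team': ['Team A', 'Team A', 'Team B', 'Team C'],
    'away_team': ['Team B', 'Team C', 'Team C', 'Team A'],
    'home_score': [3, 1, 2, 0],
    'away_score': [0, 1, 1, 2],
    'win_team': ['Team A', 'Draw', 'Team B', 'Team A'],
})

wins, goals = wins_and_goals_since(toy, 1978)
print(wins.to_dict())
print(goals.to_dict())
```

Calling the helper with 2002 instead of 1978 would give the inputs for the shorter window in the same way.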

4.4 Since 2002 (last 4 editions)

Win‑count chart:

Goal‑total chart:

Analysis conclusion 5: Since 2002, Germany, Argentina and Brazil dominate both win counts and goal totals, with Germany’s advantage most pronounced.

5. Overall conclusion

Based on historical World Cup data, the predicted top three for the 2018 tournament are Germany, Argentina and Brazil, with Germany being the strongest candidate for the championship.

Special note: This analysis is for personal learning purposes only; predictions may differ from actual outcomes and should not be used for other purposes.
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
