
Can Machine Learning Predict the World Cup Winner? A Complete Data Science Walkthrough

This article details how to gather FIFA and match data, engineer predictive features, build and evaluate Random Forest and Gradient Boosting models, and simulate the 2022 World Cup to forecast match outcomes and the eventual champion using Python and scikit‑learn.

Python Crawling & Data Mining

Dec 9, 2022

Qatar World Cup – who will lift the trophy? Let's use machine learning to predict it!

Data Source

To build the machine learning model we need team data. We use two datasets, both available on Kaggle: results of past international matches, from which we extract performance information, and the FIFA rankings.

Dataset Construction

We select quantifiable statistics such as goals, average ranking, and points earned, focusing on data that can be collected easily. Only matches from the 2022 World Cup cycle (played on or after 1 August 2018) are considered.

import pandas as pd

df = pd.read_csv("games/results.csv")  # games between national teams
df["date"] = pd.to_datetime(df["date"])
df = df[(df["date"] >= "2018-8-1")].reset_index(drop=True)  # keep only games from the 2022 World Cup cycle

df_wc = df  # pre-wc outcomes

rank = pd.read_csv("fifa_ranking-2022-10-06.csv")  # rankings
rank["rank_date"] = pd.to_datetime(rank["rank_date"])
rank = rank[(rank["rank_date"] >= "2018-8-1")].reset_index(drop=True)
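# align country names with the names used in the match results
rank["country_full"] = (
    rank["country_full"]
    .str.replace("IR Iran", "Iran")
    .str.replace("Korea Republic", "South Korea")
    .str.replace("USA", "United States")
)
# resample each team's rankings to daily frequency and forward-fill,
# so that every match date has a ranking to merge against
rank = (
    rank.set_index(["rank_date"])
    .groupby(["country_full"], group_keys=False)
    .resample("D")
    .first()
    .ffill()
    .reset_index()
)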
rank_wc = rank

# Merge the rankings onto each match twice: first for the home team, then for the away team
df_wc_ranked = df_wc.merge(rank[["country_full", "total_points", "previous_points", "rank", "rank_change", "rank_date"]], left_on=["date", "home_team"], right_on=["rank_date", "country_full"]).drop(["rank_date", "country_full"], axis=1)

df_wc_ranked = df_wc_ranked.merge(rank[["country_full", "total_points", "previous_points", "rank", "rank_change", "rank_date"]], left_on=["date", "away_team"], right_on=["rank_date", "country_full"], suffixes=("_home", "_away")).drop(["rank_date", "country_full"], axis=1)

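The above code creates the required dataset: every match in the cycle, joined with the ranking, points, and rank change of both the home and the away team on the match date.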
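The merge relies on every match date having a ranking row, which is why the rankings are resampled to daily frequency first. As a hedged alternative (not the approach used in the code above), `pandas.merge_asof` can attach the most recent ranking published on or before each match date without resampling. The toy data and the helper `attach_rank` below are hypothetical and serve only as a minimal sketch.

```python
import pandas as pd

# Toy match results (hypothetical data, same column names as results.csv)
matches = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-05", "2022-01-20"]),
    "home_team": ["A", "B"],
    "away_team": ["B", "A"],
})

# Toy rankings, published on two dates (hypothetical data)
rankings = pd.DataFrame({
    "rank_date": pd.to_datetime(
        ["2022-01-01", "2022-01-01", "2022-01-15", "2022-01-15"]
    ),
    "country_full": ["A", "B", "A", "B"],
    "rank": [1, 2, 2, 1],
})


def attach_rank(matches, rankings, side):
    """Attach the latest ranking on or before the match date for one side."""
    merged = pd.merge_asof(
        matches.sort_values("date"),
        rankings.sort_values("rank_date"),
        left_on="date",
        right_on="rank_date",
        left_by=f"{side}_team",
        right_by="country_full",
    )
    return (
        merged.drop(columns=["rank_date", "country_full"])
        .rename(columns={"rank": f"rank_{side}"})
    )


ranked = attach_rank(attach_rank(matches, rankings, "home"), rankings, "away")
print(ranked)
```

Each match ends up with `rank_home` and `rank_away` taken from the latest ranking list published before kick-off.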

Feature Development

We generate candidate features that aim to quantify team strength and recent performance:

Average goals over the World Cup cycle and over the last five matches.

Difference in FIFA ranking position between the two teams.

Ranking points gained over the cycle.

Average points earned, weighted by ranking.

Binary variable indicating whether the match is a friendly.

We then assess the predictive power of each feature and decide which to keep.
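A few of these features can be sketched on a toy table with one row per team per match. The column names (`goals`, `rank`, `rank_opponent`, `tournament`) and the data are assumptions for illustration, not the article's actual `model_db` construction. The `shift()` call keeps each row's averages limited to matches played before it, so the current match's result does not leak into its own features.

```python
import pandas as pd

# Toy data: seven consecutive matches of one team (hypothetical values)
games = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=7, freq="7D"),
    "team": ["A"] * 7,
    "goals": [1, 3, 0, 2, 2, 4, 1],
    "rank": [10] * 7,
    "rank_opponent": [4, 25, 12, 8, 30, 15, 6],
    "tournament": [
        "Friendly", "Friendly",
        "FIFA World Cup qualification", "FIFA World Cup qualification",
        "Friendly", "FIFA World Cup qualification", "Friendly",
    ],
}).sort_values(["team", "date"])

goals_by_team = games.groupby("team")["goals"]

# Average goals over all previous matches in the cycle
games["goals_mean"] = goals_by_team.transform(
    lambda s: s.shift().expanding().mean()
)
# Average goals over the previous five matches
games["goals_mean_l5"] = goals_by_team.transform(
    lambda s: s.shift().rolling(5, min_periods=1).mean()
)
# Ranking position difference (negative means the team is ranked higher)
games["rank_dif"] = games["rank"] - games["rank_opponent"]
# Friendly indicator
games["is_friendly"] = (games["tournament"] == "Friendly").astype(int)

print(games[["date", "goals_mean", "goals_mean_l5", "rank_dif", "is_friendly"]])
```

The first match has no history, so its averages are `NaN`; such rows would need to be dropped or imputed before modeling.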

Data Analysis

Before modeling, we analyze the feature distributions with violin and box plots to see how they relate to the target classes (home win vs. home draw or loss). Some features, such as the ranking difference, separate the classes clearly, while others do not.
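The separation that a violin or box plot shows visually can also be checked numerically by comparing per-class medians. The sketch below uses hypothetical data and assumes `target == 1` means a home win; the encoding used in the actual dataset may differ.

```python
import pandas as pd

# Toy data (hypothetical): target 1 = home win, 0 = home draw or loss
df = pd.DataFrame({
    "target": [1, 1, 1, 0, 0, 0],
    "rank_dif": [-30, -10, -20, 5, 25, 15],
    "is_friendly": [0, 1, 1, 1, 0, 1],
})

# Median of each feature within each target class
summary = df.groupby("target")[["rank_dif", "is_friendly"]].median()

# Absolute gap between class medians: a large gap suggests a useful feature
gap = (summary.loc[1] - summary.loc[0]).abs()
print(summary)
print(gap)
```

In this toy example `rank_dif` shows a clear gap between the classes while `is_friendly` shows none, which mirrors the kind of judgment made from the plots.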

Model

We build two tree-based models, Random Forest and Gradient Boosting, tune the hyper-parameters of each with GridSearchCV, and compare the results.

import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# separating the target from the features
X = model_db.iloc[:, 3:]
y = model_db[["target"]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

gb = GradientBoostingClassifier(random_state=5)
params = {
    "learning_rate": [0.01, 0.1, 0.5],
    "min_samples_split": [5, 10],
    "min_samples_leaf": [3, 5],
    "max_depth": [3, 5, 10],
    "max_features": ["sqrt"],
    "n_estimators": [100, 200]
}

gb_cv = GridSearchCV(gb, params, cv=3, n_jobs=-1, verbose=False)
gb_cv.fit(X_train.values, np.ravel(y_train))

gb = gb_cv.best_estimator_

Similarly we tune a Random Forest model.

params_rf = {
    "max_depth": [20],
    "min_samples_split": [5, 10],
    "max_leaf_nodes": [175, 200],
    "min_samples_leaf": [5, 10],
    "n_estimators": [250],
    "max_features": ["sqrt"]
}

rf = RandomForestClassifier(random_state=1)
rf_cv = GridSearchCV(rf, params_rf, cv=3, n_jobs=-1, verbose=False)
rf_cv.fit(X_train.values, np.ravel(y_train))

rf = rf_cv.best_estimator_

Model performance is evaluated with confusion matrices and ROC curves. The Random Forest is slightly more accurate but shows a higher risk of over-fitting; Gradient Boosting performs comparably with less over-fitting, so we select it for the simulation.
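The evaluation can be sketched as follows. Because `model_db` is not reproduced here, the example uses synthetic data from `make_classification`, and the hyper-parameters are illustrative rather than the tuned values. The gap between training and test accuracy is one simple, assumed proxy for the over-fitting risk mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature table (assumption)
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

model = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=5)
model.fit(X_train, y_train)

# Confusion matrix and ROC AUC on the held-out test set
cm = confusion_matrix(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Train/test accuracy gap as a rough over-fitting indicator
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
overfit_gap = train_acc - test_acc

print(cm)
print(f"AUC={auc:.3f} train={train_acc:.3f} test={test_acc:.3f} gap={overfit_gap:.3f}")
```

Running the same steps for both tuned models gives a side-by-side view: a model with higher test accuracy but a much larger gap is the one more likely to be over-fitting.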

World Cup Simulation

Using the trained Gradient Boosting model, we simulate every group-stage match, predict the winner, and tally the points. Draws are handled by a rule based on double simulation: each match is predicted twice, once with each team in the home slot. The simulation then proceeds from the group stage through the knockout rounds to the final.

advanced_group = []
last_group = ""
# ... (simulation code omitted for brevity) ...
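Since the simulation code is omitted, the following is only a minimal sketch of how a double-simulation rule could work, under stated assumptions: the match is predicted in both orientations and counted as a draw when the two predictions disagree on the winner. The function `home_win_prob` is a hypothetical stand-in for the trained model's `predict_proba`, and the `strength` values and `home_advantage` parameter are invented for illustration.

```python
import itertools
import math


def home_win_prob(home_strength, away_strength):
    """Hypothetical stand-in for the model: logistic in the strength gap."""
    return 1.0 / (1.0 + math.exp(-(home_strength - away_strength)))


def simulate_match(team_a, team_b, strength, home_advantage=0.3):
    """Predict the match twice, swapping the home slot; None means a draw."""
    p_a_home = home_win_prob(strength[team_a] + home_advantage, strength[team_b])
    p_b_home = home_win_prob(strength[team_b] + home_advantage, strength[team_a])
    pick_with_a_home = team_a if p_a_home >= 0.5 else team_b
    pick_with_b_home = team_b if p_b_home >= 0.5 else team_a
    if pick_with_a_home != pick_with_b_home:
        return None  # the two orientations disagree: treat as a draw
    return pick_with_a_home


def group_table(teams, strength):
    """Play a round robin and award 3 points per win, 1 per draw."""
    points = {team: 0 for team in teams}
    for team_a, team_b in itertools.combinations(teams, 2):
        winner = simulate_match(team_a, team_b, strength)
        if winner is None:
            points[team_a] += 1
            points[team_b] += 1
        else:
            points[winner] += 3
    return points


# Invented strengths for a four-team group
strength = {"A": 2.0, "B": 1.9, "C": 0.5, "D": 0.0}
points = group_table(list(strength), strength)
print(points)
```

Here the closely matched teams A and B draw with each other because each is predicted to win only when placed in the home slot, while clear mismatches produce the same winner in both orientations.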

The simulation predicts Brazil as champion, defeating England in the final with a 56% probability. Notable upsets include Belgium beating Germany and England reaching the final.

Conclusion

According to our predictions, Brazil will win the championship. Stay tuned to see if the model's forecast holds true!
