Can Machine Learning Predict the World Cup Winner? A Complete Data Science Walkthrough
This article details how to gather FIFA and match data, engineer predictive features, build and evaluate Random Forest and Gradient Boosting models, and simulate the 2022 World Cup to forecast match outcomes and the eventual champion using Python and scikit‑learn.
Qatar World Cup – who will lift the trophy? Let's use machine learning to predict it!
Data Source
To build the machine learning model we need team data. We extract performance information from past matches and use FIFA rankings, both available on Kaggle.
Dataset Construction
We select quantifiable statistics such as goals, average ranking, and points earned, focusing on data that can be collected easily. Only matches from the 2022 World Cup cycle (August 2018 onward) are considered.
import pandas as pd
import re
df = pd.read_csv("games/results.csv") # games between national teams
df["date"] = pd.to_datetime(df["date"])
df = df[(df["date"] >= "2018-8-1")].reset_index(drop=True) # games at the 2022 wc cycle
df_wc = df # pre-wc outcomes
rank = pd.read_csv("fifa_ranking-2022-10-06.csv") # rankings
rank["rank_date"] = pd.to_datetime(rank["rank_date"])
rank = rank[(rank["rank_date"] >= "2018-8-1")].reset_index(drop=True)
rank["country_full"] = rank["country_full"].str.replace("IR Iran", "Iran").str.replace("Korea Republic", "South Korea").str.replace("USA", "United States")
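# resample each team's rankings to daily frequency and carry the latest value forward
# (.ffill() replaces the deprecated fillna(method='ffill'))
rank = rank.set_index(["rank_date"]).groupby(["country_full"], group_keys=False).resample('D').first().ffill().reset_index()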
rank_wc = rank
# Making the merge
df_wc_ranked = df_wc.merge(rank[["country_full", "total_points", "previous_points", "rank", "rank_change", "rank_date"]], left_on=["date", "home_team"], right_on=["rank_date", "country_full"]).drop(["rank_date", "country_full"], axis=1)
df_wc_ranked = df_wc_ranked.merge(rank[["country_full", "total_points", "previous_points", "rank", "rank_change", "rank_date"]], left_on=["date", "away_team"], right_on=["rank_date", "country_full"], suffixes=("_home", "_away")).drop(["rank_date", "country_full"], axis=1)
The above code creates the required dataset.
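To make the join logic easier to follow, here is a minimal sketch on toy data. The team rankings and dates are made up, and the per-team loop is a simplified stand-in for the chained resample above, not the article's exact code. It shows why the rankings are expanded to one row per day: each match date then finds exactly one ranking row for the home team and one for the away team.

```python
import pandas as pd

# Toy data with made-up values, only to illustrate the join logic.
matches = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-05", "2022-01-15"]),
    "home_team": ["Brazil", "England"],
    "away_team": ["England", "Brazil"],
})
rank = pd.DataFrame({
    "rank_date": pd.to_datetime(["2022-01-01", "2022-01-15"] * 2),
    "country_full": ["Brazil", "Brazil", "England", "England"],
    "rank": [2, 1, 4, 5],
})

# Expand each team's ranking history to one row per day, carrying the
# most recently published ranking forward.
daily_parts = []
for team, grp in rank.groupby("country_full"):
    s = grp.set_index("rank_date")["rank"].resample("D").ffill()
    daily_parts.append(s.reset_index().assign(country_full=team))
daily = pd.concat(daily_parts, ignore_index=True)

# Attach the home team's ranking, then the away team's ranking.
merged = (
    matches
    .merge(daily, left_on=["date", "home_team"],
           right_on=["rank_date", "country_full"])
    .drop(columns=["rank_date", "country_full"])
    .merge(daily, left_on=["date", "away_team"],
           right_on=["rank_date", "country_full"],
           suffixes=("_home", "_away"))
    .drop(columns=["rank_date", "country_full"])
)
print(merged)
```

Because `merge` defaults to an inner join, any match whose date falls outside a team's ranking history (or whose team name does not match `country_full`) is silently dropped, which is why the country names are normalized before merging.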
Feature Development
We generate candidate features such as average goals over the World Cup cycle and the last five matches, FIFA ranking differences, points changes, and whether the match is a friendly. These features aim to quantify team strength and recent performance.
Average goals over the World Cup cycle and over the last 5 matches.
Difference in FIFA ranking position between the two teams.
Increase in ranking points over the cycle.
Average points earned weighted by ranking.
Binary variable indicating if the match is a friendly.
These features are used to assess predictive power and decide which to keep.
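A few of the features listed above can be sketched as follows. This is an illustration under assumptions, not the article's feature code: the table layouts, the column names (`goals_mean_cycle`, `goals_mean_l5`, `rank_dif`, `is_friendly`) and all values are hypothetical, and the `tournament` column with the value `"Friendly"` is assumed to exist in the results file.

```python
import pandas as pd

# Hypothetical long-format table: one row per team per match, made-up values.
team_games = pd.DataFrame({
    "team": ["Brazil"] * 7,
    "date": pd.date_range("2022-01-01", periods=7, freq="MS"),
    "goals": [2, 0, 4, 1, 3, 0, 2],
}).sort_values(["team", "date"])

goals = team_games.groupby("team")["goals"]
# shift(1) makes each row see only matches played before it, so the
# feature never leaks the result of the match being predicted.
team_games["goals_mean_cycle"] = goals.transform(
    lambda s: s.shift(1).expanding().mean()
)
team_games["goals_mean_l5"] = goals.transform(
    lambda s: s.shift(1).rolling(5, min_periods=1).mean()
)

# Hypothetical match-level table with rankings already merged in.
matches = pd.DataFrame({
    "tournament": ["Friendly", "FIFA World Cup qualification"],
    "rank_home": [1, 5],
    "rank_away": [4, 2],
})
matches["rank_dif"] = matches["rank_home"] - matches["rank_away"]
matches["is_friendly"] = (matches["tournament"] == "Friendly").astype(int)
```

The first row of each team has no history, so its averages are `NaN`; such rows need to be dropped or imputed before training.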
Data Analysis
Before modeling, we analyze feature distributions using violin and box plots to see how they relate to the target classes (home win vs. home draw or loss). Features such as ranking difference separate the two classes clearly, while others do not.
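As a numeric companion to the plots, class separation can also be checked with a simple group-by. The sketch below uses made-up values and hypothetical feature names, and assumes a target encoding of 1 for a home win and 0 otherwise (the article does not state its encoding). It compares the gap between class means, scaled by each feature's standard deviation.

```python
import pandas as pd

# Made-up values; target = 1 is assumed to mean "home win".
toy = pd.DataFrame({
    "target":      [1, 1, 1, 1, 0, 0, 0, 0],
    "rank_dif":    [-20, -12, -5, -1, 3, 10, 25, 8],
    "is_friendly": [0, 1, 0, 1, 1, 0, 1, 0],
})
features = ["rank_dif", "is_friendly"]

# Mean of each feature within each target class.
summary = toy.groupby("target")[features].mean()

# Standardized gap between class means: larger means better separation.
gap = (summary.loc[1] - summary.loc[0]).abs() / toy[features].std()
print(gap.sort_values(ascending=False))
```

In this toy table `rank_dif` shows a large gap while `is_friendly` shows none, mirroring the kind of contrast the plots are meant to reveal.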
Model
We build two tree‑based models: Random Forest and Gradient Boosting, and compare them using GridSearchCV to tune hyper‑parameters.
import numpy as np  # needed for np.ravel below
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
# separating the target from the features
X = model_db.iloc[:, 3:]
y = model_db[["target"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
gb = GradientBoostingClassifier(random_state=5)
params = {
"learning_rate": [0.01, 0.1, 0.5],
"min_samples_split": [5, 10],
"min_samples_leaf": [3, 5],
"max_depth": [3, 5, 10],
"max_features": ["sqrt"],
"n_estimators": [100, 200]
}
gb_cv = GridSearchCV(gb, params, cv=3, n_jobs=-1, verbose=False)
gb_cv.fit(X_train.values, np.ravel(y_train))
gb = gb_cv.best_estimator_
Similarly, we tune a Random Forest model.
params_rf = {
"max_depth": [20],
"min_samples_split": [5, 10],
"max_leaf_nodes": [175, 200],
"min_samples_leaf": [5, 10],
"n_estimators": [250],
"max_features": ["sqrt"]
}
rf = RandomForestClassifier(random_state=1)
rf_cv = GridSearchCV(rf, params_rf, cv=3, n_jobs=-1, verbose=False)
rf_cv.fit(X_train.values, np.ravel(y_train))
Model performance is evaluated with confusion matrices and ROC curves. The Random Forest shows slightly better accuracy but a higher risk of overfitting; Gradient Boosting offers comparable performance with less overfitting, so we select it for the simulation.
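The evaluation step can be sketched as below. Since the article's `model_db` table is not shown, this example substitutes synthetic data from `make_classification`; the numbers it prints say nothing about the football models. It computes a confusion matrix on the test split and compares ROC AUC on the training and test splits, where a large gap is one common sign of overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature table (assumption).
X, y = make_classification(n_samples=400, n_features=8,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

model = GradientBoostingClassifier(n_estimators=50, random_state=5)
model.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test))

# ROC AUC uses the predicted probability of the positive class.
auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
overfit_gap = auc_train - auc_test

print(cm)
print(f"train AUC={auc_train:.3f}  test AUC={auc_test:.3f}  gap={overfit_gap:.3f}")
```

Running the same comparison for both tuned models gives a like-for-like view of the accuracy versus overfitting trade-off described above.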
World Cup Simulation
Using the trained Gradient Boosting model we simulate each group stage match, predict winners, and compute points. For draws we apply a rule based on double‑simulation (home vs. away and vice‑versa). The simulation proceeds through group stages, knockout rounds, and the final.
advanced_group = []
last_group = ""
# ... (simulation code omitted for brevity) ...
Results show Brazil as the predicted champion, defeating England in the final with a 56% probability. Notable upsets include Belgium beating Germany and England reaching the final.
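Since the simulation code is omitted, here is one possible reading of the double-simulation rule for the group stage, offered as a sketch rather than the article's implementation. The function names, the `STRENGTH` table, and the stand-in `home_win_prob` function are all hypothetical; in practice the probability would come from the trained model's `predict_proba`. Each match is predicted twice, once with each team as "home", and a winner is declared only when one team is favored and the other is not; otherwise the match counts as a draw.

```python
from itertools import combinations

# Hypothetical stand-in for the trained classifier: returns P(home team wins).
STRENGTH = {"Brazil": 3.0, "Serbia": 1.0, "Switzerland": 1.0}

def home_win_prob(home, away):
    return STRENGTH[home] / (STRENGTH[home] + STRENGTH[away])

def simulate_match(team_a, team_b, predict=home_win_prob):
    """Predict the match in both orderings and reconcile the two results."""
    a_favored = predict(team_a, team_b) > 0.5  # A listed as home
    b_favored = predict(team_b, team_a) > 0.5  # B listed as home
    if a_favored and not b_favored:
        return team_a
    if b_favored and not a_favored:
        return team_b
    return "Draw"  # both or neither favored: no clear winner

def group_points(teams):
    """Round-robin group table: 3 points for a win, 1 each for a draw."""
    points = {t: 0 for t in teams}
    for a, b in combinations(teams, 2):
        winner = simulate_match(a, b)
        if winner == "Draw":
            points[a] += 1
            points[b] += 1
        else:
            points[winner] += 3
    return points

table = group_points(["Brazil", "Serbia", "Switzerland"])
print(table)
```

Knockout rounds cannot end in a draw, so a tie-break (for example, advancing the team with the higher averaged win probability) would be needed there; that choice is also an assumption.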
Conclusion
According to our predictions, Brazil will win the championship. Stay tuned to see if the model's forecast holds true!
