Master Random Forest: From Bagging Theory to Python Implementation
This article explains the fundamentals of ensemble learning and bagging, details the random forest algorithm, answers common questions, and provides a complete Python walkthrough—including data exploration, decision‑tree baseline, random‑forest modeling with grid‑search tuning, and practical insights for handling imbalanced and missing data.
Preface
Ensemble learning, of which bagging is one major family, can sometimes outperform deep learning in finance and other non‑image domains. The goal of this article is to understand the basic principles and to apply Python code to a real‑world business case: predicting broadband customer churn with a random‑forest model.
Detailed principle introduction
Python code practice
Ensemble Learning
The focus is on the random forest, which belongs to the bagging family. Bagging works by placing multiple models into a single "bag" and letting the bag act as a new model that makes predictions by majority vote.
For example, with 100,000 records and ten decision trees, each tree receives a different subset of the data. When a new record is evaluated, the ten trees each output a prediction; the final prediction is the proportion of trees voting for a class (e.g., 3 out of 10 trees predict 0, so the probability of class 0 is 0.3).
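The vote-counting step in this example can be sketched in a few lines. The ten votes below are made-up values chosen to match the 3-out-of-10 case; they are not output from a trained model.

```python
import numpy as np

# Hypothetical votes from ten trees for one new record (illustrative values only).
votes = np.array([0, 1, 1, 0, 1, 1, 1, 0, 1, 1])

# Share of trees voting for each class.
p_class0 = np.mean(votes == 0)
p_class1 = np.mean(votes == 1)

# Majority vote: the class with the most votes wins.
majority = int(np.bincount(votes).argmax())

print(p_class0, p_class1, majority)
```

Reporting the vote share rather than only the winning class keeps a usable score for ranking customers by churn risk.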
Common questions about bagging:
Q: How many samples should each model use? A: With ten trees and 100,000 records, each tree should use at least 10,000 samples (1/n of the data). In practice, a sampling ratio between 1/n and 0.8 works well; training every model on the identical full dataset defeats the purpose of bagging, because the models would no longer differ from one another.
Q: Does correlation between models affect the final decision? A: Models should be as uncorrelated as possible, which is mainly achieved by using different training samples for each model.
Q: If each model is over‑fitted, is that a problem? A: Over‑fitting of individual models is acceptable because the ensemble averages their predictions, reducing overall over‑fitting.
Q: Why sample rows and columns randomly? A: Random sampling of both rows and columns (features) ensures each model sees a different view of the data, similar to drawing diverse voters in an election.
Q: Are all trees weighted equally? A: Yes; each tree contributes 1/number_of_trees to the final vote. Different weights belong to boosting methods such as AdaBoost.
Q: Does adding more models or using a larger data proportion always improve performance? A: More models generally help, but the optimal data proportion depends on the dataset and algorithm specifics.
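The answers above (different rows and columns per model, equal weights, a vote at the end) can be put together in a small hand-rolled bagging loop. This is a minimal sketch on synthetic data, assuming a row ratio of 0.5 and 5 of 10 columns per tree; these values and all variable names are illustrative choices, not taken from the article. It also samples columns once per tree, as described above, whereas scikit-learn's random forest draws a fresh feature subset at every split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

n_trees = 10
row_ratio = 0.5   # assumed: each tree sees half of the rows
n_cols = 5        # assumed: each tree sees 5 of the 10 features

models = []
for _ in range(n_trees):
    # Draw a different random view of the data for every tree.
    rows = rng.choice(len(X), size=int(row_ratio * len(X)), replace=False)
    cols = rng.choice(X.shape[1], size=n_cols, replace=False)
    tree = DecisionTreeClassifier(random_state=0).fit(X[rows][:, cols], y[rows])
    models.append((tree, cols))

# Equal-weight vote: every tree contributes 1/n_trees.
votes = np.array([tree.predict(X[:, cols]) for tree, cols in models])
p_class1 = votes.mean(axis=0)            # share of trees voting for class 1
y_pred = (p_class1 >= 0.5).astype(int)   # ties are resolved in favor of class 1

print(votes.shape, y_pred[:10])
```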
Advantages of Bagging
Typically higher accuracy than a single classifier.
Robust to noise.
Less prone to over‑fitting.
Random Forest
Random forest inherits the bagging process and adds specific hyper‑parameters such as the number of trees (n_estimators) and the number or fraction of features considered when searching for each split (max_features).
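As a rough illustration of these two hyper‑parameters in scikit-learn, the sketch below fits a small forest on synthetic data. The dataset and the values 15 and 0.5 are assumptions made for the example. In scikit-learn, a float max_features is read as a fraction of the features, and the subset is redrawn at every split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 10 features (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rf = RandomForestClassifier(
    n_estimators=15,   # number of trees in the forest
    max_features=0.5,  # a float is read as a fraction of the features
    random_state=0,
)
rf.fit(X, y)

n_trees = len(rf.estimators_)                        # one fitted tree per estimator
per_split_features = rf.estimators_[0].max_features_  # resolved feature count per split

print(n_trees, per_split_features)
```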
It works well for imbalanced data and datasets with many missing values, making it suitable for financial churn prediction.
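One common way to account for class imbalance explicitly in scikit-learn is to stratify the train/test split and to pass class_weight='balanced' to the forest. The article's own code uses neither option, so the sketch below, on synthetic data with an assumed 80/20 class ratio, is only a possible variation.

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced churn table (roughly 80% / 20%).
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)

# stratify=y keeps the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)

# class_weight='balanced' weights classes inversely to their frequency.
rf = RandomForestClassifier(n_estimators=15, class_weight='balanced',
                            random_state=0)
rf.fit(X_train, y_train)

print(Counter(y_train), Counter(y_test))
```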
Python Practice
Data Exploration
The target variable is broadband (0 = churn, 1 = stay). The dataset contains many features; only the target is examined for class imbalance.
import pandas as pd
import numpy as np
df = pd.read_csv('broadband.csv') # broadband customer data
print(df.head())
df.info() # info() prints its summary itself and returns None
Class distribution:
from collections import Counter
print('Broadband:', Counter(df['broadband']))
# Output: Broadband: Counter({0: 908, 1: 206}), noticeably imbalanced.
Split data (dropping id and cust_id columns):
y = df['broadband']
X = df.iloc[:, 1:-1]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=12345)
Decision‑Tree Baseline
import sklearn.tree as tree
from sklearn.model_selection import GridSearchCV
param_grid = {
    'criterion': ['entropy', 'gini'],
    'max_depth': [2, 3, 4, 5, 6, 7, 8],
    'min_samples_split': [4, 8, 12, 16, 20, 24, 28]
}
clf = tree.DecisionTreeClassifier()
clfcv = GridSearchCV(estimator=clf, param_grid=param_grid,
                     scoring='roc_auc', cv=4)
clfcv.fit(X_train, y_train)
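# Note: predict() returns hard 0/1 labels, so the ROC curve below has a single
# operating point; predict_proba(X_test)[:, 1] would give a score-based AUC.
test_est = clfcv.predict(X_test)
import sklearn.metrics as metrics
print('Decision Tree AUC:')
fpr, tpr, _ = metrics.roc_curve(y_test, test_est)
print('AUC = %.4f' % metrics.auc(fpr, tpr))
The decision‑tree model achieves modest performance (AUC just above 0.5), indicating room for improvement.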
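Because the test AUC here is computed from predict(), which returns hard 0/1 labels, it can differ from an AUC computed from predicted probabilities. The sketch below compares the two on synthetic data, not on the broadband dataset; the dataset and the max_depth=4 setting are assumptions for illustration, and the printed numbers say nothing about the article's results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# AUC from hard 0/1 labels: the ROC curve has only one interior point.
auc_labels = roc_auc_score(y_test, clf.predict(X_test))

# AUC from predicted probabilities of class 1: uses the full ranking.
auc_scores = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

print('AUC from labels: %.4f' % auc_labels)
print('AUC from scores: %.4f' % auc_scores)
```

GridSearchCV with scoring='roc_auc' already scores candidates on predicted scores, so evaluating the test set the same way keeps the two numbers comparable.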
Random‑Forest Modeling
from sklearn import ensemble
param_grid = {
    'criterion': ['entropy', 'gini'],
    'max_depth': [5, 6, 7, 8],
    'n_estimators': [11, 13, 15],
    'max_features': [0.3, 0.4, 0.5],
    'min_samples_split': [4, 8, 12, 16]
}
rf = ensemble.RandomForestClassifier()
rf_cv = GridSearchCV(estimator=rf, param_grid=param_grid,
                     scoring='roc_auc', cv=4)
rf_cv.fit(X_train, y_train)
test_est = rf_cv.predict(X_test) # hard labels, as in the decision-tree baseline
print('Random Forest AUC:')
fpr, tpr, _ = metrics.roc_curve(y_test, test_est)
print('AUC = %.4f' % metrics.auc(fpr, tpr))
After tuning, the random‑forest model shows a substantial AUC improvement over the single decision tree.
For further tuning, check the initial grid‑search results: when the best value found for max_depth, n_estimators, or max_features sits at the edge of its search range, expand that range and search again.
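A fitted GridSearchCV object exposes best_params_ and best_score_, which is enough to see whether a search range deserves widening. The sketch below runs a reduced grid on synthetic data (both are assumptions, chosen to keep the example fast) and flags any parameter whose best value is the smallest or largest one tried.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a reduced grid (illustrative only).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

param_grid = {
    'max_depth': [5, 6, 7, 8],
    'n_estimators': [11, 13, 15],
}
rf_cv = GridSearchCV(RandomForestClassifier(random_state=0),
                     param_grid=param_grid, scoring='roc_auc', cv=4)
rf_cv.fit(X, y)

print('Best params:', rf_cv.best_params_)
print('Best CV AUC: %.4f' % rf_cv.best_score_)

# Flag parameters whose best value is the smallest or largest one tried.
edge_params = []
for name, values in param_grid.items():
    best = rf_cv.best_params_[name]
    if best in (min(values), max(values)):
        edge_params.append(name)
        print('%s=%s is on the edge of its range; consider widening it' % (name, best))
```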
Adjusted Parameter Grid
param_grid = {
    'criterion': ['entropy', 'gini'],
    'max_depth': [7, 8, 10, 12],
    'n_estimators': [11, 13, 15, 17, 19],
    'max_features': [0.4, 0.5, 0.6, 0.7],
    'min_samples_split': [2, 3, 4, 8, 12, 16]
}
Re‑running the model with the expanded grid yields further performance improvements, confirming that careful hyper‑parameter selection is essential.
Conclusion
Random forest is a classic ensemble method with simple theory, elegant implementation, and strong practical performance. It works well for imbalanced or heavily missing data and is applicable beyond finance to any domain where data quality is a challenge.
