Comparative Study of Classification Algorithms and Calibration Using Synthetic Data
This article presents a comprehensive case study of classification: the underlying principles, the key formulas for logistic regression and linear SVM, and a complete Python implementation that generates synthetic data, trains several classifiers, calibrates them, and visualizes calibration curves and probability histograms.
Principle
Classification is the task of predicting labels for given inputs. Common algorithms include logistic regression, naive Bayes, support vector machines (SVM) and random forests. Calibration converts raw classifier scores into well‑calibrated probabilities that reflect true event likelihoods.
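As a minimal sketch of that last idea (not part of the article's own code), scikit-learn's CalibratedClassifierCV can wrap a score-based classifier such as LinearSVC and map its raw decision scores to probabilities via Platt scaling (method="sigmoid"); the toy dataset here is illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

X, y = make_classification(n_samples=500, random_state=0)

# LinearSVC has no predict_proba of its own; the wrapper fits a
# sigmoid on held-out decision scores (5-fold cross-validation)
# to turn them into probabilities.
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
calibrated.fit(X, y)

proba = calibrated.predict_proba(X[:3])
print(proba.shape)                           # (3, 2): one column per class
print(np.allclose(proba.sum(axis=1), 1.0))   # True: rows sum to 1
```

method="isotonic" is the non-parametric alternative; it is more flexible but needs more data to avoid overfitting.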
Formulas
Logistic regression maps a feature vector x to a probability through the sigmoid function: p(y = 1 | x) = σ(wᵀx + b) = 1 / (1 + exp(−(wᵀx + b))), where w is the weight vector and b is the bias. The linear SVM uses the same linear form as its decision function, f(x) = wᵀx + b, but predicts the label from the sign of f(x); the raw score is a signed distance to the separating hyperplane, not a probability, which is why calibration is needed.
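The sigmoid formula can be checked numerically against a fitted scikit-learn model; a small sketch on toy data (not the article's dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)

# Recompute p(y = 1 | x) = sigmoid(w.x + b) from the fitted parameters
z = X @ clf.coef_.ravel() + clf.intercept_[0]
manual = 1.0 / (1.0 + np.exp(-z))

# Matches the library's predicted probabilities for the positive class
print(np.allclose(manual, clf.predict_proba(X)[:, 1]))  # True
```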
Code Implementation
The following code generates a synthetic dataset, trains several classifiers, calibrates them, and visualizes calibration curves and probability histograms.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibrationDisplay
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.naive_bayes import GaussianNB
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
# Generate the synthetic dataset
X, y = make_classification(
    n_samples=100000, n_features=20, n_informative=2, n_redundant=2, random_state=42
)
train_samples = 100  # number of samples used to train the models
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    shuffle=False,
    test_size=100_000 - train_samples,
)
# Custom NaivelyCalibratedLinearSVC class
class NaivelyCalibratedLinearSVC(LinearSVC):
    """LinearSVC with a predict_proba method that naively min-max scales
    the decision_function output."""

    def fit(self, X, y):
        super().fit(X, y)
        df = self.decision_function(X)
        self.df_min_ = df.min()
        self.df_max_ = df.max()
        return self

    def predict_proba(self, X):
        """Scale the decision_function output to the [0, 1] interval."""
        df = self.decision_function(X)
        calibrated_df = (df - self.df_min_) / (self.df_max_ - self.df_min_)
        proba_pos_class = np.clip(calibrated_df, 0, 1)
        proba_neg_class = 1 - proba_pos_class
        proba = np.c_[proba_neg_class, proba_pos_class]
        return proba
# Define the classifiers to compare
lr = LogisticRegressionCV(
    Cs=np.logspace(-6, 6, 101), cv=10, scoring="neg_log_loss", max_iter=1000
)
gnb = GaussianNB()
svc = NaivelyCalibratedLinearSVC(C=1.0)
rfc = RandomForestClassifier(random_state=42)
clf_list = [
    (lr, "Logistic Regression"),
    (gnb, "Naive Bayes"),
    (svc, "SVC"),
    (rfc, "Random forest"),
]
# Plot the calibration curves
fig = plt.figure(figsize=(10, 10))
gs = GridSpec(4, 2)
colors = plt.get_cmap("Dark2")
ax_calibration_curve = fig.add_subplot(gs[:2, :2])
calibration_displays = {}
markers = ["^", "v", "s", "o"]
for i, (clf, name) in enumerate(clf_list):
    clf.fit(X_train, y_train)
    display = CalibrationDisplay.from_estimator(
        clf,
        X_test,
        y_test,
        n_bins=10,
        name=name,
        ax=ax_calibration_curve,
        color=colors(i),
        marker=markers[i],
    )
    calibration_displays[name] = display
ax_calibration_curve.grid()
ax_calibration_curve.set_title("Calibration plots")
# Add histograms of the predicted probabilities
grid_positions = [(2, 0), (2, 1), (3, 0), (3, 1)]
for i, (_, name) in enumerate(clf_list):
    row, col = grid_positions[i]
    ax = fig.add_subplot(gs[row, col])
    ax.hist(
        calibration_displays[name].y_prob,
        range=(0, 1),
        bins=10,
        label=name,
        color=colors(i),
    )
    ax.set(title=name, xlabel="Mean predicted probability", ylabel="Count")
plt.tight_layout()
plt.show()

The code above generates a 100,000-sample dataset with make_classification, splits it into a tiny training set (100 samples) and a large test set, defines four classifiers (Logistic Regression, GaussianNB, a custom NaivelyCalibratedLinearSVC, and RandomForest), fits each model, plots calibration curves with CalibrationDisplay.from_estimator, and finally visualizes the predicted-probability histograms with Matplotlib.
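Beyond the visual check, calibration quality can also be summarized numerically. The following sketch (with smaller sample sizes than the article, chosen only for speed) uses the Brier score, the mean squared error between predicted probabilities and the 0/1 outcomes; lower is better:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=5000, n_features=20,
                           n_informative=2, n_redundant=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scores = {}
for clf, name in [(LogisticRegression(max_iter=1000), "LogReg"),
                  (GaussianNB(), "NB")]:
    clf.fit(X_train, y_train)
    p = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
    scores[name] = brier_score_loss(y_test, p)
    print(f"{name}: Brier score = {scores[name]:.3f}")
```

A reliability curve that hugs the diagonal generally goes hand in hand with a low Brier score, so the two diagnostics complement each other.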
IT Services Circle