
Comparative Study of Classification Algorithms and Calibration Using Synthetic Data

This article presents a comprehensive case study that explains classification principles, shows the key formulas for logistic regression and SVM, and provides a full Python implementation that generates synthetic data, trains multiple classifiers, calibrates them, and visualizes calibration curves and probability histograms.


In this article the author shares a comprehensive case study on classification, covering the underlying principles, mathematical formulas, and a complete Python implementation using synthetic data.

Principle

Classification is the task of predicting labels for given inputs. Common algorithms include logistic regression, naive Bayes, support vector machines (SVM) and random forests. Calibration converts raw classifier scores into well‑calibrated probabilities that reflect true event likelihoods.
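As a minimal sketch of the calibration idea (using scikit-learn's `CalibratedClassifierCV`, which is not part of the article's own code below), an overconfident classifier such as Gaussian naive Bayes can be wrapped so its raw scores are mapped to better-behaved probabilities via Platt scaling:

```python
# Minimal sketch: wrap an uncalibrated classifier in CalibratedClassifierCV,
# which fits a sigmoid mapping from raw scores to probabilities.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = GaussianNB().fit(X_tr, y_tr)
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
calibrated.fit(X_tr, y_tr)

print(raw.predict_proba(X_te[:3])[:, 1])         # often pushed toward 0 or 1
print(calibrated.predict_proba(X_te[:3])[:, 1])  # typically less extreme
```

The wrapped model still classifies the same way; only the probability estimates change.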

Formulas

Logistic regression maps features to output probabilities via the sigmoid function, with model parameters denoted by the weight vector. The linear SVM decision function is defined similarly, with its own set of parameters.
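Written out explicitly (standard textbook forms of the two models described above, with weight vector w and bias b):

```latex
% Logistic regression: sigmoid of a linear score
P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}

% Linear SVM: same linear form, but the prediction is the sign of the score
f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b, \qquad
\hat{y} = \operatorname{sign}\bigl(f(\mathbf{x})\bigr)
```

The key contrast is that the logistic model outputs a probability directly, while the SVM outputs an uncalibrated signed distance, which is why the code below has to rescale its decision function.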

Code Implementation

The following code generates a synthetic dataset, trains several classifiers, calibrates them, and visualizes calibration curves and probability histograms.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibrationDisplay
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.naive_bayes import GaussianNB
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

# Generate a synthetic dataset
X, y = make_classification(
    n_samples=100000, n_features=20, n_informative=2, n_redundant=2, random_state=42
)

train_samples = 100  # number of samples used to train the models
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    shuffle=False,
    test_size=100_000 - train_samples,
)

# Custom NaivelyCalibratedLinearSVC class
class NaivelyCalibratedLinearSVC(LinearSVC):
    """LinearSVC with a predict_proba method that naively scales decision_function output."""
    def fit(self, X, y):
        super().fit(X, y)
        df = self.decision_function(X)
        self.df_min_ = df.min()
        self.df_max_ = df.max()
        return self  # follow the scikit-learn convention of fit returning the estimator

    def predict_proba(self, X):
        """将 decision_function 输出的结果缩放到 [0,1] 区间。"""
        df = self.decision_function(X)
        calibrated_df = (df - self.df_min_) / (self.df_max_ - self.df_min_)
        proba_pos_class = np.clip(calibrated_df, 0, 1)
        proba_neg_class = 1 - proba_pos_class
        proba = np.c_[proba_neg_class, proba_pos_class]
        return proba

# Define the classifiers to compare
lr = LogisticRegressionCV(
    Cs=np.logspace(-6, 6, 101), cv=10, scoring="neg_log_loss", max_iter=1000
)
gnb = GaussianNB()
svc = NaivelyCalibratedLinearSVC(C=1.0)
rfc = RandomForestClassifier(random_state=42)

clf_list = [
    (lr, "Logistic Regression"),
    (gnb, "Naive Bayes"),
    (svc, "SVC"),
    (rfc, "Random forest"),
]

# Plotting
fig = plt.figure(figsize=(10, 10))
gs = GridSpec(4, 2)
colors = plt.get_cmap("Dark2")

ax_calibration_curve = fig.add_subplot(gs[:2, :2])
calibration_displays = {}
markers = ["^", "v", "s", "o"]
for i, (clf, name) in enumerate(clf_list):
    clf.fit(X_train, y_train)
    display = CalibrationDisplay.from_estimator(
        clf,
        X_test,
        y_test,
        n_bins=10,
        name=name,
        ax=ax_calibration_curve,
        color=colors(i),
        marker=markers[i],
    )
    calibration_displays[name] = display

ax_calibration_curve.grid()
ax_calibration_curve.set_title("Calibration plots")

# Add histograms of predicted probabilities
grid_positions = [(2, 0), (2, 1), (3, 0), (3, 1)]
for i, (_, name) in enumerate(clf_list):
    row, col = grid_positions[i]
    ax = fig.add_subplot(gs[row, col])
    ax.hist(
        calibration_displays[name].y_prob,
        range=(0, 1),
        bins=10,
        label=name,
        color=colors(i),
    )
    ax.set(title=name, xlabel="Mean predicted probability", ylabel="Count")

plt.tight_layout()
plt.show()

To recap the steps: the code generates data with make_classification, splits it into a tiny training set (100 samples) and a large test set, defines four classifiers (LogisticRegressionCV, GaussianNB, the custom NaivelyCalibratedLinearSVC, and RandomForestClassifier), fits each model, plots calibration curves with CalibrationDisplay.from_estimator, and visualizes histograms of the predicted probabilities with Matplotlib.
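Beyond the visual curves, calibration can be summarized numerically. As a possible extension (not part of the article's code), the Brier score and log loss condense calibration quality into single numbers, lower being better for both:

```python
# Sketch of a quantitative follow-up: score two of the compared classifiers
# with brier_score_loss and log_loss on a fresh synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, n_features=20, n_informative=2,
                           n_redundant=2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

for model, name in [(LogisticRegression(max_iter=1000), "Logistic Regression"),
                    (GaussianNB(), "Naive Bayes")]:
    y_prob = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(f"{name:20s} Brier: {brier_score_loss(y_te, y_prob):.4f}"
          f"  Log loss: {log_loss(y_te, y_prob):.4f}")
```

The same scoring loop could be applied to the article's clf_list to rank all four models on the held-out test set.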

Tags: machine learning, python, classification, calibration, scikit-learn, synthetic data
Written by

IT Services Circle

Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.
