
Comparative Study of Classification Algorithms and Calibration Using Synthetic Data

This article presents a comprehensive case study that explains classification principles, shows the key formulas for logistic regression and SVM, and provides a full Python implementation that generates synthetic data, trains multiple classifiers, calibrates them, and visualizes calibration curves and probability histograms.


In this article the author shares a comprehensive case study on classification, covering the underlying principles, mathematical formulas, and a complete Python implementation using synthetic data.

Principle

Classification is the task of predicting labels for given inputs. Common algorithms include logistic regression, naive Bayes, support vector machines (SVM) and random forests. Calibration converts raw classifier scores into well‑calibrated probabilities that reflect true event likelihoods.
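As a minimal sketch of the calibration idea (using scikit-learn's `CalibratedClassifierCV`, which is not part of the article's own code below), an overconfident classifier such as Gaussian naive Bayes can be wrapped so its raw scores are mapped to better-behaved probabilities via Platt scaling:

```python
# Minimal sketch: wrap an uncalibrated classifier in CalibratedClassifierCV,
# which fits a sigmoid mapping from raw scores to probabilities.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = GaussianNB().fit(X_tr, y_tr)
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
calibrated.fit(X_tr, y_tr)

print(raw.predict_proba(X_te[:3])[:, 1])         # often pushed toward 0 or 1
print(calibrated.predict_proba(X_te[:3])[:, 1])  # typically less extreme
```

The wrapped model still classifies the same way; only the probability estimates change.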

Formulas

Logistic regression maps features to output probabilities via the sigmoid function, with model parameters denoted by the weight vector. The linear SVM decision function is defined similarly, with its own set of parameters.
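Written out explicitly (standard textbook forms of the two models described above, with weight vector w and bias b):

```latex
% Logistic regression: sigmoid of a linear score
P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}

% Linear SVM: same linear form, but the prediction is the sign of the score
f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b, \qquad
\hat{y} = \operatorname{sign}\bigl(f(\mathbf{x})\bigr)
```

The key contrast is that the logistic model outputs a probability directly, while the SVM outputs an uncalibrated signed distance, which is why the code below has to rescale its decision function.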

Code Implementation

The following code generates a synthetic dataset, trains several classifiers, calibrates them, and visualizes calibration curves and probability histograms.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibrationDisplay
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.naive_bayes import GaussianNB
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

# Generate a synthetic dataset
X, y = make_classification(
    n_samples=100000, n_features=20, n_informative=2, n_redundant=2, random_state=42
)

train_samples = 100  # number of samples used to train the models
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    shuffle=False,
    test_size=100_000 - train_samples,
)

# Custom NaivelyCalibratedLinearSVC class
class NaivelyCalibratedLinearSVC(LinearSVC):
    """LinearSVC with a predict_proba method that naively scales decision_function output."""
    def fit(self, X, y):
        super().fit(X, y)
        df = self.decision_function(X)
        self.df_min_ = df.min()
        self.df_max_ = df.max()
        return self  # follow the scikit-learn convention of fit returning the estimator

    def predict_proba(self, X):
        """将 decision_function 输出的结果缩放到 [0,1] 区间。"""
        df = self.decision_function(X)
        calibrated_df = (df - self.df_min_) / (self.df_max_ - self.df_min_)
        proba_pos_class = np.clip(calibrated_df, 0, 1)
        proba_neg_class = 1 - proba_pos_class
        proba = np.c_[proba_neg_class, proba_pos_class]
        return proba

# Define the classifiers to compare
lr = LogisticRegressionCV(
    Cs=np.logspace(-6, 6, 101), cv=10, scoring="neg_log_loss", max_iter=1000
)
gnb = GaussianNB()
svc = NaivelyCalibratedLinearSVC(C=1.0)
rfc = RandomForestClassifier(random_state=42)

clf_list = [
    (lr, "Logistic Regression"),
    (gnb, "Naive Bayes"),
    (svc, "SVC"),
    (rfc, "Random forest"),
]

# Plotting
fig = plt.figure(figsize=(10, 10))
gs = GridSpec(4, 2)
colors = plt.get_cmap("Dark2")

ax_calibration_curve = fig.add_subplot(gs[:2, :2])
calibration_displays = {}
markers = ["^", "v", "s", "o"]
for i, (clf, name) in enumerate(clf_list):
    clf.fit(X_train, y_train)
    display = CalibrationDisplay.from_estimator(
        clf,
        X_test,
        y_test,
        n_bins=10,
        name=name,
        ax=ax_calibration_curve,
        color=colors(i),
        marker=markers[i],
    )
    calibration_displays[name] = display

ax_calibration_curve.grid()
ax_calibration_curve.set_title("Calibration plots")

# Add histograms of predicted probabilities
grid_positions = [(2, 0), (2, 1), (3, 0), (3, 1)]
for i, (_, name) in enumerate(clf_list):
    row, col = grid_positions[i]
    ax = fig.add_subplot(gs[row, col])
    ax.hist(
        calibration_displays[name].y_prob,
        range=(0, 1),
        bins=10,
        label=name,
        color=colors(i),
    )
    ax.set(title=name, xlabel="Mean predicted probability", ylabel="Count")

plt.tight_layout()
plt.show()

To recap the steps: the code generates data with make_classification, splits it into a tiny training set (100 samples) and a large test set, defines four classifiers (LogisticRegressionCV, GaussianNB, the custom NaivelyCalibratedLinearSVC, and RandomForestClassifier), fits each model, plots calibration curves with CalibrationDisplay.from_estimator, and visualizes histograms of the predicted probabilities with Matplotlib.
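Beyond the visual curves, calibration can be summarized numerically. As a possible extension (not part of the article's code), the Brier score and log loss condense calibration quality into single numbers, lower being better for both:

```python
# Sketch of a quantitative follow-up: score two of the compared classifiers
# with brier_score_loss and log_loss on a fresh synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, n_features=20, n_informative=2,
                           n_redundant=2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

for model, name in [(LogisticRegression(max_iter=1000), "Logistic Regression"),
                    (GaussianNB(), "Naive Bayes")]:
    y_prob = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(f"{name:20s} Brier: {brier_score_loss(y_te, y_prob):.4f}"
          f"  Log loss: {log_loss(y_te, y_prob):.4f}")
```

The same scoring loop could be applied to the article's clf_list to rank all four models on the held-out test set.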

Tags: machine learning, python, classification, calibration, scikit-learn, synthetic data
Written by

IT Services Circle

Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.
