Mastering SVM: How Kernel Functions and Slack Variables Enable Perfect Classification
This article explains how kernel functions and slack variables enable Support Vector Machines to achieve zero training error on linearly inseparable data. It poses three theoretical questions, covering Gaussian kernels, error-free classification without slack variables, and the impact of the regularization parameter \(C\) when training with SMO, and then works through detailed analytical solutions.
Scenario Description
When data is linearly inseparable, kernel functions implicitly map it into a higher-dimensional space where it becomes separable, while slack variables allow a few outliers to be misclassified, making the classifier more robust.
Problem Description
1. Using a Gaussian kernel, prove that if no two training points coincide, there exist parameters \(\alpha_1,\dots,\alpha_m\), \(b\) and \(\gamma\) such that the SVM training error is zero.
2. If we train an SVM without slack variables using the \(\gamma\) from (1), can we still guarantee zero training error? Explain.
3. When training an SVM with slack variables using SMO and an unknown penalty \(C\), can we still achieve zero training error? Explain.
Prior Knowledge
SVM training process, kernel functions, SMO algorithm.
Solution and Analysis
1.
According to SVM theory, the decision function with a Gaussian kernel can be written as
\[
f(x) = \sum_{i=1}^{m} \alpha_i y_i \exp\!\left(-\frac{\|x - x_i\|^2}{\gamma^2}\right) + b,
\]
where \((x_i, y_i),\ i = 1, \dots, m\), are the training samples and \(\{\alpha_i\}, b, \gamma\) are the parameters. Since no two training points coincide, there exists \(\varepsilon > 0\) such that \(\|x_i - x_j\| \ge \varepsilon\) for all \(i \neq j\). Set \(\alpha_i = 1\) for every \(i\) and \(b = 0\). Substituting any training sample \(x_j\) into the decision function gives
\[
f(x_j) = y_j + \sum_{i \neq j} y_i \exp\!\left(-\frac{\|x_j - x_i\|^2}{\gamma^2}\right),
\]
so that
\[
\bigl|f(x_j) - y_j\bigr| \le \sum_{i \neq j} \exp\!\left(-\frac{\|x_j - x_i\|^2}{\gamma^2}\right) \le (m-1)\, e^{-\varepsilon^2/\gamma^2}.
\]
Choosing \(\gamma = \varepsilon / \sqrt{\ln 2m}\) makes each exponential term at most \(1/(2m)\), so \(|f(x_j) - y_j| \le (m-1)/(2m) < 1/2 < 1\). Hence \(\operatorname{sign}(f(x_j)) = y_j\) for every training sample, and the training error is zero.
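To make the argument concrete, here is a minimal numerical sketch, assuming NumPy and randomly generated non-coincident points (the data, seed, and variable names are illustrative, not from the original problem). It builds the decision function with \(\alpha_i = 1\), \(b = 0\), and \(\gamma = \varepsilon/\sqrt{\ln 2m}\), and checks that the training error is zero:

```python
import numpy as np

# Illustrative data: m distinct points with arbitrary labels (hypothetical setup).
rng = np.random.default_rng(0)
m, d = 50, 2
X = rng.normal(size=(m, d))          # distinct points with probability 1
y = rng.choice([-1.0, 1.0], size=m)  # arbitrary labels

# epsilon: smallest pairwise distance between distinct training points.
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
eps = dists[~np.eye(m, dtype=bool)].min()

# Parameters from the proof: alpha_i = 1, b = 0, and gamma chosen so that
# exp(-eps^2 / gamma^2) = 1 / (2m).
gamma = eps / np.sqrt(np.log(2 * m))

# Decision function f(x_j) = sum_i alpha_i y_i exp(-||x_j - x_i||^2 / gamma^2).
K = np.exp(-(dists ** 2) / gamma ** 2)
f = K @ y

# Each |f(x_j) - y_j| < 1/2, so sign(f) matches y and the training error is 0.
assert np.all(np.sign(f) == y)
print("training error:", np.mean(np.sign(f) != y))  # -> 0.0
```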
2.
To guarantee zero training error, it suffices to show that the hard-margin problem is feasible, since without slack variables every training sample must satisfy the constraint \(y_j f(x_j) \ge 1\), which forces \(\operatorname{sign}(f(x_j)) = y_j\). The parameters from (1) give \(|f(x_j) - y_j| < 1/2\), i.e. \(y_j f(x_j) > 1/2\), which does not yet meet the constraint; but doubling every \(\alpha_i\) (keeping \(b = 0\) and the same \(\gamma\)) scales the decision function by 2, so \(y_j f(x_j) > 1\) for all samples. The hard-margin SVM is therefore feasible, and its optimal solution, like every feasible one, classifies all training samples correctly. Zero training error can thus still be guaranteed without slack variables.
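As a sanity check on this claim, a soft-margin solver with a very large \(C\) approximates the hard-margin SVM. The sketch below assumes scikit-learn and reuses the illustrative data and \(\gamma\) from the previous snippet; note that scikit-learn's `gamma` parameter corresponds to \(1/\gamma^2\) in our notation:

```python
from sklearn.svm import SVC

# A huge C approximates the hard-margin (no-slack) SVM.
# sklearn's RBF kernel is exp(-g * ||x - z||^2), so pass g = 1 / gamma^2
# to match K(x, z) = exp(-||x - z||^2 / gamma^2) used above.
clf = SVC(kernel="rbf", C=1e10, gamma=1.0 / gamma ** 2)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))  # expected: 1.0
```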
3.
When training with slack variables using SMO, the objective becomes \(\frac{1}{2}\|w\|^2 + C\sum_i \xi_i\), where \(C\) weights the penalty on the slack variables \(\xi_i\). If \(C\) is small, margin violations are cheap, and the optimizer may prefer a larger margin at the cost of non-zero slack (i.e., non-zero training error). In the extreme case \(C = 0\), the optimal solution is \(w = 0\), whose constant prediction clearly does not yield zero training error. Therefore, zero training error is not guaranteed when \(C\) is unknown or chosen arbitrarily.
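The effect of \(C\) can be seen empirically. Continuing with the same illustrative data and scikit-learn setup as above, a very small \(C\) makes slack nearly free and typically leaves a large training error, while a large \(C\) recovers the error-free solution (scikit-learn requires \(C > 0\), so a tiny value stands in for \(C = 0\)):

```python
from sklearn.svm import SVC

for C in (1e-6, 1e10):
    clf = SVC(kernel="rbf", C=C, gamma=1.0 / gamma ** 2)
    clf.fit(X, y)
    print(f"C={C:g}  training error={1.0 - clf.score(X, y):.3f}")
# Expected pattern: a tiny C tolerates large slack and misclassifies many
# points; a huge C drives the training error to zero.
```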
