Mastering Supervised Learning: From Linear Models to SVMs and Beyond
This overview of supervised learning introduces key concepts, model types, loss functions, optimization methods, linear and generalized linear models, support vector machines, generative approaches, tree-based and ensemble techniques, and foundational learning theory, giving AI practitioners a comprehensive starting point.
Supervised Learning
1.1 Introduction
Given a set of data points and their corresponding outputs, we aim to build a model that learns to predict the outputs from the inputs.
Prediction types: regression (continuous output) and classification (categorical output).
1.2 Symbols and Concepts
Hypothesis: the chosen model that maps inputs to predictions.
Loss function: measures the discrepancy between predicted values and true values.
Cost function: the average (or sum) of the loss over the entire training set; minimizing it with respect to the parameters is how the model is fit.
1.3 Optimization Methods
Gradient descent: updates parameters using the learning rate and the gradient of the cost function.
Note: Stochastic gradient descent (SGD) updates the parameters after each training example, batch gradient descent uses the entire training set per update, and mini-batch gradient descent uses a small subset of examples per update.
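As a minimal sketch of the batch vs. stochastic distinction, the following fits least-squares linear regression both ways; the synthetic data, learning rates, and iteration counts are illustrative choices, not prescribed values.

```python
# Batch vs. stochastic gradient descent on least-squares linear regression.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + 0.01 * rng.normal(size=100)

def batch_gd(X, y, lr=0.1, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # gradient over the full training set
        w -= lr * grad
    return w

def sgd(X, y, lr=0.05, epochs=20):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):  # one update per training example
            w -= lr * (X[i] @ w - y[i]) * X[i]
    return w

print(batch_gd(X, y))  # ≈ [2, -1]
print(sgd(X, y))       # ≈ [2, -1], with some stochastic noise
```

Both converge here; SGD trades exactness per step for much cheaper updates, which matters on large datasets.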
Likelihood: model likelihood is maximized to find optimal parameters; in practice the log‑likelihood is used for easier optimization.
Newton's method: a numerical technique that iteratively finds a root of a function; applied to the derivative of the log-likelihood, it gives the update θ ← θ − ℓ′(θ)/ℓ″(θ).
Note: The multidimensional generalization, known as the Newton‑Raphson method, replaces the second derivative with the Hessian: θ ← θ − H⁻¹∇ℓ(θ).
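A minimal root-finding sketch of the one-dimensional update x ← x − f(x)/f′(x); the target function f(x) = x² − 2 (whose root is √2) and the starting point are illustrative.

```python
# Newton's method for root finding: x_{n+1} = x_n - f(x_n) / f'(x_n).
def newton(f, fprime, x0, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:  # stop once the update is negligibly small
            break
    return x

root = newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0)
print(root)  # ≈ 1.41421356 (sqrt(2))
```

Quadratic convergence near the root is what makes the same idea attractive for maximizing log-likelihoods.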
1.4 Linear Models
1.4.1 Linear Regression
Normal equations provide a closed‑form solution for the minimum of the cost function.
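As a sketch of the closed form θ = (XᵀX)⁻¹Xᵀy, the following solves a tiny synthetic problem; the data (y = 1 + 2x exactly) is illustrative.

```python
# Normal equations for least-squares linear regression.
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # column of ones = intercept
y = np.array([1.0, 3.0, 5.0])                        # y = 1 + 2x exactly

# Solve (X^T X) theta = X^T y directly rather than inverting explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # [1., 2.]
```

Using a linear solve instead of an explicit inverse is the standard, more numerically stable choice.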
Least‑Mean‑Squares (LMS) algorithm updates the parameters after each training example by the learning rate times the prediction error, i.e., a stochastic‑gradient update on the squared loss.
Locally Weighted Regression (LWR) weights each training example's contribution to the cost by its proximity to the query point, yielding a locally fitted prediction.
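A minimal LWR sketch: each training point receives weight wᵢ = exp(−(xᵢ − x)² / (2τ²)) and a weighted least-squares fit is solved at the query point. The bandwidth τ and the sine-curve data are illustrative assumptions.

```python
# Locally weighted regression: solve a weighted least-squares fit per query.
import numpy as np

def lwr_predict(Xtr, ytr, x_query, tau=0.5):
    # Gaussian weights based on distance to the query point
    w = np.exp(-np.sum((Xtr - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(Xtr.T @ W @ Xtr, Xtr.T @ W @ ytr)
    return x_query @ theta

Xtr = np.column_stack([np.ones(20), np.linspace(0, 5, 20)])  # intercept + x
ytr = np.sin(Xtr[:, 1])                                      # nonlinear target

pred = lwr_predict(Xtr, ytr, np.array([1.0, 2.0]))
print(pred)  # close to sin(2) ≈ 0.909
```

Because a new fit is solved per query, LWR is a non-parametric method: the training set must be kept around at prediction time.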
1.4.2 Classification and Logistic Regression
Sigmoid (activation) function maps real‑valued inputs to the (0,1) interval.
Logistic regression models the probability of a binary class as the sigmoid of a linear function of the inputs; its maximum‑likelihood parameters have no closed‑form solution and are found iteratively (e.g., by gradient ascent or Newton's method).
Softmax regression extends logistic regression to multi‑class problems by modeling a categorical distribution.
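A minimal binary logistic regression sketch trained by gradient ascent on the log-likelihood; the synthetic linearly separable data, learning rate, and iteration count are illustrative.

```python
# Logistic regression via gradient ascent on the log-likelihood.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # linearly separable labels

theta = np.zeros(2)
for _ in range(500):
    # Gradient of the log-likelihood: X^T (y - sigmoid(X theta)) / n
    theta += 0.5 * X.T @ (y - sigmoid(X @ theta)) / len(y)

preds = (sigmoid(X @ theta) > 0.5).astype(float)
print((preds == y).mean())  # training accuracy near 1.0
```

The same gradient form, with softmax in place of sigmoid, carries over to the multi-class case.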
1.4.3 Generalized Linear Models (GLMs)
Exponential family distributions can be expressed with natural (canonical) parameters and a link function.
GLMs aim to predict a random variable y as a function of explanatory variables x under three key assumptions: y | x; θ follows an exponential‑family distribution with natural parameter η; the prediction is h(x) = E[y | x; θ]; and η is linear in the inputs, η = θᵀx.
Note: Ordinary least squares and logistic regression are special cases of GLMs.
1.5 Support Vector Machines (SVM)
Optimal‑margin classifier maximizes the minimum distance between the decision boundary and training points.
Hinge loss, max(0, 1 − y(wᵀx + b)), is the loss used to train soft‑margin SVMs.
Kernel trick computes inner products in a high‑dimensional feature space without explicit mapping; the Gaussian (RBF) kernel is most common.
Lagrangian formulation introduces Lagrange multipliers to enforce constraints in the optimization problem.
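As a sketch of the kernel trick, the Gaussian (RBF) kernel K(x, z) = exp(−‖x − z‖² / (2σ²)) computes an inner product in an infinite-dimensional feature space without ever forming that mapping; the bandwidth σ and toy points are illustrative.

```python
# Gaussian (RBF) kernel matrix between two sets of points.
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    # Pairwise squared distances via ||x||^2 + ||z||^2 - 2 x.z
    sq = np.sum(X ** 2, 1)[:, None] + np.sum(Z ** 2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0]])
K = rbf_kernel(X, X)
print(K)  # diagonal is 1; off-diagonal exp(-0.5) ≈ 0.6065
```

In a kernelized SVM, the optimization and the decision function depend on the data only through such kernel evaluations, which is why the explicit feature map is never needed.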
1.6 Generative Learning
1.6.1 Gaussian Discriminant Analysis (GDA)
Assumes class‑conditional Gaussian distributions with shared or distinct covariance matrices; parameters are estimated via maximum likelihood.
1.6.2 Naive Bayes
Assumes the features are conditionally independent given the class; maximum likelihood then yields simple closed‑form parameter estimates.
Note: Naive Bayes is extensively applied in document and text classification.
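A tiny Bernoulli naive Bayes sketch on word-presence vectors with Laplace smoothing; the toy "documents", vocabulary of three words, and labels are illustrative.

```python
# Bernoulli naive Bayes with Laplace smoothing on binary word features.
import numpy as np

X = np.array([[1, 1, 0],   # word-presence vectors (one row per document)
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([0, 0, 1, 1])  # class labels

def fit_nb(X, y):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        phi = (Xc.sum(0) + 1) / (len(Xc) + 2)  # Laplace-smoothed P(word | class)
        prior = len(Xc) / len(X)
        params[c] = (np.log(phi), np.log1p(-phi), np.log(prior))
    return params

def predict_nb(params, x):
    best, best_score = None, -np.inf
    for c, (log_phi, log_1mphi, log_prior) in params.items():
        # log P(c) + sum over features of log P(x_j | c), by independence
        score = log_prior + np.sum(x * log_phi + (1 - x) * log_1mphi)
        if score > best_score:
            best, best_score = c, score
    return best

params = fit_nb(X, y)
print(predict_nb(params, np.array([1, 1, 0])))  # → 0
```

The conditional-independence assumption is what collapses the joint likelihood into a simple per-feature product.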
1.7 Tree‑Based and Ensemble Methods
Decision Trees (CART) provide interpretable models for both regression and classification.
Boosting combines many weak learners into a strong learner by training each new learner to correct the errors of the current ensemble.
Random Forest builds an ensemble of decision trees using bootstrap sampling; it sacrifices some interpretability for improved performance.
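As a sketch of the bootstrap step underlying bagging and random forests, each tree sees a sample of size n drawn with replacement, leaving roughly a 1/e fraction of points "out of bag"; the seed and n are illustrative.

```python
# Bootstrap sampling: draw n indices with replacement, find out-of-bag points.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
idx = rng.integers(0, n, size=n)         # bootstrap sample indices
oob = np.setdiff1d(np.arange(n), idx)    # points never drawn ("out-of-bag")
print(len(oob) / n)  # ≈ 1/e ≈ 0.368 on average
```

Out-of-bag points give each tree a free held-out set, which random forests use to estimate generalization error without a separate validation split.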
1.8 Other Non‑Parametric Methods
k‑Nearest Neighbors (k‑NN) predicts a point’s label based on the majority label of its k closest neighbors.
Note: Larger k increases bias, while smaller k increases variance.
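A minimal k-NN classifier sketch using Euclidean distance and majority vote; the toy points and the choice k = 3 are illustrative.

```python
# k-nearest neighbors: majority vote among the k closest training points.
import numpy as np
from collections import Counter

def knn_predict(Xtr, ytr, x, k=3):
    dists = np.linalg.norm(Xtr - x, axis=1)   # Euclidean distances to x
    nearest = np.argsort(dists)[:k]           # indices of the k closest points
    return Counter(ytr[nearest]).most_common(1)[0][0]

Xtr = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 1.1]])
ytr = np.array([0, 0, 1, 1])
print(knn_predict(Xtr, ytr, np.array([0.2, 0.0]), k=3))  # → 0
```

Like LWR, k-NN is non-parametric: there is no training phase beyond storing the data, and all work happens at query time.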
1.9 Learning Theory
Union bound provides an upper bound on the probability of the union of events.
Hoeffding inequality (Chernoff bound) bounds the deviation of the sample mean from its expectation for bounded variables.
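In the standard form for m i.i.d. Bernoulli(φ) variables with empirical mean φ̂, the bound reads:

```latex
% Hoeffding's inequality for m i.i.d. Bernoulli(\phi) samples with
% empirical mean \hat{\phi} and tolerance \gamma > 0:
P\left(\left|\phi - \hat{\phi}\right| > \gamma\right)
  \le 2 \exp\left(-2\gamma^2 m\right)
```

The deviation probability decays exponentially in the sample size m, which is the engine behind the generalization bounds that follow.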
Training error (empirical risk) measures the average loss on the training set.
Probably Approximately Correct (PAC) framework assumes training and test samples are drawn i.i.d. from the same distribution.
Shattering describes the ability of a hypothesis class to realize all labelings on a set of points.
Upper bound theorem gives a probabilistic bound on the error based on hypothesis class complexity.
VC dimension quantifies the capacity of a hypothesis class by the size of the largest set it can shatter.
Vapnik’s theorem relates VC dimension, sample size, and generalization error.
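In its usual statement for empirical risk minimization, the bound reads:

```latex
% With probability at least 1 - \delta, for a hypothesis class H of
% VC dimension d and a training set of size m, the ERM hypothesis
% \hat{h} satisfies:
\varepsilon(\hat{h}) \le
  \left(\min_{h \in H} \varepsilon(h)\right)
  + O\left(\sqrt{\frac{d}{m}\log\frac{m}{d}
  + \frac{1}{m}\log\frac{1}{\delta}}\right)
```

Informally: to generalize well, the number of training examples should grow roughly linearly with the VC dimension of the hypothesis class.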
References
MLEveryday – https://github.com/MLEveryday/Machine-Learning-Cheatsheets
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".