Mastering Supervised Learning: From Linear Models to SVMs and Beyond
This overview of supervised learning introduces key concepts, model types, loss functions, optimization methods, linear and generalized linear models, support vector machines, generative approaches, tree-based and ensemble techniques, and foundational learning theory, giving AI practitioners a comprehensive starting point.
Supervised Learning
1.1 Introduction
Given a set of data points and their corresponding outputs, we aim to build a model that learns to predict the outputs from the inputs.
Prediction types: regression (continuous output) and classification (categorical output).
1.2 Symbols and Concepts
Hypothesis: the chosen model that maps inputs to predictions.
Loss function: measures the discrepancy between predicted values and true values.
Cost function: the average (or sum) of the loss over the entire training set; minimizing it with respect to the parameters is how the model is fit.
1.3 Optimization Methods
Gradient descent: updates parameters using the learning rate and the gradient of the cost function.
Note: Stochastic gradient descent (SGD) updates the parameters after each training example, batch gradient descent uses the entire training set per update, and mini-batch gradient descent uses a small subset of examples per update.
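As a minimal sketch of the batch vs. stochastic distinction, the following fits least-squares linear regression both ways; the synthetic data, learning rates, and iteration counts are illustrative choices, not prescribed values.

```python
# Batch vs. stochastic gradient descent on least-squares linear regression.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + 0.01 * rng.normal(size=100)

def batch_gd(X, y, lr=0.1, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # gradient over the full training set
        w -= lr * grad
    return w

def sgd(X, y, lr=0.05, epochs=20):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):  # one update per training example
            w -= lr * (X[i] @ w - y[i]) * X[i]
    return w

print(batch_gd(X, y))  # ≈ [2, -1]
print(sgd(X, y))       # ≈ [2, -1], with some stochastic noise
```

Both converge here; SGD trades exactness per step for much cheaper updates, which matters on large datasets.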
Likelihood: model likelihood is maximized to find optimal parameters; in practice the log‑likelihood is used for easier optimization.
Newton's method: a numerical technique that iteratively finds a root of a function; applied to the derivative of the log-likelihood, it gives the update θ ← θ − ℓ′(θ)/ℓ″(θ).
Note: The multidimensional generalization, known as the Newton‑Raphson method, replaces the second derivative with the Hessian: θ ← θ − H⁻¹∇ℓ(θ).
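A minimal root-finding sketch of the one-dimensional update x ← x − f(x)/f′(x); the target function f(x) = x² − 2 (whose root is √2) and the starting point are illustrative.

```python
# Newton's method for root finding: x_{n+1} = x_n - f(x_n) / f'(x_n).
def newton(f, fprime, x0, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:  # stop once the update is negligibly small
            break
    return x

root = newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0)
print(root)  # ≈ 1.41421356 (sqrt(2))
```

Quadratic convergence near the root is what makes the same idea attractive for maximizing log-likelihoods.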
1.4 Linear Models
1.4.1 Linear Regression
Normal equations provide a closed‑form solution for the minimum of the cost function.
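As a sketch of the closed form θ = (XᵀX)⁻¹Xᵀy, the following solves a tiny synthetic problem; the data (y = 1 + 2x exactly) is illustrative.

```python
# Normal equations for least-squares linear regression.
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # column of ones = intercept
y = np.array([1.0, 3.0, 5.0])                        # y = 1 + 2x exactly

# Solve (X^T X) theta = X^T y directly rather than inverting explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # [1., 2.]
```

Using a linear solve instead of an explicit inverse is the standard, more numerically stable choice.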
Least‑Mean‑Squares (LMS) algorithm updates the parameters after each training example by the learning rate times the prediction error, i.e., a stochastic‑gradient update on the squared loss.
Locally Weighted Regression (LWR) weights each training example's contribution to the cost by its proximity to the query point, yielding a locally fitted prediction.
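A minimal LWR sketch: each training point receives weight wᵢ = exp(−(xᵢ − x)² / (2τ²)) and a weighted least-squares fit is solved at the query point. The bandwidth τ and the sine-curve data are illustrative assumptions.

```python
# Locally weighted regression: solve a weighted least-squares fit per query.
import numpy as np

def lwr_predict(Xtr, ytr, x_query, tau=0.5):
    # Gaussian weights based on distance to the query point
    w = np.exp(-np.sum((Xtr - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(Xtr.T @ W @ Xtr, Xtr.T @ W @ ytr)
    return x_query @ theta

Xtr = np.column_stack([np.ones(20), np.linspace(0, 5, 20)])  # intercept + x
ytr = np.sin(Xtr[:, 1])                                      # nonlinear target

pred = lwr_predict(Xtr, ytr, np.array([1.0, 2.0]))
print(pred)  # close to sin(2) ≈ 0.909
```

Because a new fit is solved per query, LWR is a non-parametric method: the training set must be kept around at prediction time.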
1.4.2 Classification and Logistic Regression
Sigmoid (activation) function maps real‑valued inputs to the (0,1) interval.
Logistic regression models the probability of a binary class as the sigmoid of a linear function of the inputs; its maximum‑likelihood parameters have no closed‑form solution and are found iteratively (e.g., by gradient ascent or Newton's method).
Softmax regression extends logistic regression to multi‑class problems by modeling a categorical distribution.
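A minimal binary logistic regression sketch trained by gradient ascent on the log-likelihood; the synthetic linearly separable data, learning rate, and iteration count are illustrative.

```python
# Logistic regression via gradient ascent on the log-likelihood.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # linearly separable labels

theta = np.zeros(2)
for _ in range(500):
    # Gradient of the log-likelihood: X^T (y - sigmoid(X theta)) / n
    theta += 0.5 * X.T @ (y - sigmoid(X @ theta)) / len(y)

preds = (sigmoid(X @ theta) > 0.5).astype(float)
print((preds == y).mean())  # training accuracy near 1.0
```

The same gradient form, with softmax in place of sigmoid, carries over to the multi-class case.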
1.4.3 Generalized Linear Models (GLMs)
Exponential family distributions can be expressed with natural (canonical) parameters and a link function.
GLMs aim to predict a random variable y as a function of explanatory variables x under three key assumptions: y | x; θ follows an exponential‑family distribution with natural parameter η; the prediction is h(x) = E[y | x; θ]; and η is linear in the inputs, η = θᵀx.
Note: Ordinary least squares and logistic regression are special cases of GLMs.
1.5 Support Vector Machines (SVM)
Optimal‑margin classifier maximizes the minimum distance between the decision boundary and training points.
Hinge loss, max(0, 1 − y(wᵀx + b)), is the loss used to train soft‑margin SVMs.
Kernel trick computes inner products in a high‑dimensional feature space without explicit mapping; the Gaussian (RBF) kernel is most common.
Lagrangian formulation introduces Lagrange multipliers to enforce constraints in the optimization problem.
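As a sketch of the kernel trick, the Gaussian (RBF) kernel K(x, z) = exp(−‖x − z‖² / (2σ²)) computes an inner product in an infinite-dimensional feature space without ever forming that mapping; the bandwidth σ and toy points are illustrative.

```python
# Gaussian (RBF) kernel matrix between two sets of points.
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    # Pairwise squared distances via ||x||^2 + ||z||^2 - 2 x.z
    sq = np.sum(X ** 2, 1)[:, None] + np.sum(Z ** 2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0]])
K = rbf_kernel(X, X)
print(K)  # diagonal is 1; off-diagonal exp(-0.5) ≈ 0.6065
```

In a kernelized SVM, the optimization and the decision function depend on the data only through such kernel evaluations, which is why the explicit feature map is never needed.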
1.6 Generative Learning
1.6.1 Gaussian Discriminant Analysis (GDA)
Assumes class‑conditional Gaussian distributions with shared or distinct covariance matrices; parameters are estimated via maximum likelihood.
1.6.2 Naive Bayes
Assumes the features are conditionally independent given the class; maximum likelihood then yields simple closed‑form parameter estimates.
Note: Naive Bayes is extensively applied in document and text classification.
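A tiny Bernoulli naive Bayes sketch on word-presence vectors with Laplace smoothing; the toy "documents", vocabulary of three words, and labels are illustrative.

```python
# Bernoulli naive Bayes with Laplace smoothing on binary word features.
import numpy as np

X = np.array([[1, 1, 0],   # word-presence vectors (one row per document)
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([0, 0, 1, 1])  # class labels

def fit_nb(X, y):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        phi = (Xc.sum(0) + 1) / (len(Xc) + 2)  # Laplace-smoothed P(word | class)
        prior = len(Xc) / len(X)
        params[c] = (np.log(phi), np.log1p(-phi), np.log(prior))
    return params

def predict_nb(params, x):
    best, best_score = None, -np.inf
    for c, (log_phi, log_1mphi, log_prior) in params.items():
        # log P(c) + sum over features of log P(x_j | c), by independence
        score = log_prior + np.sum(x * log_phi + (1 - x) * log_1mphi)
        if score > best_score:
            best, best_score = c, score
    return best

params = fit_nb(X, y)
print(predict_nb(params, np.array([1, 1, 0])))  # → 0
```

The conditional-independence assumption is what collapses the joint likelihood into a simple per-feature product.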
1.7 Tree‑Based and Ensemble Methods
Decision Trees (CART) provide interpretable models for both regression and classification.
Boosting combines many weak learners into a strong learner by training each new learner to correct the errors of the current ensemble.
Random Forest builds an ensemble of decision trees using bootstrap sampling; it sacrifices some interpretability for improved performance.
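As a sketch of the bootstrap step underlying bagging and random forests, each tree sees a sample of size n drawn with replacement, leaving roughly a 1/e fraction of points "out of bag"; the seed and n are illustrative.

```python
# Bootstrap sampling: draw n indices with replacement, find out-of-bag points.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
idx = rng.integers(0, n, size=n)         # bootstrap sample indices
oob = np.setdiff1d(np.arange(n), idx)    # points never drawn ("out-of-bag")
print(len(oob) / n)  # ≈ 1/e ≈ 0.368 on average
```

Out-of-bag points give each tree a free held-out set, which random forests use to estimate generalization error without a separate validation split.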
1.8 Other Non‑Parametric Methods
k‑Nearest Neighbors (k‑NN) predicts a point’s label based on the majority label of its k closest neighbors.
Note: Larger k increases bias, while smaller k increases variance.
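A minimal k-NN classifier sketch using Euclidean distance and majority vote; the toy points and the choice k = 3 are illustrative.

```python
# k-nearest neighbors: majority vote among the k closest training points.
import numpy as np
from collections import Counter

def knn_predict(Xtr, ytr, x, k=3):
    dists = np.linalg.norm(Xtr - x, axis=1)   # Euclidean distances to x
    nearest = np.argsort(dists)[:k]           # indices of the k closest points
    return Counter(ytr[nearest]).most_common(1)[0][0]

Xtr = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 1.1]])
ytr = np.array([0, 0, 1, 1])
print(knn_predict(Xtr, ytr, np.array([0.2, 0.0]), k=3))  # → 0
```

Like LWR, k-NN is non-parametric: there is no training phase beyond storing the data, and all work happens at query time.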
1.9 Learning Theory
Union bound provides an upper bound on the probability of the union of events.
Hoeffding inequality (Chernoff bound) bounds the deviation of the sample mean from its expectation for bounded variables.
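In the standard form for m i.i.d. Bernoulli(φ) variables with empirical mean φ̂, the bound reads:

```latex
% Hoeffding's inequality for m i.i.d. Bernoulli(\phi) samples with
% empirical mean \hat{\phi} and tolerance \gamma > 0:
P\left(\left|\phi - \hat{\phi}\right| > \gamma\right)
  \le 2 \exp\left(-2\gamma^2 m\right)
```

The deviation probability decays exponentially in the sample size m, which is the engine behind the generalization bounds that follow.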
Training error (empirical risk) measures the average loss on the training set.
Probably Approximately Correct (PAC) framework assumes training and test samples are drawn i.i.d. from the same distribution.
Shattering describes the ability of a hypothesis class to realize all labelings on a set of points.
Upper bound theorem gives a probabilistic bound on the error based on hypothesis class complexity.
VC dimension quantifies the capacity of a hypothesis class by the size of the largest set it can shatter.
Vapnik’s theorem relates VC dimension, sample size, and generalization error.
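In its usual statement for empirical risk minimization, the bound reads:

```latex
% With probability at least 1 - \delta, for a hypothesis class H of
% VC dimension d and a training set of size m, the ERM hypothesis
% \hat{h} satisfies:
\varepsilon(\hat{h}) \le
  \left(\min_{h \in H} \varepsilon(h)\right)
  + O\left(\sqrt{\frac{d}{m}\log\frac{m}{d}
  + \frac{1}{m}\log\frac{1}{\delta}}\right)
```

Informally: to generalize well, the number of training examples should grow roughly linearly with the VC dimension of the hypothesis class.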
References
MLEveryday – https://github.com/MLEveryday/Machine-Learning-Cheatsheets
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".