What Classic Diagrams Reveal About Test Error, Overfitting, and Model Selection

The article presents a series of insightful diagrams that illustrate core machine‑learning concepts such as the relationship between training and test error, the dangers of under‑ and over‑fitting, Occam’s razor, feature interactions, discriminative versus generative models, loss functions, least‑squares geometry, and sparsity.

MaGe Linux Operations

Sep 21, 2018

Key Machine Learning Diagrams

When explaining basic machine‑learning concepts, I often return to a handful of illustrative diagrams. Below is a list of the most insightful ones.

Test and training error

Why a low training error is not always desirable: the figure plots training and test error against model complexity. Training error keeps falling as complexity grows, while test error eventually rises again.

Under and overfitting

Examples of under‑fitting and over‑fitting: polynomial fits of varying degree M are shown in red, and the green curve is the function that generated the data.
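The effect of polynomial degree on training and test error can be reproduced numerically. The sketch below is only an illustration under assumptions that are not in the article: data are sampled from a noisy sine curve, the noise level is 0.3, and the degrees 0, 1, 3 and 9 are chosen to mimic the classic version of this figure.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    # Assumed data source: sin(2*pi*x) plus Gaussian noise.
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, n)
    return x, y

x_train, y_train = make_data(10)
x_test, y_test = make_data(100)

def rms_error(model, x, y):
    # Root-mean-square error of the fitted polynomial on (x, y).
    return float(np.sqrt(np.mean((model(x) - y) ** 2)))

train_err, test_err = {}, {}
for degree in (0, 1, 3, 9):
    model = Polynomial.fit(x_train, y_train, degree)
    train_err[degree] = rms_error(model, x_train, y_train)
    test_err[degree] = rms_error(model, x_test, y_test)
    print(degree, round(train_err[degree], 3), round(test_err[degree], 3))
```

With 10 training points, the degree‑9 polynomial passes through every training point, so its training error is essentially zero even though it has fitted the noise. Comparing the two printed columns shows why training error alone is a poor guide to model selection.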

Occam’s razor

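The diagram shows how Bayesian inference embodies Occam’s razor. A simple model (H1) spreads its probability over a narrow range of possible datasets, so when the observed data fall in that range it has higher evidence than a more complex model (H2), which must spread its probability over many more datasets. With equal priors, the simple model is then the more probable one.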
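A small worked example can make the evidence comparison concrete. The coin‑flip setup below is an assumption chosen for illustration, not something from the diagram: H1 is a fair coin with no free parameters, and H2 is a coin with unknown bias under a uniform prior.

```python
from math import factorial

def evidence_fair(heads, tails):
    # H1: every specific sequence of flips has probability 0.5 ** n.
    return 0.5 ** (heads + tails)

def evidence_unknown_bias(heads, tails):
    # H2: integrate p**heads * (1 - p)**tails over a uniform prior on p,
    # which equals heads! * tails! / (heads + tails + 1)!.
    return factorial(heads) * factorial(tails) / factorial(heads + tails + 1)

for heads, tails in [(5, 5), (9, 1)]:
    e1 = evidence_fair(heads, tails)
    e2 = evidence_unknown_bias(heads, tails)
    print(heads, tails, e1, e2, "H1" if e1 > e2 else "H2")
```

For balanced data the simpler model has the higher evidence, because the flexible model has spent probability on lopsided sequences that did not occur. For lopsided data the flexible model wins.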

Feature combinations

Why features that look irrelevant on their own can be relevant in combination, and why linear methods may fail on such data, as illustrated in Isabelle Guyon’s feature‑extraction slides.
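The XOR pattern is a standard minimal instance of this situation. The sketch below uses it as an assumed stand‑in for the data in the slides: each feature alone is uncorrelated with the label, a linear fit predicts a constant, and adding the product of the two features makes the problem exactly solvable.

```python
import numpy as np

# Four XOR points: the label is 1 when exactly one feature is 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Each feature on its own carries no linear information about y.
corr_x1 = np.corrcoef(X[:, 0], y)[0, 1]
corr_x2 = np.corrcoef(X[:, 1], y)[0, 1]

# Linear least squares with an intercept.
A_lin = np.column_stack([np.ones(len(y)), X])
w_lin, *_ = np.linalg.lstsq(A_lin, y, rcond=None)
pred_lin = A_lin @ w_lin

# Same fit after adding the interaction feature x1 * x2.
A_int = np.column_stack([A_lin, X[:, 0] * X[:, 1]])
w_int, *_ = np.linalg.lstsq(A_int, y, rcond=None)
pred_int = A_int @ w_int

print(corr_x1, corr_x2)
print(pred_lin)
print(pred_int)
```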

Irrelevant features

Irrelevant features can degrade K‑Nearest Neighbors, clustering, and other similarity‑based methods: the right‑hand plot adds an irrelevant axis that destroys the grouping seen in the left‑hand plot and makes many points nearest neighbors of the opposite class.
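The damage to nearest‑neighbor methods can be measured with a leave‑one‑out 1‑NN check. The sketch below rests on assumptions made for illustration only: two classes separated on one axis, and an unrelated Gaussian axis on a much larger scale so that it dominates the Euclidean distance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Relevant feature: the two classes occupy disjoint intervals.
relevant = np.concatenate([rng.uniform(0.0, 1.0, n), rng.uniform(2.0, 3.0, n)])
labels = np.repeat([0, 1], n)

# Irrelevant feature: pure noise, unrelated to the label (assumed scale).
irrelevant = rng.normal(0.0, 1000.0, 2 * n)

def loo_1nn_accuracy(X, y):
    # Pairwise Euclidean distances; a point may not be its own neighbor.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nearest = d.argmin(axis=1)
    return float(np.mean(y[nearest] == y))

acc_clean = loo_1nn_accuracy(relevant[:, None], labels)
acc_noisy = loo_1nn_accuracy(np.column_stack([irrelevant, relevant]), labels)
print(acc_clean, acc_noisy)
```

Rescaling or dropping the noisy axis restores the original grouping, which is one reason feature scaling and feature selection matter for distance‑based methods.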

Basis functions

Non‑linear basis functions transform a low‑dimensional non‑linear classification problem into a high‑dimensional linear one, as shown in Andrew Moore’s SVM tutorial (e.g., mapping x to (x, x²)).
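The mapping from x to (x, x²) can be checked on a toy dataset. The points and the boundary below are assumptions picked for illustration: a point is positive when |x| > 1, so no single threshold on x separates the classes, but the line x² = 1 does in the mapped space.

```python
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
labels = np.abs(x) > 1.0

def best_threshold_accuracy(values, labels):
    # Try every cut between sorted values, in both directions.
    cuts = np.unique(values)
    candidates = np.concatenate([[-np.inf], (cuts[:-1] + cuts[1:]) / 2.0])
    best = 0.0
    for t in candidates:
        acc = float(np.mean((values > t) == labels))
        best = max(best, acc, 1.0 - acc)
    return best

# In one dimension, the best linear rule is a single threshold on x.
best_1d = best_threshold_accuracy(x, labels)

# In the mapped space z = (x, x**2), a linear rule w . z + b > 0 suffices.
z = np.column_stack([x, x ** 2])
w = np.array([0.0, 1.0])
b = -1.0
pred_2d = z @ w + b > 0

print(best_1d, np.array_equal(pred_2d, labels))
```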

Discriminative vs. Generative

Discriminative learning can be easier than generative learning: the left plot shows the class‑conditional densities, where the left‑hand mode of p(x|C₁) (blue curve) has no effect on the posterior probabilities, while the right plot shows the posteriors and the decision boundary (green line) that minimizes the misclassification rate.
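The point about density detail that does not matter for classification can be illustrated numerically. The densities below are assumed for the sketch and are not taken from the figure: p(x|C₁) is a two‑component Gaussian mixture, p(x|C₂) is a single Gaussian, and the priors are equal. Moving the far‑left mode of p(x|C₁) leaves the decision boundary where it was.

```python
import numpy as np

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def decision_boundary(left_mode_mean):
    # Grid over the region where the two classes overlap.
    x = np.linspace(0.0, 3.0, 3001)
    p_x_c1 = 0.5 * normal_pdf(x, left_mode_mean, 0.5) + 0.5 * normal_pdf(x, 1.0, 0.5)
    p_x_c2 = normal_pdf(x, 2.0, 0.5)
    # Bayes' rule with equal priors p(C1) = p(C2).
    posterior_c1 = p_x_c1 / (p_x_c1 + p_x_c2)
    # The boundary is where the posterior crosses 0.5.
    return float(x[np.argmin(np.abs(posterior_c1 - 0.5))])

boundary_a = decision_boundary(-3.0)
boundary_b = decision_boundary(-6.0)
print(boundary_a, boundary_b)
```

A generative model would spend capacity describing the left‑hand mode accurately, while a discriminative model only needs the region around the boundary.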

Loss functions

Learning algorithms can be viewed as optimizing different loss functions: the hinge loss for SVM (blue), the logistic‑regression loss rescaled so that it passes through the point (0, 1) (red), misclassification loss (black), and squared error (green).
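The four curves can be written as functions of the margin z = y·f(x) for labels y in {-1, +1}. This is a sketch of the standard textbook forms, which are assumed here to match the figure; the logistic loss is divided by ln 2 so that it equals 1 at z = 0.

```python
import numpy as np

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)

def scaled_logistic_loss(z):
    # ln(1 + exp(-z)) / ln(2), so the curve passes through (0, 1).
    return np.log1p(np.exp(-z)) / np.log(2.0)

def misclassification_loss(z):
    return (z < 0).astype(float)

def squared_loss(z):
    # For y in {-1, +1}: (y - f)**2 equals (1 - y*f)**2.
    return (1.0 - z) ** 2

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
for name, fn in [("hinge", hinge_loss), ("logistic", scaled_logistic_loss),
                 ("0-1", misclassification_loss), ("squared", squared_loss)]:
    print(name, fn(z))
```

The printed values show one notable difference: squared error grows again for z > 1, so it penalizes points that are classified correctly with a large margin, while the hinge loss is exactly zero there.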

Geometry of least squares

The figure shows the N‑dimensional geometry of least‑squares regression with two predictors: the response vector y is orthogonally projected onto the plane spanned by input vectors x₁ and x₂.
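The projection picture can be verified directly: the residual of a least‑squares fit is orthogonal to every input vector. The numbers below are made up for the sketch.

```python
import numpy as np

# Two predictors observed on N = 4 samples (hypothetical values).
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 1.0, 0.0])
X = np.column_stack([x1, x2])
y = np.array([2.0, 1.0, 4.0, 3.5])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta          # projection of y onto span(x1, x2)
residual = y - y_hat

# Orthogonality: X^T (y - y_hat) = 0, the normal equations.
print(X.T @ residual)
# Pythagoras: |y|^2 = |y_hat|^2 + |residual|^2.
print(y @ y, y_hat @ y_hat + residual @ residual)
```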

Sparsity

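Lasso (L₁ regularization or Laplace prior) yields sparse solutions with many zero coefficients. The left plot shows the Lasso estimate, the right plot shows ridge regression; the red ellipses are contours of the least‑squares error, while the blue region denotes the constraint |β₁|+|β₂| ≤ t (Lasso) or β₁²+β₂² ≤ t² (ridge). Because the Lasso region is a diamond with corners on the axes, the contours often first touch it at a corner, where one coefficient is exactly zero; the ridge disk has no corners, so its solution rarely lands on an axis.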
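The difference in sparsity shows up most simply when the inputs are orthonormal, because both estimators then have closed forms in terms of the least‑squares coefficients. The sketch below assumes that special case and uses the penalized forms of the two problems, ½‖y − Xβ‖² + λ‖β‖₁ for the Lasso and ½‖y − Xβ‖² + (λ/2)‖β‖² for ridge, rather than the constrained forms drawn in the figure. The coefficient values and λ are made up.

```python
import numpy as np

def lasso_orthonormal(beta_ols, lam):
    # Soft thresholding: shrink toward zero and clip at zero.
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

def ridge_orthonormal(beta_ols, lam):
    # Proportional shrinkage: never exactly zero for a nonzero input.
    return beta_ols / (1.0 + lam)

beta_ols = np.array([3.0, 0.4, -1.5, 0.1])  # hypothetical least-squares fit
lam = 0.5

beta_lasso = lasso_orthonormal(beta_ols, lam)
beta_ridge = ridge_orthonormal(beta_ols, lam)
print(beta_lasso, int(np.sum(beta_lasso == 0.0)))
print(beta_ridge, int(np.sum(beta_ridge == 0.0)))
```

Coefficients smaller in magnitude than λ are set exactly to zero by the Lasso, while ridge only scales them down.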
