
Unlock Hidden Patterns: A Deep Dive into Unsupervised Learning Techniques

This article introduces unsupervised learning, covering its motivation, Jensen's inequality, key clustering methods such as EM, k‑means, hierarchical clustering, evaluation metrics, and dimensionality‑reduction techniques like PCA and ICA, providing clear explanations and illustrative diagrams.


Unsupervised Learning

1.1 Introduction

Motivation

The goal of unsupervised learning is to discover hidden patterns in data that lack labels.

Jensen’s inequality: For a convex function f and a random variable X, the inequality f(E[X]) ≤ E[f(X)] holds, providing a fundamental bound used in many learning algorithms. For a concave function such as log, the inequality reverses: f(E[X]) ≥ E[f(X)].

1.2 Clustering

1.2.1 Expectation‑Maximization (EM)

Latent variables

Latent variables are hidden or unobservable variables that make estimation problems more difficult; they are typically denoted by z. The most common setting is a mixture model, in which each observation x is generated conditional on an unobserved cluster assignment z.

EM iteratively builds a lower bound on the likelihood (E‑step) and then maximizes this bound (M‑step) to obtain maximum‑likelihood estimates of the parameters.
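In symbols, the lower bound comes from Jensen’s inequality applied to the concave log: for any distribution Q(z) over the latent variables,

```latex
\log p(x;\theta)
  = \log \sum_{z} Q(z)\,\frac{p(x,z;\theta)}{Q(z)}
  \;\ge\; \sum_{z} Q(z)\,\log \frac{p(x,z;\theta)}{Q(z)},
```

with equality when Q(z) = p(z | x; θ), which is exactly the posterior computed in the E-step.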

E‑step: Estimate the posterior probability of each data point belonging to each cluster.

M‑step: Use the posterior probabilities as weights to re‑estimate the cluster parameters.
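As an illustration, the two steps can be sketched for a one-dimensional Gaussian mixture with plain NumPy. This is a minimal sketch, not a production implementation; the quantile-based initialization is a simplification chosen for stability.

```python
import numpy as np

def gmm_em_1d(x, k, n_iter=100):
    """EM for a one-dimensional Gaussian mixture model."""
    n = len(x)
    # Initialization (a simplification): uniform weights, quantile-spread
    # means, and a shared variance equal to the overall variance.
    pi = np.full(k, 1.0 / k)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    var = np.full(k, np.var(x))
    for _ in range(n_iter):
        # E-step: posterior probability (responsibility) of each component
        # for each data point, via Bayes' rule on the Gaussian densities.
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances with the
        # responsibilities as weights.
        nk = resp.sum(axis=0)
        pi = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

# Two well-separated Gaussians; EM should recover means near 0 and 5.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(5, 1, 500)])
pi, mu, var = gmm_em_1d(x, k=2)
```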

1.2.2 k‑means clustering

Let x_i denote the i-th point in a cluster and μ the cluster centroid. k-means seeks the assignments and centroids that minimize the within-cluster sum of squared distances ||x_i − μ||².

Algorithm

The algorithm randomly initializes centroids, then alternates between assigning each point to its nearest centroid and updating each centroid to the mean of its assigned points, until the assignments stop changing.
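A minimal NumPy sketch of this loop (the farthest-first initialization used here is one common choice, not the only one):

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Plain k-means: assign points to the nearest centroid, then update centroids."""
    # Farthest-first initialization: start from the first point, then
    # repeatedly add the point farthest from all chosen centroids.
    centroids = [X[0]]
    for _ in range(1, k):
        d = np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2).min(axis=1)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return centroids, labels

# Two blobs around (0, 0) and (10, 10).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(10, 1, (100, 2))])
centroids, labels = kmeans(X, k=2)
```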

1.2.3 Hierarchical clustering

Algorithm

Hierarchical clustering builds nested clusters by repeatedly merging or splitting groups, producing a dendrogram that represents the hierarchy.

Types

Different linkage criteria optimize different objectives:

Ward linkage – merges the pair of clusters whose union yields the smallest increase in within-cluster variance.

Average linkage – merges the pair of clusters with the smallest average pairwise distance between their points.

Complete linkage – merges the pair of clusters with the smallest maximum pairwise distance between their points.
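Assuming SciPy is available, the merge-and-cut procedure can be sketched as follows; the two-blob data is illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two tight, well-separated blobs.
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

# Agglomerative step: build the full merge tree (dendrogram) with Ward linkage.
Z = linkage(X, method="ward")
# Cut the tree into a flat assignment with exactly 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```

Swapping `method="ward"` for `"average"` or `"complete"` applies the other linkage criteria listed above.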

1.2.4 Clustering evaluation metrics

In unsupervised settings, evaluating model performance is challenging because ground‑truth labels are unavailable.

Silhouette coefficient

For a sample, let a be the average distance to the other points in its own cluster and b the average distance to the points in the nearest neighboring cluster; the silhouette coefficient is (b − a) / max(a, b). It ranges from −1 to 1, with values near 1 indicating well-separated clusters.

Calinski‑Harabasz index

Let N be the number of samples, K the number of clusters, and B and W the between-cluster and within-cluster dispersion matrices. The index is calculated as (trace(B)/(K − 1)) / (trace(W)/(N − K)), where higher values indicate more distinct and compact clusters.
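Assuming scikit-learn is available, both metrics can be computed for a given clustering; the two-blob data and matching labels here are illustrative.

```python
import numpy as np
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(0)
# Two tight, well-separated blobs and a clustering that matches them.
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)

sil = silhouette_score(X, labels)        # in [-1, 1]; near 1 means well separated
ch = calinski_harabasz_score(X, labels)  # unbounded; higher means more distinct clusters
```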

1.3 Dimension reduction

1.3.1 Principal Component Analysis (PCA)

PCA is a statistical technique that transforms a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components via an orthogonal transformation.

Eigenvalues and eigenvectors

Given a square matrix A, a nonzero vector v satisfying Av = λv is an eigenvector of A with eigenvalue λ. If A is symmetric, it can be diagonalized by an orthogonal matrix.

Note: The eigenvector associated with the largest eigenvalue is called the principal eigenvector.

Algorithm

Steps for PCA:

Standardize the data so that each feature has mean 0 and standard deviation 1.

Compute the covariance matrix and obtain its eigenvalues and eigenvectors.

Select the top k eigenvectors corresponding to the largest eigenvalues.

Project the data onto the selected eigenvectors, yielding a lower‑dimensional representation with maximal variance.
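The four steps above can be sketched with NumPy; `eigh` is appropriate here because the covariance matrix is symmetric. The strongly correlated two-feature data is illustrative.

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the covariance matrix of standardized data."""
    # Step 1: standardize each feature to mean 0 and standard deviation 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix and its eigendecomposition.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    # Step 3: keep the k eigenvectors with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:k]]
    # Step 4: project the data onto those eigenvectors.
    return Z @ components, eigvals[order]

# Two strongly correlated features: the first principal component should
# carry almost all of the variance.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t + 0.1 * rng.normal(size=200), t + 0.1 * rng.normal(size=200)])
proj, eigvals = pca(X, k=1)
```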

1.3.2 Independent Component Analysis (ICA)

ICA seeks to uncover latent source signals that have been linearly mixed.

Assumptions

We assume the observed data x is generated by mixing an independent source vector s through a non-singular matrix A, i.e., x = As, where the components of s are statistically independent.

Bell and Sejnowski ICA algorithm

The algorithm finds an unmixing matrix W by maximizing the likelihood of the model. Using a chosen activation function, the log-likelihood is derived, and a stochastic gradient ascent rule updates W for each training sample.

Reference: MLEveryday, https://github.com/MLEveryday/Machine-Learning-Cheatsheets
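A minimal sketch of the per-sample gradient ascent update, using the sigmoid as the activation function. The mixing matrix A, the Laplacian (super-Gaussian) sources, and the learning-rate schedule are illustrative assumptions.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-np.clip(u, -30, 30)))

def bell_sejnowski_ica(X, lr=0.005, n_epochs=25, seed=0):
    """Stochastic gradient ascent on the ICA log-likelihood with a sigmoid
    activation; per-sample update W += lr * ((1 - 2 g(W x)) x^T + (W^T)^-1)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.eye(d)  # unmixing matrix, initialized to the identity
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            x = X[i]
            grad = np.outer(1.0 - 2.0 * sigmoid(W @ x), x) + np.linalg.inv(W.T)
            W += lr * grad
    return W

# Illustrative setup: two independent Laplacian sources mixed by a
# hypothetical non-singular matrix A, so that x = A s.
rng = np.random.default_rng(1)
S = rng.laplace(size=(2000, 2))
A = np.array([[1.0, 0.5], [0.5, 1.0]])
X = S @ A.T
W = bell_sejnowski_ica(X)
recovered = X @ W.T  # estimated sources, up to permutation and scaling
```

ICA can recover the sources only up to permutation and scaling, so the recovered signals are checked for independence (e.g. near-zero correlation) rather than compared to s directly.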

Tags: Clustering, PCA, unsupervised learning, k-means, dimensionality reduction, EM algorithm, ICA
Written by

Model Perspective

Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".
