Understanding Sample Similarity: Distance Metrics and Cluster Methods
This article explains how to quantify similarity between data samples using distance metrics such as Manhattan, Euclidean, and Chebyshev, outlines the properties these distances must satisfy, and describes common inter‑class measures like single linkage, complete linkage, centroid, group average, and sum‑of‑squares methods.
1 Sample Similarity Measure
To classify objects quantitatively, we must describe the similarity between them using numbers. Each object is represented by multiple variables, so a set of samples can be seen as points in a multidimensional space, and distance can naturally measure similarity.
Let X be the set of sample points, and d(x_i, x_j) a function satisfying:
Non‑negativity and identity of indiscernibles: d(x_i, x_j) ≥ 0, with d(x_i, x_j) = 0 if and only if x_i = x_j.
Symmetry: d(x_i, x_j) = d(x_j, x_i).
Triangle inequality: d(x_i, x_k) ≤ d(x_i, x_j) + d(x_j, x_k).
A function satisfying these three properties is called a distance. In clustering analysis for quantitative variables, the most common choice is the Minkowski distance: d_p(x_i, x_j) = (Σ_{k=1}^{m} |x_ik − x_jk|^p)^{1/p}, where m is the number of variables and p ≥ 1 is the order parameter.
When the order parameter p = 1, we obtain the Manhattan (absolute) distance; p = 2 gives the Euclidean distance; and p → ∞ yields the Chebyshev distance.
The Euclidean distance is most frequently used because it remains invariant under orthogonal rotations of the coordinate axes.
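The three special cases above can be sketched in a few lines of code. This is a minimal illustration (the function name `minkowski` and the sample points are chosen here for demonstration, not taken from the article):

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p between two sample vectors."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return diff.max()  # Chebyshev distance: the limit as p -> infinity
    return float((diff ** p).sum() ** (1.0 / p))

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, 1))       # Manhattan (absolute) distance: 7.0
print(minkowski(x, y, 2))       # Euclidean distance: 5.0
print(minkowski(x, y, np.inf))  # Chebyshev distance: 4.0
```

Note how the same pair of points yields three different distances depending on p, which is why the choice of metric matters for clustering results.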
Note: When using Minkowski distance, variables must have the same units. If they differ, standardize the data before computing distances.
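A common way to carry out the standardization mentioned above is the z‑score transform, which rescales each variable to mean 0 and standard deviation 1. A small sketch (the data values are hypothetical, chosen only to show variables in different units):

```python
import numpy as np

# Hypothetical data: rows are samples; columns are height (cm) and weight (kg)
X = np.array([[170.0, 65.0],
              [180.0, 80.0],
              [160.0, 55.0]])

# z-score standardization: subtract each column's mean, divide by its std
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# After standardization, Euclidean distances are no longer dominated
# by whichever variable happens to have the larger numeric scale
d = np.linalg.norm(Z[0] - Z[1])
```

Without this step, a variable measured in centimeters would outweigh one measured in kilograms purely because of its larger numeric range.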
2 Inter‑Class Similarity Measures
If there are two clusters A and B, several methods can measure the distance between them:
Nearest‑neighbor (single linkage): the smallest distance between any two points, one from each cluster.
Farthest‑neighbor (complete linkage): the largest distance between any two points, one from each cluster.
Centroid method: the distance between the centroids (means) of the two clusters.
Group average method: the average of all pairwise distances between points of the two clusters.
Sum‑of‑squares method: based on the sum of squared deviations within and between clusters; a larger inter‑class distance indicates better separation.
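The first four inter‑class measures can be computed directly from the matrix of pairwise distances. A minimal sketch with two tiny hypothetical clusters (the helper `pairwise_dists` and the point coordinates are illustrative, not from the article):

```python
import numpy as np
from itertools import product

def pairwise_dists(A, B):
    """All Euclidean distances between points of cluster A and cluster B."""
    return np.array([np.linalg.norm(a - b) for a, b in product(A, B)])

# Two hypothetical clusters (rows are sample points)
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

d = pairwise_dists(A, B)
single   = d.min()   # nearest neighbor (single linkage): 3.0
complete = d.max()   # farthest neighbor (complete linkage): 6.0
average  = d.mean()  # group average: 4.5
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # centroid: 4.5
```

Here the group average and centroid distances happen to coincide; in general they differ, and single linkage tends to chain clusters together while complete linkage favors compact, well‑separated groups.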
Reference: ThomsonRen GitHub https://github.com/ThomsonRen/mathmodels
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".