Understanding Sample Similarity: Distance Metrics and Cluster Methods
This article explains how to quantify similarity between data samples using distance metrics such as Manhattan, Euclidean, and Chebyshev, outlines the properties these distances must satisfy, and describes common inter‑class measures like single linkage, complete linkage, centroid, group average, and sum‑of‑squares methods.
1 Sample Similarity Measure
To classify objects quantitatively, we must describe the similarity between them using numbers. Each object is represented by multiple variables, so a set of samples can be seen as points in a multidimensional space, and distance can naturally measure similarity.
Let X be the set of sample points, and d(x_i, x_j) a function satisfying:
Non‑negativity and identity of indiscernibles: d(x_i, x_j) ≥ 0, with d(x_i, x_j) = 0 if and only if x_i = x_j.
Symmetry: d(x_i, x_j) = d(x_j, x_i).
Triangle inequality: d(x_i, x_k) ≤ d(x_i, x_j) + d(x_j, x_k).
A function satisfying these three properties is called a distance. In clustering analysis for quantitative variables, the most common choice is the Minkowski distance: d_p(x_i, x_j) = (Σ_{k=1}^{m} |x_ik − x_jk|^p)^{1/p}, where m is the number of variables and p ≥ 1 is the order parameter.
When the order parameter p = 1, we obtain the Manhattan (absolute) distance; p = 2 gives the Euclidean distance; and p → ∞ yields the Chebyshev distance.
The Euclidean distance is most frequently used because it remains invariant under orthogonal rotations of the coordinate axes.
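The three special cases above can be sketched in a few lines of code. This is a minimal illustration (the function name `minkowski` and the sample points are chosen here for demonstration, not taken from the article):

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p between two sample vectors."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return diff.max()  # Chebyshev distance: the limit as p -> infinity
    return float((diff ** p).sum() ** (1.0 / p))

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, 1))       # Manhattan (absolute) distance: 7.0
print(minkowski(x, y, 2))       # Euclidean distance: 5.0
print(minkowski(x, y, np.inf))  # Chebyshev distance: 4.0
```

Note how the same pair of points yields three different distances depending on p, which is why the choice of metric matters for clustering results.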
Note: When using Minkowski distance, variables must have the same units. If they differ, standardize the data before computing distances.
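A common way to carry out the standardization mentioned above is the z‑score transform, which rescales each variable to mean 0 and standard deviation 1. A small sketch (the data values are hypothetical, chosen only to show variables in different units):

```python
import numpy as np

# Hypothetical data: rows are samples; columns are height (cm) and weight (kg)
X = np.array([[170.0, 65.0],
              [180.0, 80.0],
              [160.0, 55.0]])

# z-score standardization: subtract each column's mean, divide by its std
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# After standardization, Euclidean distances are no longer dominated
# by whichever variable happens to have the larger numeric scale
d = np.linalg.norm(Z[0] - Z[1])
```

Without this step, a variable measured in centimeters would outweigh one measured in kilograms purely because of its larger numeric range.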
2 Inter‑Class Similarity Measures
If there are two clusters A and B, several methods can measure the distance between them:
Nearest‑neighbor (single linkage): the smallest distance between any two points, one from each cluster.
Farthest‑neighbor (complete linkage): the largest distance between any two points, one from each cluster.
Centroid method: the distance between the centroids (means) of the two clusters.
Group average method: the average of all pairwise distances between points of the two clusters.
Sum‑of‑squares method: based on the sum of squared deviations within and between clusters; a larger inter‑class distance indicates better separation.
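The first four inter‑class measures can be computed directly from the matrix of pairwise distances. A minimal sketch with two tiny hypothetical clusters (the helper `pairwise_dists` and the point coordinates are illustrative, not from the article):

```python
import numpy as np
from itertools import product

def pairwise_dists(A, B):
    """All Euclidean distances between points of cluster A and cluster B."""
    return np.array([np.linalg.norm(a - b) for a, b in product(A, B)])

# Two hypothetical clusters (rows are sample points)
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

d = pairwise_dists(A, B)
single   = d.min()   # nearest neighbor (single linkage): 3.0
complete = d.max()   # farthest neighbor (complete linkage): 6.0
average  = d.mean()  # group average: 4.5
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # centroid: 4.5
```

Here the group average and centroid distances happen to coincide; in general they differ, and single linkage tends to chain clusters together while complete linkage favors compact, well‑separated groups.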
Reference: ThomsonRen GitHub https://github.com/ThomsonRen/mathmodels
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".