An Introduction to k-means and Its Variants with Practical R Implementation
This article introduces the k‑means clustering algorithm and its major variants (k‑means++, k‑medoids, k‑medians, k‑modes, and k‑prototype), explaining their characteristics and step‑by‑step procedures, and closes with a practical R implementation with code examples for data analysis.
Overview
k‑means is a classic clustering algorithm, first published in 1967 and widely used across many fields. Over the years it has spawned a family of variants, including k‑means++, k‑medoids, k‑medians, k‑modes, k‑prototype, and their fuzzy versions, all of which play important roles in data analysis.
k‑means
Features
1. Efficient for large numeric datasets but cannot handle categorical attributes.
2. Works well when data form compact, well‑separated clouds; however it is sensitive to noise and outliers.
Steps
1. Randomly select k initial centroids, each representing a cluster.
2. Assign each point to the nearest centroid.
3. Recalculate each centroid as the arithmetic mean of points in its cluster.
4. Repeat steps 2–3 until one of the termination conditions is met: no point changes cluster, the centroids stop moving, the decrease in the within‑cluster sum of squares (WCSS) falls below a threshold, or a maximum number of iterations is reached.
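The four steps above can be sketched in a few lines of R. This is a minimal illustration, not a production implementation: the function name is made up, it assumes numeric data, and it assumes no cluster becomes empty (real implementations such as the built‑in kmeans() handle that case).

```r
# Minimal k-means (Lloyd's algorithm) sketch; `X` is a numeric matrix
# with at least two columns, `k` the number of clusters.
kmeans_sketch <- function(X, k, max_iter = 100) {
  # step 1: pick k random rows as initial centroids
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  for (i in seq_len(max_iter)) {
    # step 2: assign each point to the nearest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    assign <- max.col(-d)
    # step 3: recompute each centroid as the mean of its cluster
    # (assumes no cluster is empty)
    new_centers <- t(sapply(seq_len(k), function(j)
      colMeans(X[assign == j, , drop = FALSE])))
    # step 4: stop when the centroids no longer move
    if (all(abs(new_centers - centers) < 1e-8)) break
    centers <- new_centers
  }
  list(cluster = assign, centers = centers)
}
```

In practice you would call the built‑in kmeans() (as in the R section below), which adds better initialization, empty‑cluster handling, and multiple restarts.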
k‑means++
Features
1. Produces more accurate and stable clustering results than standard k‑means.
2. Still sensitive to noise and outliers.
3. Applicable only to numeric data.
Steps
1. Randomly choose one point as the first centroid.
2. For each remaining point, compute D(x), the distance to the nearest already‑chosen centroid; select a new centroid with probability proportional to D(x)².
3. Repeat step 2 until k centroids are chosen, then run the standard k‑means algorithm.
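The seeding procedure above can be sketched in R as follows. This is illustrative only: kmeanspp_seeds is a made‑up helper name, and the selected rows would then be passed to kmeans() as the initial centers.

```r
# k-means++ seeding sketch; `X` is a numeric matrix, `k` the number
# of centroids to choose. Returns the chosen rows of X.
kmeanspp_seeds <- function(X, k) {
  n <- nrow(X)
  idx <- sample(n, 1)  # step 1: first centroid uniformly at random
  for (i in 2:k) {
    # D(x)^2: squared distance from each point to its nearest chosen centroid
    d2 <- apply(X, 1, function(x)
      min(colSums((t(X[idx, , drop = FALSE]) - x)^2)))
    # step 2: sample the next centroid with probability proportional to D(x)^2
    idx <- c(idx, sample(n, 1, prob = d2))
  }
  X[idx, , drop = FALSE]
}
```

Points already chosen have D(x)² = 0 and so are never re‑selected; far‑away points are proportionally more likely to become the next centroid. The result can be fed to kmeans(X, centers = kmeanspp_seeds(X, k)).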
k‑medoids
Features
1. More accurate than k‑means when noise and outliers are present.
2. Computationally expensive (O(n²)) and slower on large datasets.
3. Most commonly applied to numeric data, although the medoid idea works with any precomputed dissimilarity measure.
Steps
1. Randomly select k points as initial medoids.
2. Assign each remaining point to the nearest medoid.
3. For each cluster, replace the current medoid with a non‑medoid point if doing so reduces the total sum of absolute differences (SAD); otherwise keep the current medoid.
4. Repeat steps 2–3 until the maximum number of iterations is reached or SAD no longer changes.
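The cluster package loaded in the practical section below ships pam(), the classic Partitioning Around Medoids implementation of these steps. A minimal usage sketch (the two‑blob data here is illustrative):

```r
library(cluster)  # provides pam(), the classic k-medoids (PAM) implementation

set.seed(123)
# two well-separated synthetic clusters, 20 points each
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))

fit <- pam(X, k = 2)  # medoid-based clustering; builds O(n^2) dissimilarities
fit$medoids           # actual data points chosen as cluster centers
fit$clustering        # cluster assignment for each row
```

Because the medoids are real observations rather than averaged coordinates, a single extreme outlier shifts them far less than it would shift a k‑means centroid.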
k‑medians
Features
1. Performs better than k‑means on data with noise and outliers.
2. Limited to continuous (numeric) data.
Steps
1. Randomly select k initial centroids.
2. Assign each point to the nearest centroid using Manhattan distance.
3. Recalculate each centroid as the median of points in its cluster (dimension‑wise).
4. Repeat steps 2–3 until centroids no longer change.
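A minimal R sketch of the Manhattan‑assignment and median‑update steps. As with the k‑means sketch, the function name is invented and the code assumes numeric data and non‑empty clusters.

```r
# k-medians sketch; `X` is a numeric matrix with at least two columns.
kmedians_sketch <- function(X, k, max_iter = 100) {
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  for (i in seq_len(max_iter)) {
    # step 2: assign with Manhattan (L1) distance
    d <- apply(centers, 1, function(cen) rowSums(abs(sweep(X, 2, cen))))
    assign <- max.col(-d)
    # step 3: dimension-wise median per cluster (assumes no empty cluster)
    new_centers <- t(sapply(seq_len(k), function(j)
      apply(X[assign == j, , drop = FALSE], 2, median)))
    # step 4: stop when the centroids no longer change
    if (all(new_centers == centers)) break
    centers <- new_centers
  }
  list(cluster = assign, centers = centers)
}
```

The median update is what gives k‑medians its robustness: a single extreme value moves the mean but leaves the median nearly untouched.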
k‑modes
Features
1. Clusters categorical (discrete) data while retaining the speed of k‑means.
2. Uses the most frequent category (mode) for each attribute as the cluster prototype, which may ignore low‑frequency attribute values.
Steps
1. Randomly select k points as initial modes.
2. Compute the Hamming distance between each point and each mode: per attribute, 0 if the values match and 1 if they differ, summed over all attributes.
3. Assign each point to the nearest mode, then recompute each mode as the most frequent attribute values within its cluster.
4. Repeat steps 2–3 until modes stabilize or a maximum iteration count is reached.
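A compact R sketch of these steps on a character matrix. The function and the init argument are illustrative; for real work, the kmodes() function in the klaR package provides a full implementation.

```r
# Hamming distance between two categorical vectors: count of differing attributes
hamming <- function(x, mode_row) sum(x != mode_row)

# k-modes sketch; `X` is a character matrix (rows = observations, >= 2 columns).
kmodes_sketch <- function(X, k, max_iter = 100, init = sample(nrow(X), k)) {
  modes <- X[init, , drop = FALSE]
  for (i in seq_len(max_iter)) {
    # step 2: Hamming distance from every point to every mode (k x n matrix)
    d <- apply(X, 1, function(x) apply(modes, 1, function(m) hamming(x, m)))
    # step 3a: assign each point to its nearest mode
    assign <- max.col(-t(d))
    # step 3b: new mode = most frequent value of each attribute in the cluster
    new_modes <- t(sapply(seq_len(k), function(j)
      apply(X[assign == j, , drop = FALSE], 2,
            function(col) names(which.max(table(col))))))
    # step 4: stop when the modes stabilize
    if (all(new_modes == modes)) break
    modes <- new_modes
  }
  list(cluster = assign, modes = modes)
}
```

Note the trade‑off mentioned above: because each mode keeps only the single most frequent value per attribute, rare but meaningful categories never appear in a cluster prototype.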
k‑prototype
Combines k‑means (for numeric attributes) and k‑modes (for categorical attributes) to handle mixed‑type datasets. Distance is computed as D = P1 + a·P2, where P1 is Euclidean distance for numeric attributes, P2 is Hamming distance for categorical attributes, and a is a weight factor.
Steps
1. Randomly select k points as initial prototypes and set the weight factor a.
2. Compute distances: Euclidean for numeric attributes, Hamming for categorical attributes; assign each point to the nearest prototype.
3. Update prototypes: numeric attributes become the arithmetic mean, categorical attributes become the mode.
4. Repeat steps 2–3 until prototypes stop changing or the maximum number of iterations is reached.
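The combined distance D = P1 + a·P2 at the heart of k‑prototype can be written directly. This is a sketch: the function name and the numeric/categorical split are illustrative, and the kproto() function in the clustMixType package offers a complete implementation of the full algorithm.

```r
# Mixed-type distance from one observation to one prototype.
# x_num / p_num: numeric attribute vectors; x_cat / p_cat: categorical
# attribute vectors; `a` weights the categorical part against the numeric part.
proto_dist <- function(x_num, x_cat, p_num, p_cat, a = 1) {
  P1 <- sqrt(sum((x_num - p_num)^2))  # Euclidean part (numeric attributes)
  P2 <- sum(x_cat != p_cat)           # Hamming part (categorical attributes)
  P1 + a * P2
}

# example: numeric distance 5 (a 3-4-5 triangle), one categorical mismatch,
# weight a = 2, so D = 5 + 2 * 1 = 7
proto_dist(c(0, 0), "red", c(3, 4), "blue", a = 2)
```

The weight a matters because P1 and P2 live on different scales: with standardized numeric attributes, a small a lets numeric structure dominate, while a large a emphasizes categorical agreement.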
R Practical Implementation
Assuming a cleaned, all‑numeric dataset named df (standardize the columns first if their scales differ, since k‑means is distance‑based):
library(cluster) # clustering algorithms
library(tidyverse) # data manipulation
library(factoextra) # clustering and visualization
Determine a suitable number of clusters (k) using the Elbow method or Silhouette coefficient:
set.seed(123)
fviz_nbclust(df, kmeans, method = "wss")
fviz_nbclust(df, kmeans, method = "silhouette")
Fit the k‑means model (e.g., k = 4):
set.seed(123)
result <- kmeans(df, centers = 4, iter.max = 15, nstart = 25)
Inspect the result structure:
str(result)
Access specific components, e.g., cluster assignments:
result$cluster
Visualize the clustering:
fviz_cluster(result, data = df)
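To profile the clusters beyond the plot, the labels can be joined back to the data with the tidyverse tools already loaded. The dataset below is an illustrative stand‑in for df:

```r
library(dplyr)

# illustrative stand-in for the cleaned dataset `df`: three scaled
# numeric columns from the built-in mtcars data
df <- as.data.frame(scale(mtcars[, c("mpg", "hp", "wt")]))

set.seed(123)
result <- kmeans(df, centers = 4, nstart = 25)

# attach the cluster label and compute per-cluster feature means
df %>%
  mutate(cluster = result$cluster) %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean))
```

The per‑cluster means make it easy to name the clusters (e.g., "heavy, high‑horsepower, low‑mpg cars") rather than referring to them only by number.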
NetEase LeiHuo UX Big Data Technology
The NetEase LeiHuo UX Data Team builds practical data‑modeling solutions for gaming, offering analysis and insights that enhance user experience and enable precise marketing for development and operations. This account shares industry trends and data‑science knowledge with students and data professionals, aiming to grow the ecosystem together with fellow enthusiasts.