An Introduction to k-means and Its Variants with Practical R Implementation
This article introduces the k‑means clustering algorithm and its major variants (k‑means++, k‑medoids, k‑medians, k‑modes, and k‑prototype), explaining their characteristics and step‑by‑step procedures, and closes with a practical R implementation with code examples for data analysis.
Overview
k‑means is a classic clustering algorithm, first published in 1967 and widely used across many fields. Over the years it has spawned a family of variants, including k‑means++, k‑medoids, k‑medians, k‑modes, k‑prototype, and their fuzzy versions, all of which play important roles in data analysis.
k‑means
Features
1. Efficient for large numeric datasets but cannot handle categorical attributes.
2. Works well when data form compact, well‑separated clouds; however it is sensitive to noise and outliers.
Steps
1. Randomly select k initial centroids, each representing a cluster.
2. Assign each point to the nearest centroid.
3. Recalculate each centroid as the arithmetic mean of points in its cluster.
4. Repeat steps 2–3 until one of the termination conditions is met: no point changes cluster, the centroids stop moving, the decrease in the within‑cluster sum of squares (WCSS) falls below a threshold, or a maximum number of iterations is reached.
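The four steps above can be sketched in a few lines of R. This is a minimal illustration, not a production implementation: the function name is made up, it assumes numeric data, and it assumes no cluster becomes empty (real implementations such as the built‑in kmeans() handle that case).

```r
# Minimal k-means (Lloyd's algorithm) sketch; `X` is a numeric matrix
# with at least two columns, `k` the number of clusters.
kmeans_sketch <- function(X, k, max_iter = 100) {
  # step 1: pick k random rows as initial centroids
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  for (i in seq_len(max_iter)) {
    # step 2: assign each point to the nearest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    assign <- max.col(-d)
    # step 3: recompute each centroid as the mean of its cluster
    # (assumes no cluster is empty)
    new_centers <- t(sapply(seq_len(k), function(j)
      colMeans(X[assign == j, , drop = FALSE])))
    # step 4: stop when the centroids no longer move
    if (all(abs(new_centers - centers) < 1e-8)) break
    centers <- new_centers
  }
  list(cluster = assign, centers = centers)
}
```

In practice you would call the built‑in kmeans() (as in the R section below), which adds better initialization, empty‑cluster handling, and multiple restarts.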
k‑means++
Features
1. Produces more accurate and stable clustering results than standard k‑means.
2. Still sensitive to noise and outliers.
3. Applicable only to numeric data.
Steps
1. Randomly choose one point as the first centroid.
2. For each remaining point, compute D(x), the distance to the nearest already‑chosen centroid; select a new centroid with probability proportional to D(x)².
3. Repeat step 2 until k centroids are chosen, then run the standard k‑means algorithm.
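The seeding procedure above can be sketched in R as follows. This is illustrative only: kmeanspp_seeds is a made‑up helper name, and the selected rows would then be passed to kmeans() as the initial centers.

```r
# k-means++ seeding sketch; `X` is a numeric matrix, `k` the number
# of centroids to choose. Returns the chosen rows of X.
kmeanspp_seeds <- function(X, k) {
  n <- nrow(X)
  idx <- sample(n, 1)  # step 1: first centroid uniformly at random
  for (i in 2:k) {
    # D(x)^2: squared distance from each point to its nearest chosen centroid
    d2 <- apply(X, 1, function(x)
      min(colSums((t(X[idx, , drop = FALSE]) - x)^2)))
    # step 2: sample the next centroid with probability proportional to D(x)^2
    idx <- c(idx, sample(n, 1, prob = d2))
  }
  X[idx, , drop = FALSE]
}
```

Points already chosen have D(x)² = 0 and so are never re‑selected; far‑away points are proportionally more likely to become the next centroid. The result can be fed to kmeans(X, centers = kmeanspp_seeds(X, k)).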
k‑medoids
Features
1. More accurate than k‑means when noise and outliers are present.
2. Computationally expensive (O(n²)) and slower on large datasets.
3. Most commonly applied to numeric data, although the medoid idea works with any precomputed dissimilarity measure.
Steps
1. Randomly select k points as initial medoids.
2. Assign each remaining point to the nearest medoid.
3. For each cluster, replace the current medoid with a non‑medoid point if doing so reduces the total sum of absolute differences (SAD); otherwise keep the current medoid.
4. Repeat steps 2–3 until the maximum number of iterations is reached or SAD no longer changes.
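The cluster package loaded in the practical section below ships pam(), the classic Partitioning Around Medoids implementation of these steps. A minimal usage sketch (the two‑blob data here is illustrative):

```r
library(cluster)  # provides pam(), the classic k-medoids (PAM) implementation

set.seed(123)
# two well-separated synthetic clusters, 20 points each
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))

fit <- pam(X, k = 2)  # medoid-based clustering; builds O(n^2) dissimilarities
fit$medoids           # actual data points chosen as cluster centers
fit$clustering        # cluster assignment for each row
```

Because the medoids are real observations rather than averaged coordinates, a single extreme outlier shifts them far less than it would shift a k‑means centroid.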
k‑medians
Features
1. Performs better than k‑means on data with noise and outliers.
2. Limited to continuous (numeric) data.
Steps
1. Randomly select k initial centroids.
2. Assign each point to the nearest centroid using Manhattan distance.
3. Recalculate each centroid as the median of points in its cluster (dimension‑wise).
4. Repeat steps 2–3 until centroids no longer change.
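A minimal R sketch of the Manhattan‑assignment and median‑update steps. As with the k‑means sketch, the function name is invented and the code assumes numeric data and non‑empty clusters.

```r
# k-medians sketch; `X` is a numeric matrix with at least two columns.
kmedians_sketch <- function(X, k, max_iter = 100) {
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  for (i in seq_len(max_iter)) {
    # step 2: assign with Manhattan (L1) distance
    d <- apply(centers, 1, function(cen) rowSums(abs(sweep(X, 2, cen))))
    assign <- max.col(-d)
    # step 3: dimension-wise median per cluster (assumes no empty cluster)
    new_centers <- t(sapply(seq_len(k), function(j)
      apply(X[assign == j, , drop = FALSE], 2, median)))
    # step 4: stop when the centroids no longer change
    if (all(new_centers == centers)) break
    centers <- new_centers
  }
  list(cluster = assign, centers = centers)
}
```

The median update is what gives k‑medians its robustness: a single extreme value moves the mean but leaves the median nearly untouched.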
k‑modes
Features
1. Clusters categorical (discrete) data while retaining the speed of k‑means.
2. Uses the most frequent category (mode) for each attribute as the cluster prototype, which may ignore low‑frequency attribute values.
Steps
1. Randomly select k points as initial modes.
2. Compute the Hamming distance between each point and each mode: per attribute, 0 if the values match and 1 if they differ, summed over all attributes.
3. Assign each point to the nearest mode, then recompute each mode as the most frequent attribute values within its cluster.
4. Repeat steps 2–3 until modes stabilize or a maximum iteration count is reached.
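A compact R sketch of these steps on a character matrix. The function and the init argument are illustrative; for real work, the kmodes() function in the klaR package provides a full implementation.

```r
# Hamming distance between two categorical vectors: count of differing attributes
hamming <- function(x, mode_row) sum(x != mode_row)

# k-modes sketch; `X` is a character matrix (rows = observations, >= 2 columns).
kmodes_sketch <- function(X, k, max_iter = 100, init = sample(nrow(X), k)) {
  modes <- X[init, , drop = FALSE]
  for (i in seq_len(max_iter)) {
    # step 2: Hamming distance from every point to every mode (k x n matrix)
    d <- apply(X, 1, function(x) apply(modes, 1, function(m) hamming(x, m)))
    # step 3a: assign each point to its nearest mode
    assign <- max.col(-t(d))
    # step 3b: new mode = most frequent value of each attribute in the cluster
    new_modes <- t(sapply(seq_len(k), function(j)
      apply(X[assign == j, , drop = FALSE], 2,
            function(col) names(which.max(table(col))))))
    # step 4: stop when the modes stabilize
    if (all(new_modes == modes)) break
    modes <- new_modes
  }
  list(cluster = assign, modes = modes)
}
```

Note the trade‑off mentioned above: because each mode keeps only the single most frequent value per attribute, rare but meaningful categories never appear in a cluster prototype.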
k‑prototype
Combines k‑means (for numeric attributes) and k‑modes (for categorical attributes) to handle mixed‑type datasets. Distance is computed as D = P1 + a·P2, where P1 is Euclidean distance for numeric attributes, P2 is Hamming distance for categorical attributes, and a is a weight factor.
Steps
1. Randomly select k points as initial prototypes and set the weight factor a.
2. Compute distances: Euclidean for numeric attributes, Hamming for categorical attributes; assign each point to the nearest prototype.
3. Update prototypes: numeric attributes become the arithmetic mean, categorical attributes become the mode.
4. Repeat steps 2–3 until prototypes stop changing or the maximum number of iterations is reached.
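The combined distance D = P1 + a·P2 at the heart of k‑prototype can be written directly. This is a sketch: the function name and the numeric/categorical split are illustrative, and the kproto() function in the clustMixType package offers a complete implementation of the full algorithm.

```r
# Mixed-type distance from one observation to one prototype.
# x_num / p_num: numeric attribute vectors; x_cat / p_cat: categorical
# attribute vectors; `a` weights the categorical part against the numeric part.
proto_dist <- function(x_num, x_cat, p_num, p_cat, a = 1) {
  P1 <- sqrt(sum((x_num - p_num)^2))  # Euclidean part (numeric attributes)
  P2 <- sum(x_cat != p_cat)           # Hamming part (categorical attributes)
  P1 + a * P2
}

# example: numeric distance 5 (a 3-4-5 triangle), one categorical mismatch,
# weight a = 2, so D = 5 + 2 * 1 = 7
proto_dist(c(0, 0), "red", c(3, 4), "blue", a = 2)
```

The weight a matters because P1 and P2 live on different scales: with standardized numeric attributes, a small a lets numeric structure dominate, while a large a emphasizes categorical agreement.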
R Practical Implementation
Assuming a cleaned, all‑numeric dataset named df (standardize the columns first if their scales differ, since k‑means is distance‑based):
library(cluster) # clustering algorithms
library(tidyverse) # data manipulation
library(factoextra) # clustering and visualization
Determine a suitable number of clusters (k) using the Elbow method or Silhouette coefficient:
set.seed(123)
fviz_nbclust(df, kmeans, method = "wss")
fviz_nbclust(df, kmeans, method = "silhouette")
Fit the k‑means model (e.g., k = 4):
set.seed(123)
result <- kmeans(df, centers = 4, iter.max = 15, nstart = 25)
Inspect the result structure:
str(result)
Access specific components, e.g., cluster assignments:
result$cluster
Visualize the clustering:
fviz_cluster(result, data = df)
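To profile the clusters beyond the plot, the labels can be joined back to the data with the tidyverse tools already loaded. The dataset below is an illustrative stand‑in for df:

```r
library(dplyr)

# illustrative stand-in for the cleaned dataset `df`: three scaled
# numeric columns from the built-in mtcars data
df <- as.data.frame(scale(mtcars[, c("mpg", "hp", "wt")]))

set.seed(123)
result <- kmeans(df, centers = 4, nstart = 25)

# attach the cluster label and compute per-cluster feature means
df %>%
  mutate(cluster = result$cluster) %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean))
```

The per‑cluster means make it easy to name the clusters (e.g., "heavy, high‑horsepower, low‑mpg cars") rather than referring to them only by number.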
NetEase LeiHuo UX Big Data Technology
The NetEase LeiHuo UX Data Team builds practical data‑modeling solutions for gaming, offering analysis and insights that enhance user experience and enable precise marketing for development and operations. This account shares industry trends and data‑science knowledge with students and data professionals, aiming to grow the ecosystem together with fellow enthusiasts.