Efficient Python Implementation of K-Means Clustering with Performance Comparison
This article introduces a concise Python implementation of the k‑means clustering algorithm, compares its speed with a typical implementation, provides full source code, a data‑generation helper, and demonstrates the results on a synthetic dataset of 10,000 points.
The author explains that machine learning can be divided into four areas—classification, clustering, regression, and dimensionality reduction—and that k‑means, while simple, is an excellent introductory exercise.
A custom k‑means implementation written in fewer than 20 lines processes 10,000 samples in about 20 ms for ten iterations, which is over a hundred times faster than many popular versions.
import numpy as np
def kmeans_xufive(ds, k):
"""k-means聚类算法
k - 指定分簇数量
ds - ndarray(m, n),m个样本的数据集,每个样本n个属性值
"""
m, n = ds.shape # m:样本数量,n:每个样本的属性值个数
result = np.empty(m, dtype=np.int) # m个样本的聚类结果
cores = np.empty((k, n)) # k个质心
cores = ds[np.random.choice(np.arange(m), k, replace=False)] # 随机选择k个样本作为质心
while True: # 迭代计算
d = np.square(np.repeat(ds, k, axis=0).reshape(m, k, n) - cores)
distance = np.sqrt(np.sum(d, axis=2)) # ndarray(m, k),每个样本距离k个质心的距离
index_min = np.argmin(distance, axis=1) # 每个样本最近的质心索引
if (index_min == result).all(): # 如果聚类没有改变
return result, cores # 返回结果和质心
result[:] = index_min # 重新分类
for i in range(k): # 遍历质心
items = ds[result == i]
cores[i] = np.mean(items, axis=0) # 更新质心位置The article also presents a more conventional k‑means version (kmeans_open) that includes helper functions for loading data, computing Euclidean distance, initializing random centroids, and the full clustering loop.
import numpy as np
def loadDataSet(fileName):
data = np.loadtxt(fileName, delimiter='\t')
return data
def distEclud(x, y):
return np.sqrt(np.sum((x - y) ** 2))
def randCent(dataSet, k):
m, n = dataSet.shape
centroids = np.zeros((k, n))
for i in range(k):
index = int(np.random.uniform(0, m))
centroids[i, :] = dataSet[index, :]
return centroids
def kmeans_open(dataSet, k):
m = np.shape(dataSet)[0]
clusterAssment = np.mat(np.zeros((m, 2)))
clusterChange = True
centroids = randCent(dataSet, k)
while clusterChange:
clusterChange = False
for i in range(m):
minDist = 1e5
minIndex = -1
for j in range(k):
distance = distEclud(centroids[j, :], dataSet[i, :])
if distance < minDist:
minDist = distance
minIndex = j
if clusterAssment[i, 0] != minIndex:
clusterChange = True
clusterAssment[i, :] = minIndex, minDist ** 2
for j in range(k):
pointsInCluster = dataSet[np.nonzero(clusterAssment[:, 0].A == j)[0]]
centroids[j, :] = np.mean(pointsInCluster, axis=0)
return clusterAssment.A[:, 0], centroidsA utility function create_data_set generates synthetic data for testing, allowing the user to specify multiple centroids and the number of points around each.
def create_data_set(*cores):
"""生成k-means聚类测试用数据集"""
ds = []
for x0, y0, z0 in cores:
x = np.random.normal(x0, 0.1 + np.random.random() / 3, z0)
y = np.random.normal(y0, 0.1 + np.random.random() / 3, z0)
ds.append(np.stack((x, y), axis=1))
return np.vstack(ds)The testing script creates a dataset with four clusters (2,500 points each), runs both implementations, measures execution time, and visualizes the clustering results with Matplotlib.
import time
import matplotlib.pyplot as plt
k = 4
ds = create_data_set((0,0,2500), (0,2,2500), (2,0,2500), (2,2,2500))
# Test custom implementation
t0 = time.time()
result, cores = kmeans_xufive(ds, k)
t = time.time() - t0
plt.scatter(ds[:,0], ds[:,1], s=1, c=result.astype(np.int))
plt.scatter(cores[:,0], cores[:,1], marker='x', c=np.arange(k))
plt.show()
print('custom kmeans_xufive: %.3f seconds' % t)
# Test conventional implementation
t0 = time.time()
result, cores = kmeans_open(ds, k)
t = time.time() - t0
plt.scatter(ds[:,0], ds[:,1], s=1, c=result.astype(np.int))
plt.scatter(cores[:,0], cores[:,1], marker='x', c=np.arange(k))
plt.show()
print('conventional kmeans_open: %.3f seconds' % t)Running the tests shows the custom version completes in roughly 0.016 seconds, while the conventional version takes about 4 seconds for the same 10,000‑point dataset, demonstrating a substantial speed advantage.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
