Artificial Intelligence 8 min read

Efficient Python Implementation of K-Means Clustering with Performance Comparison

This article introduces a concise Python implementation of the k‑means clustering algorithm, compares its speed with a typical implementation, provides full source code, a data‑generation helper, and demonstrates the results on a synthetic dataset of 10,000 points.

Python Programming Learning Circle

Aug 14, 2021

Efficient Python Implementation of K-Means Clustering with Performance Comparison

The author explains that machine learning can be divided into four areas—classification, clustering, regression, and dimensionality reduction—and that k‑means, while simple, is an excellent introductory exercise.

A custom k‑means implementation written in fewer than 20 lines processes 10,000 samples in about 20 ms for ten iterations, which is over a hundred times faster than many popular versions.

import numpy as np

def kmeans_xufive(ds, k):
    """k-means聚类算法

    k       - 指定分簇数量
    ds      - ndarray(m, n)，m个样本的数据集，每个样本n个属性值
    """
    m, n = ds.shape  # m：样本数量，n：每个样本的属性值个数
    result = np.empty(m, dtype=np.int)  # m个样本的聚类结果
    cores = np.empty((k, n))  # k个质心
    cores = ds[np.random.choice(np.arange(m), k, replace=False)]  # 随机选择k个样本作为质心

    while True:  # 迭代计算
        d = np.square(np.repeat(ds, k, axis=0).reshape(m, k, n) - cores)
        distance = np.sqrt(np.sum(d, axis=2))  # ndarray(m, k)，每个样本距离k个质心的距离
        index_min = np.argmin(distance, axis=1)  # 每个样本最近的质心索引

        if (index_min == result).all():  # 如果聚类没有改变
            return result, cores  # 返回结果和质心

        result[:] = index_min  # 重新分类
        for i in range(k):  # 遍历质心
            items = ds[result == i]
            cores[i] = np.mean(items, axis=0)  # 更新质心位置

The article also presents a more conventional k‑means version (kmeans_open) that includes helper functions for loading data, computing Euclidean distance, initializing random centroids, and the full clustering loop.

import numpy as np

def loadDataSet(fileName):
    data = np.loadtxt(fileName, delimiter='\t')
    return data

def distEclud(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def randCent(dataSet, k):
    m, n = dataSet.shape
    centroids = np.zeros((k, n))
    for i in range(k):
        index = int(np.random.uniform(0, m))
        centroids[i, :] = dataSet[index, :]
    return centroids

def kmeans_open(dataSet, k):
    m = np.shape(dataSet)[0]
    clusterAssment = np.mat(np.zeros((m, 2)))
    clusterChange = True
    centroids = randCent(dataSet, k)
    while clusterChange:
        clusterChange = False
        for i in range(m):
            minDist = 1e5
            minIndex = -1
            for j in range(k):
                distance = distEclud(centroids[j, :], dataSet[i, :])
                if distance < minDist:
                    minDist = distance
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChange = True
                clusterAssment[i, :] = minIndex, minDist ** 2
        for j in range(k):
            pointsInCluster = dataSet[np.nonzero(clusterAssment[:, 0].A == j)[0]]
            centroids[j, :] = np.mean(pointsInCluster, axis=0)
    return clusterAssment.A[:, 0], centroids

A utility function create_data_set generates synthetic data for testing, allowing the user to specify multiple centroids and the number of points around each.

def create_data_set(*cores):
    """生成k-means聚类测试用数据集"""
    ds = []
    for x0, y0, z0 in cores:
        x = np.random.normal(x0, 0.1 + np.random.random() / 3, z0)
        y = np.random.normal(y0, 0.1 + np.random.random() / 3, z0)
        ds.append(np.stack((x, y), axis=1))
    return np.vstack(ds)

The testing script creates a dataset with four clusters (2,500 points each), runs both implementations, measures execution time, and visualizes the clustering results with Matplotlib.

import time
import matplotlib.pyplot as plt

k = 4
ds = create_data_set((0,0,2500), (0,2,2500), (2,0,2500), (2,2,2500))

# Test custom implementation
t0 = time.time()
result, cores = kmeans_xufive(ds, k)
t = time.time() - t0
plt.scatter(ds[:,0], ds[:,1], s=1, c=result.astype(np.int))
plt.scatter(cores[:,0], cores[:,1], marker='x', c=np.arange(k))
plt.show()
print('custom kmeans_xufive: %.3f seconds' % t)

# Test conventional implementation
t0 = time.time()
result, cores = kmeans_open(ds, k)
t = time.time() - t0
plt.scatter(ds[:,0], ds[:,1], s=1, c=result.astype(np.int))
plt.scatter(cores[:,0], cores[:,1], marker='x', c=np.arange(k))
plt.show()
print('conventional kmeans_open: %.3f seconds' % t)

Running the tests shows the custom version completes in roughly 0.016 seconds, while the conventional version takes about 4 seconds for the same 10,000‑point dataset, demonstrating a substantial speed advantage.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Performance machine learning K-Means

Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.