Flash-KMeans: Fast, Memory-Efficient Exact K-Means for Billion-Scale Clustering on a Single GPU

Flash‑KMeans is a newly proposed framework that re‑designs exact K‑Means for GPUs by eliminating distance‑matrix materialization, using FlashAssign’s online argmin and Sort‑Inverse Update to cut memory bandwidth and atomic‑write contention, achieving up to 12.5× speedup and dramatically lower VRAM usage on billion‑point datasets.

DeepHub IMBA
DeepHub IMBA
DeepHub IMBA
Flash-KMeans: Fast, Memory-Efficient Exact K-Means for Billion-Scale Clustering on a Single GPU

Motivation and Background

In modern AI, large‑scale retrieval and retrieval‑augmented generation pipelines rely on clustering to organise high‑dimensional vector spaces. Traditional K‑Means implementations, such as those in FAISS or scikit‑learn, materialise an N×K distance matrix, causing prohibitive memory traffic and VRAM consumption as the number of points N and clusters K grow.

Parametric vs. Non‑Parametric Methods

Parametric models assume a fixed functional form (e.g., linear regression) with a constant number of parameters.

Non‑parametric methods like K‑Nearest Neighbours (KNN) treat every data point as a parameter, allowing model capacity to grow with the dataset.

KNN Mechanics

KNN operates by computing the distance between a query point and every point in the dataset, sorting the distances, and selecting the K nearest neighbours. The algorithmic steps are:

Distance computation (usually Euclidean or cosine).

Sorting distances in ascending order.

Selecting the K smallest distances.

Voting (for classification) or assigning the point to the nearest cluster (for clustering).

Time and Memory Complexity

For N data points of dimension D, naïve KNN requires O(N·D) distance calculations per query and stores an intermediate distance matrix, leading to O(N·K) memory usage on the GPU. As N and D increase, VRAM quickly exhausts, causing crashes or severe performance drops.

Memory Allocation and Speed Bottlenecks

The standard GPU pipeline materialises a large distance matrix in global memory (VRAM), which is far slower than on‑chip SRAM. This creates an I/O‑bound situation where compute units idle while waiting for data transfers. Additionally, during the update phase, many threads attempt to write to the same centroid, causing atomic‑write contention that serialises updates and throttles throughput.

System‑level constraints include:

VRAM limits (e.g., 80 GB on an NVIDIA H100); exceeding this forces fallback to host RAM, incurring orders‑of‑magnitude slower speeds.

Real‑time latency requirements in production systems (recommendation engines, fraud detection) demand millisecond‑scale clustering.

Why KNN Must Be Fast

In high‑frequency applications, the O(N·D) cost becomes a bottleneck. Efficient spatial partitioning (Voronoi diagrams and Delaunay triangulation) can prune the search space, but the underlying KNN must still be accelerated to reap the benefits.

Core Retrieval: HNSW

State‑of‑the‑art vector databases use Hierarchical Navigable Small Worlds (HNSW) graphs for approximate nearest‑neighbour search. HNSW builds a multi‑level graph where upper layers have sparse long‑range connections for rapid traversal, and lower layers provide fine‑grained density. Slow KNN kernels degrade the entire graph traversal.

Algorithm‑System Co‑Design in Flash‑KMeans

FlashAssign – Online Argmin without Materialisation

FlashAssign fuses distance computation and reduction into a single streaming pass, never constructing the full distance matrix. For each data point x_i, the kernel keeps two registers: the current minimum distance m_i and the corresponding centroid index a_i. Tiles of centroids are scanned; each tile computes local distances, finds the tile‑wise minimum, and updates the registers via an online argmin logic. All intermediate results stay in registers, avoiding HBM traffic.

Latency hiding is achieved through two‑dimensional tiling and double‑buffering: while the kernel processes tile t, the next tile t+1 is prefetched asynchronously from HBM into a second on‑chip buffer, overlapping computation with memory transfer.

The streaming loop proceeds as:

Initialise m_i = +∞ and pre‑compute norms.

Sequentially scan each centroid tile.

Compute local distances, obtain the tile‑wise minimum, and update m_i, a_i via online argmin.

After all tiles, write the final assignments a to HBM.

This reduces the I/O complexity from O(N·K) to O(N·d + K·d), where d is the feature dimension.

Sort‑Inverse Update – Low‑Contention Centroid Aggregation

Standard GPU updates use a scatter pattern: each token writes its contribution to the centroid, causing heavy atomic contention. Flash‑KMeans instead performs a gather‑style aggregation:

Apply argsort(a) to obtain a sorted index array sorted_idx, grouping identical centroid IDs into contiguous segments.

Within each segment, accumulate sums and counts entirely in registers or shared memory.

Emit a single atomic_add per segment to HBM, drastically reducing the number of atomic operations from O(N·d) to O((K + ⌈N/B_N⌉)·d), where B_N is the block size.

The segment‑wise reduction eliminates write‑contention and speeds up the reduction phase.

System‑Level Optimisations

Flash‑KMeans streams data that exceed GPU VRAM by chunking the dataset and using CUDA streams with double‑buffering, keeping PCIe transfers and kernel execution overlapped. A cache‑aware compilation heuristic selects kernel launch parameters based on L1/L2 cache sizes, delivering near‑optimal performance without exhaustive auto‑tuning; first‑run configuration time drops from >300 s to <3 s.

Experimental Results

Benchmarks were run on an NVIDIA H200 (80 GB) GPU, comparing Flash‑KMeans against FAISS (the de‑facto GPU clustering baseline) and fastkmeans (a recent optimisation). Three workload categories were evaluated:

Large N, Large K (memory‑intensive): N = 1 M, K = 64 K. Standard PyTorch crashes due to OOM; Flash‑KMeans outperforms fastkmeans by >5.4×.

Large N, Small K (compute‑intensive): N = 8 M, K small. End‑to‑end latency reduced by 94.4 % (17.9× speedup).

Small N, Small K (framework‑overhead): Batch size = 32. Flash‑KMeans still achieves 15.3× acceleration.

Kernel‑level breakdown:

FlashAssign (assignment phase) on N = 1 M, K = 8192: execution time drops from 122.5 ms to 5.8 ms (21.2× faster).

Sort‑Inverse Update (reduction phase) on N = 33 M: achieves 6.3× speedup by replacing high‑contention scatter with segment‑wise reduction.

Out‑of‑core scaling to N = 10⁹, K = 32768 shows Flash‑KMeans completing one iteration in 41.4 s versus 261.8 s for fastkmeans (6.3–10.5× faster) while staying within VRAM limits.

The cache‑aware heuristic finds a near‑optimal kernel configuration in 2.5 s, a 175× reduction in first‑run latency, with less than 0.3 % performance loss compared to exhaustive search.

Importantly, all speedups preserve exact K‑Means convergence; no approximation (e.g., product quantisation) is introduced.

Conclusion

Flash‑KMeans demonstrates that precise K‑Means clustering can be re‑engineered to respect the physical hierarchy of modern GPUs, turning the algorithm from an I/O‑bound process into a compute‑bound one. By eliminating distance‑matrix materialisation (FlashAssign) and replacing scattered atomic updates with sorted‑inverse aggregation, the framework achieves up to 12.5× overall acceleration, dramatically lower memory footprints, and robust out‑of‑core performance, establishing a new efficiency frontier for large‑scale clustering.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

PerformanceClusteringGPUK-MeansMemory EfficiencyFlashAssignSort-Inverse Update
DeepHub IMBA
Written by

DeepHub IMBA

A must‑follow public account sharing practical AI insights. Follow now. internet + machine learning + big data + architecture = IMBA

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.