Scalable Overload-Aware Graph-Based Index Construction for 10‑Billion‑Scale Vector Similarity Search (SOGAIC)
The paper introduces SOGAIC, a scalable overload‑aware graph‑based index construction system for billion‑scale vector similarity search that uses adaptive overlapping partitioning and load‑balanced distributed scheduling to cut construction time by 47.3% while maintaining high recall.
Abstract At WWW2025, XiaoHongShu presented SOGAIC, a scalable overload‑aware graph‑based index construction system for 10‑billion‑scale vector similarity search. By employing an adaptive overlapping partition algorithm and a load‑balanced task scheduler with hierarchical sub‑graph merging, SOGAIC reduces average construction time by 47.3% on real datasets and is already deployed in XiaoHongShu’s online vector retrieval engine.
1. Overview Approximate nearest neighbor search (ANNS) is critical for search engines, recommendation systems, and retrieval‑augmented generation in large language models. Graph‑based ANNS methods (e.g., HNSW, NSG, NGT) outperform traditional space‑partitioning approaches, but existing large‑scale graph construction techniques such as DiskANN struggle with scalability and frequent updates.
1.2 Challenges Existing overlapping partition strategies cause redundant overlaps and resource overloads due to uneven data distribution. Fixed‑overlap methods (e.g., DiskANN) lead to overloaded sub‑sets, while density‑based methods (e.g., DBSCAN) are too costly for ultra‑large datasets. Moreover, current graph construction pipelines lack efficient, scalable scheduling for sub‑graph building and merging.
1.3 Motivation To overcome these limitations, SOGAIC introduces an adaptive data partitioning strategy and a load‑balanced distributed scheduling framework that accelerates index construction while preserving graph quality.
2. System Overview SOGAIC consists of two stages: (1) an adaptive data partition algorithm that creates overlapping subsets while controlling overload boundaries, and (2) a distributed task scheduler that builds sub‑graphs in parallel and merges them hierarchically.
2.2 Adaptive Overload‑Aware Partitioning The algorithm first determines a per‑partition memory capacity \(\Gamma\) based on container limits, then estimates the minimum number of partitions \(\Phi\) needed for highly imbalanced data. After a small‑scale K‑means clustering to obtain initial centroids, vectors are assigned to partitions using a dynamic distance constraint and an adaptive relaxation factor \(\varepsilon\). When a partition reaches the overload threshold \(\Gamma\), the average distance parameter is reset, ensuring every vector belongs to at least one valid partition. The assignment process is parallelized and avoids redundant computation.
2.3 Distributed Graph Construction & Hierarchical Sub‑Graph Merging SOGAIC schedules sub‑graph construction tasks by sorting partitions by size and assigning larger tasks first to balance load. Tasks are dynamically allocated to the least‑loaded machine, minimizing incremental load. During merging, a hierarchical strategy leverages Spark or MapReduce to perform multi‑level in‑memory merges, reducing complexity from \(O(n)\) to \(O(\log n)\) and avoiding I/O bottlenecks. Overlap‑heavy sub‑graphs are merged earlier to further speed up the pipeline.
3. Experiments Experiments were conducted on five datasets (SIFT1M, SIFT1B, ISD3B, GloVe, VDD10B). Baselines included HNSW (Faiss), DiskANN, and SPTAG. Metrics measured construction time, recall, scalability, and resource usage.
3.1 Results SOGAIC achieved an average 47.3% reduction in construction time across all datasets larger than SIFT1M. It successfully handled billion‑scale datasets where HNSW ran out of memory and DiskANN failed due to severe partition imbalance. The adaptive overlap factor (max \(\Omega=4\), average 1.93 partitions per vector) reduced redundant computation while preventing overload.
3.2 Scalability Construction time decreased near‑linearly with added CPU cores. On the VDD10B dataset, SOGAIC completed in ~80 hours with 128 CPUs and under one day with 512 CPUs, outperforming SPTAG and DiskANN under the same configurations.
Conclusion SOGAIC provides a scalable, overload‑aware solution for building graph indexes on hundred‑billion‑scale vector databases, delivering significant speedups, strong scalability, and successful deployment in XiaoHongShu’s production search and recommendation pipelines.
Xiaohongshu Tech REDtech
Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.