
RTAMS-GANNS: A Real-Time Adaptive Multi-Stream GPU System for Online Approximate Nearest Neighbor Search

RTAMS‑GANNS, the award‑winning real‑time adaptive multi‑stream GPU system for online approximate nearest neighbor search, eliminates costly memory allocations and serial execution by using a dynamic memory‑block insertion algorithm and separate CUDA streams, cutting latency by 40‑80% and reliably serving over 100 million daily users in production.

Xiaohongshu Tech REDtech

At the CIKM 2024 conference, the Xiaohongshu Vector Retrieval team received the Best Application Research Paper award for their work titled "A Real-Time Adaptive Multi-Stream GPU System for Online Approximate Nearest Neighbor Search" (RTAMS‑GANNS). The paper can be accessed at https://dl.acm.org/doi/10.1145/3627673.3680054.

Approximate Nearest Neighbor Search (ANNS) is widely used in modern search, recommendation, and Retrieval‑Augmented Generation (RAG) systems. Traditional CPU‑based ANNS suffers from high compute load and memory‑bandwidth bottlenecks at large scale, prompting a shift toward GPU acceleration. Existing GPU‑based systems such as Faiss and RAFT achieve impressive offline performance but cannot efficiently handle the frequent real‑time vector insertions required by online services.
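For context, what ANNS methods approximate is the exact brute‑force scan, whose latency grows linearly with corpus size. A minimal pure‑Python sketch of that baseline (not from the paper; names and data are illustrative):

```python
import math

def exact_knn(query, vectors, k=1):
    """Exact k-nearest-neighbor search by brute force.

    This is the O(N * d) baseline that ANNS methods approximate:
    every vector is scanned, so latency grows with corpus size.
    """
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: math.dist(query, vectors[i]),  # Euclidean distance
    )
    return ranked[:k]

corpus = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.1]]
print(exact_knn([1.0, 1.0], corpus, k=2))  # → [1, 3]
```

ANNS structures (IVF lists, graphs, etc.) trade a small amount of recall against this exact scan for orders‑of‑magnitude lower latency, which is why they dominate at the scales discussed here.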

The authors identify two major limitations of current systems: (1) inefficient memory‑block management for vector insertion, leading to excessive GPU memory allocation and copy overhead; (2) a single‑stream serial execution model that blocks retrieval kernels when insertion kernels run, causing unacceptable latency in high‑QPS scenarios.

To overcome these challenges, RTAMS‑GANNS introduces three key innovations:

1) A dynamic, memory‑block‑based insertion algorithm that avoids frequent allocations by using a central memory pool and atomic operations to locate insertion positions, followed by an in‑place rearrangement step when a block exceeds a size threshold.

2) A multi‑stream parallel execution framework that assigns separate CUDA streams to retrieval and insertion kernels, managed by a stream‑cache resource pool. This design eliminates serialization, reduces kernel wait time, and fully exploits GPU parallelism.

3) Extensive experimental validation on both public (SIFT1M, DSSMRT40M) and industrial datasets, demonstrating latency reductions of 40%–80% across varying QPS levels. The system has been deployed on T4/A10 GPU servers for more than six months, serving over 100 million daily active users in production search and recommendation pipelines.
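The multi‑stream framework in (2) can be illustrated with a CPU‑side analogy, since the actual system uses CUDA streams on the GPU. In this sketch (an assumption, not the paper's code), each "stream" is a worker draining its own task queue, so long‑running insertion work never queues in front of latency‑sensitive retrieval the way it would under a single serial stream:

```python
import queue
import threading

class Stream:
    """CPU-side stand-in for a CUDA stream: an independent FIFO queue
    of kernels drained by its own worker, so two streams never block
    each other's work."""

    def __init__(self):
        self.tasks = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self):
        while True:
            task = self.tasks.get()
            if task is None:          # shutdown sentinel
                break
            task()

    def launch(self, kernel):
        self.tasks.put(kernel)        # enqueue without waiting

    def synchronize(self):
        self.tasks.put(None)          # drain remaining work, then join
        self.worker.join()

retrieval_stream, insertion_stream = Stream(), Stream()
results = []

insertion_stream.launch(lambda: results.append("inserted batch"))
retrieval_stream.launch(lambda: results.append("answered query"))

retrieval_stream.synchronize()
insertion_stream.synchronize()
print(sorted(results))  # → ['answered query', 'inserted batch']
```

In the real system the same separation is achieved with `cudaStreamCreate` and per‑stream kernel launches, with a stream‑cache resource pool amortizing stream creation cost.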

Additional studies examine the impact of rearrangement on latency, showing sub‑50 ms overhead, and evaluate different memory‑block sizes, revealing diminishing returns beyond a block size of 1024 vectors.
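The insertion path in (1), including the rearrangement step whose overhead is discussed above, can be sketched in a few lines. This is a single‑threaded Python illustration under assumed names (`BlockPool`, `threshold`); the production system uses CUDA atomics (e.g. `atomicAdd`) on pre‑allocated GPU memory rather than a Python counter:

```python
import itertools

class BlockPool:
    """Sketch of dynamic memory-block insertion (names are assumptions).

    Blocks come from a pre-allocated central pool; an atomic-style
    counter hands out slots without per-insert allocation, and a block
    that crosses `threshold` is rearranged (compacted) in place.
    """

    def __init__(self, block_size=4, threshold=3):
        self.block_size = block_size
        self.threshold = threshold
        self.blocks = [[None] * block_size]   # central pre-allocated pool
        self.counter = itertools.count()      # stand-in for CUDA atomicAdd
        self.rearrangements = 0

    def insert(self, vector):
        slot = next(self.counter)             # atomically claim a position
        block_idx, offset = divmod(slot, self.block_size)
        while block_idx >= len(self.blocks):  # grow pool block-by-block
            self.blocks.append([None] * self.block_size)
        self.blocks[block_idx][offset] = vector
        if offset + 1 == self.threshold:      # size threshold crossed
            self._rearrange(block_idx)

    def _rearrange(self, block_idx):
        # In-place compaction once a block passes the size threshold.
        block = self.blocks[block_idx]
        live = [v for v in block if v is not None]
        self.blocks[block_idx] = live + [None] * (self.block_size - len(live))
        self.rearrangements += 1

pool = BlockPool()
for i in range(6):
    pool.insert([float(i)])
print(len(pool.blocks), pool.rearrangements)  # → 2 1
```

The block‑size ablation above corresponds to tuning `block_size`: larger blocks mean fewer allocations and rearrangements but more wasted slack per block, which is consistent with the diminishing returns the paper reports beyond 1024 vectors.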

The paper also discloses two related patents filed in March 2024 (CN118096496A and CN118350979A).

Tags: GPU, performance evaluation, high QPS, Multi‑Stream, approximate nearest neighbor, Online Search, Vector Insertion
Written by

Xiaohongshu Tech REDtech

Official account of the Xiaohongshu technology team, sharing technical innovations and engineering insights.
