
RTAMS-GANNS: A Real-Time Adaptive Multi-Stream GPU System for Online Approximate Nearest Neighbor Search

RTAMS‑GANNS, the award‑winning real‑time adaptive multi‑stream GPU system for online approximate nearest neighbor search, eliminates costly memory allocations and serial execution by using a dynamic memory‑block insertion algorithm and separate CUDA streams, cutting latency by 40‑80% and reliably serving over 100 million daily users in production.

Xiaohongshu Tech REDtech

At the CIKM 2024 conference, the Xiaohongshu Vector Retrieval team received the Best Application Research Paper award for their work titled "A Real-Time Adaptive Multi-Stream GPU System for Online Approximate Nearest Neighbor Search" (RTAMS‑GANNS). The paper can be accessed at https://dl.acm.org/doi/10.1145/3627673.3680054.

Approximate Nearest Neighbor Search (ANNS) is widely used in modern search, recommendation, and Retrieval‑Augmented Generation (RAG) systems. Traditional CPU‑based ANNS suffers from high compute load and memory‑bandwidth bottlenecks at large scale, prompting a shift toward GPU acceleration. Existing GPU‑based systems such as Faiss and RAFT achieve impressive offline performance but cannot efficiently handle the frequent real‑time vector insertions required by online services.
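For context, what ANNS methods approximate is the exact brute‑force scan, whose latency grows linearly with corpus size. A minimal pure‑Python sketch of that baseline (not from the paper; names and data are illustrative):

```python
import math

def exact_knn(query, vectors, k=1):
    """Exact k-nearest-neighbor search by brute force.

    This is the O(N * d) baseline that ANNS methods approximate:
    every vector is scanned, so latency grows with corpus size.
    """
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: math.dist(query, vectors[i]),  # Euclidean distance
    )
    return ranked[:k]

corpus = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.1]]
print(exact_knn([1.0, 1.0], corpus, k=2))  # → [1, 3]
```

ANNS structures (IVF lists, graphs, etc.) trade a small amount of recall against this exact scan for orders‑of‑magnitude lower latency, which is why they dominate at the scales discussed here.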

The authors identify two major limitations of current systems: (1) inefficient memory‑block management for vector insertion, leading to excessive GPU memory allocation and copy overhead; (2) a single‑stream serial execution model that blocks retrieval kernels when insertion kernels run, causing unacceptable latency in high‑QPS scenarios.

To overcome these challenges, RTAMS‑GANNS introduces three key innovations:

1) A dynamic, memory‑block‑based insertion algorithm that avoids frequent allocations by using a central memory pool and atomic operations to locate insertion positions, followed by an in‑place rearrangement step when a block exceeds a size threshold.

2) A multi‑stream parallel execution framework that assigns separate CUDA streams to retrieval and insertion kernels, managed by a stream‑cache resource pool. This design eliminates serialization, reduces kernel wait time, and fully exploits GPU parallelism.

3) Extensive experimental validation on both public (SIFT1M, DSSMRT40M) and industrial datasets, demonstrating latency reductions of 40%–80% across varying QPS levels. The system has been deployed on T4/A10 GPU servers for more than six months, serving over 100 million daily active users in production search and recommendation pipelines.
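The multi‑stream framework in (2) can be illustrated with a CPU‑side analogy, since the actual system uses CUDA streams on the GPU. In this sketch (an assumption, not the paper's code), each "stream" is a worker draining its own task queue, so long‑running insertion work never queues in front of latency‑sensitive retrieval the way it would under a single serial stream:

```python
import queue
import threading

class Stream:
    """CPU-side stand-in for a CUDA stream: an independent FIFO queue
    of kernels drained by its own worker, so two streams never block
    each other's work."""

    def __init__(self):
        self.tasks = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self):
        while True:
            task = self.tasks.get()
            if task is None:          # shutdown sentinel
                break
            task()

    def launch(self, kernel):
        self.tasks.put(kernel)        # enqueue without waiting

    def synchronize(self):
        self.tasks.put(None)          # drain remaining work, then join
        self.worker.join()

retrieval_stream, insertion_stream = Stream(), Stream()
results = []

insertion_stream.launch(lambda: results.append("inserted batch"))
retrieval_stream.launch(lambda: results.append("answered query"))

retrieval_stream.synchronize()
insertion_stream.synchronize()
print(sorted(results))  # → ['answered query', 'inserted batch']
```

In the real system the same separation is achieved with `cudaStreamCreate` and per‑stream kernel launches, with a stream‑cache resource pool amortizing stream creation cost.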

Additional studies examine the impact of rearrangement on latency, showing sub‑50 ms overhead, and evaluate different memory‑block sizes, revealing diminishing returns beyond a block size of 1024 vectors.
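The insertion path in (1), including the rearrangement step whose overhead is discussed above, can be sketched in a few lines. This is a single‑threaded Python illustration under assumed names (`BlockPool`, `threshold`); the production system uses CUDA atomics (e.g. `atomicAdd`) on pre‑allocated GPU memory rather than a Python counter:

```python
import itertools

class BlockPool:
    """Sketch of dynamic memory-block insertion (names are assumptions).

    Blocks come from a pre-allocated central pool; an atomic-style
    counter hands out slots without per-insert allocation, and a block
    that crosses `threshold` is rearranged (compacted) in place.
    """

    def __init__(self, block_size=4, threshold=3):
        self.block_size = block_size
        self.threshold = threshold
        self.blocks = [[None] * block_size]   # central pre-allocated pool
        self.counter = itertools.count()      # stand-in for CUDA atomicAdd
        self.rearrangements = 0

    def insert(self, vector):
        slot = next(self.counter)             # atomically claim a position
        block_idx, offset = divmod(slot, self.block_size)
        while block_idx >= len(self.blocks):  # grow pool block-by-block
            self.blocks.append([None] * self.block_size)
        self.blocks[block_idx][offset] = vector
        if offset + 1 == self.threshold:      # size threshold crossed
            self._rearrange(block_idx)

    def _rearrange(self, block_idx):
        # In-place compaction once a block passes the size threshold.
        block = self.blocks[block_idx]
        live = [v for v in block if v is not None]
        self.blocks[block_idx] = live + [None] * (self.block_size - len(live))
        self.rearrangements += 1

pool = BlockPool()
for i in range(6):
    pool.insert([float(i)])
print(len(pool.blocks), pool.rearrangements)  # → 2 1
```

The block‑size ablation above corresponds to tuning `block_size`: larger blocks mean fewer allocations and rearrangements but more wasted slack per block, which is consistent with the diminishing returns the paper reports beyond 1024 vectors.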

The paper also discloses two related patents filed in March 2024 (CN118096496A and CN118350979A).

Tags: GPU, performance evaluation, high QPS, Multi‑Stream, approximate nearest neighbor, Online Search, Vector Insertion
Written by

Xiaohongshu Tech REDtech

Official account of the Xiaohongshu technology team, sharing technical innovations and engineering insights.
