How GPU‑Accelerated ANN Search Cuts Costs and Boosts Throughput in High‑Volume Retrieval

This article analyzes a GPU‑based approximate nearest neighbor (ANN) retrieval solution built on NVIDIA's RAFT library, detailing algorithm selection, offline indexing tricks, batch online search design, performance results on a 25‑million‑vector workload, and cost‑saving implications for large‑scale search services.

Baidu Tech Salon

Introduction

Approximate nearest neighbor (ANN) vector search is widely used to improve user experience in natural‑language and deep‑learning driven retrieval systems. While CPU‑based ANN solutions are common, GPUs offer far greater parallelism, prompting research into GPU‑accelerated ANN for high‑throughput scenarios.

ANN Overview

ANN algorithms fall into four broad categories: tree‑based, LSH‑based, quantization‑based, and graph‑based. Three representative GPU‑friendly methods are introduced below:

IVF_FLAT : Uses a coarse quantizer to partition the vector space into inverted lists; only vectors in selected lists are examined during search.

IVF_PQ : Extends IVF with product quantization (PQ) and residual quantization to reduce memory usage, at the cost of some recall loss.

CAGRA : A graph‑based ANN algorithm built on NN‑descent and edge pruning, optimized for GPU parallelism.
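To make the inverted‑list idea concrete, here is a minimal CPU‑only sketch of IVF_FLAT (my own illustration, not RAFT's implementation): vectors are bucketed under their nearest centroid at build time, and search scans only the n_probe lists whose centroids are closest to the query.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Vec = std::vector<float>;

// Squared Euclidean distance between two equal-length vectors.
float l2(const Vec& a, const Vec& b) {
    float d = 0.f;
    for (size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

struct IvfFlat {
    std::vector<Vec> centroids;           // coarse quantizer (assumed pre-trained)
    std::vector<std::vector<int>> lists;  // inverted lists of vector ids
    std::vector<Vec> data;

    size_t nearest_centroid(const Vec& v) const {
        size_t best = 0;
        float bd = std::numeric_limits<float>::max();
        for (size_t c = 0; c < centroids.size(); ++c) {
            float d = l2(v, centroids[c]);
            if (d < bd) { bd = d; best = c; }
        }
        return best;
    }

    // Build: each vector goes into the list of its nearest centroid.
    void build(std::vector<Vec> dataset) {
        data = std::move(dataset);
        lists.assign(centroids.size(), {});
        for (int id = 0; id < (int)data.size(); ++id)
            lists[nearest_centroid(data[id])].push_back(id);
    }

    // Search: rank centroids by distance, scan only the n_probe closest lists,
    // and return the id of the best match found there.
    int search(const Vec& q, size_t n_probe) const {
        std::vector<size_t> order(centroids.size());
        for (size_t i = 0; i < order.size(); ++i) order[i] = i;
        std::sort(order.begin(), order.end(), [&](size_t a, size_t b) {
            return l2(q, centroids[a]) < l2(q, centroids[b]);
        });
        int best = -1;
        float bd = std::numeric_limits<float>::max();
        for (size_t p = 0; p < n_probe && p < order.size(); ++p)
            for (int id : lists[order[p]]) {
                float d = l2(q, data[id]);
                if (d < bd) { bd = d; best = id; }
            }
        return best;
    }
};
```

The key trade‑off this sketch exposes: a larger n_probe raises recall (more lists scanned) at the cost of more distance computations, which is exactly the knob GPU parallelism makes cheap.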

Business Scenario

The target use case is a Baidu search service that expands each user query to retrieve a diverse set of results. Requirements include millisecond‑level latency, >90% recall, and billions of indexed vectors, demanding ultra‑high throughput and lower retrieval cost.

Implementation

3.1 Index Algorithm Selection

The NVIDIA RAFT library provides four GPU‑ANN algorithms (BruteForce, IVF_FLAT, IVF_PQ, CAGRA) with a unified interface for building, serializing, loading, and searching indexes. After benchmarking, IVF_INT8 (a variant of IVF_FLAT with int8 quantization) was chosen as the best fit.

// T is the data type, IdxT is the ID type
template<typename T, typename IdxT>
auto build(raft::resources const& handle,
           const index_params& params,
           raft::device_matrix_view<const T, IdxT, row_major> dataset)
    -> index<T, IdxT>;

template<typename T, typename IdxT>
void serialize(raft::resources const& handle,
               const std::string& filename,
               const index<T, IdxT>& index);

template<typename T, typename IdxT>
index<T, IdxT> deserialize(raft::resources const& handle,
                           const std::string& filename);

template<typename T, typename IdxT>
void search(raft::resources const& handle,
            const search_params& params,
            const index<T, IdxT>& index,
            const T* queries,
            uint32_t n_queries,
            uint32_t k,
            IdxT* neighbors,
            float* distances,
            rmm::mr::device_memory_resource* mr = nullptr);

3.2 Offline Indexing

To increase per‑shard capacity, the team applied int8 quantization together with NVIDIA's "super‑memory" technique — CUDA managed (unified) memory, exposed through RMM — which transparently pages index data between host and GPU memory when the index exceeds available GPU memory.

// Use CUDA managed (unified) memory as the default device allocator, so that
// allocations larger than physical GPU memory can be paged in on demand.
std::shared_ptr<rmm::mr::device_memory_resource> resource_ptr =
    std::make_shared<rmm::mr::managed_memory_resource>();
rmm::mr::set_current_device_resource(resource_ptr.get());
// Hand the managed resource to RAFT so its allocations also go through it.
raft::device_resources dev_resources(rmm::cuda_stream_per_thread, {nullptr}, resource_ptr);
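The article does not spell out the int8 scheme itself. A common choice — an assumption on my part, not a statement about RAFT's internals — is symmetric per‑vector quantization, where each vector is scaled so its largest‑magnitude component maps to 127:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric per-vector int8 quantization: scale = max|x| / 127,
// q_i = round(x_i / scale); dequantize with x_i ≈ q_i * scale.
struct QuantizedVec {
    std::vector<int8_t> q;
    float scale;
};

QuantizedVec quantize_int8(const std::vector<float>& x) {
    float mx = 0.f;
    for (float v : x) mx = std::max(mx, std::fabs(v));
    QuantizedVec out{std::vector<int8_t>(x.size()),
                     mx > 0.f ? mx / 127.f : 1.f};  // guard all-zero vectors
    for (size_t i = 0; i < x.size(); ++i)
        out.q[i] = (int8_t)std::lround(x[i] / out.scale);
    return out;
}
```

This cuts per‑vector storage to a quarter of float32 (plus one float for the scale), which is where the capacity gain comes from; the rounding error is what costs a little recall.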

Key tuning parameters include kmeans_trainset_fraction (the fraction of the dataset sampled to train the k‑means centroids) and n_lists (the number of inverted lists, which fixes the average list length at dataset size / n_lists). Empirically, keeping the average list length between 1,000 and 2,000 balances the cost of comparing against centroids with the cost of scanning the selected lists.
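A hypothetical sizing helper (the name and the 1,500 midpoint are my own choices) that applies this rule of thumb:

```cpp
#include <algorithm>
#include <cstdint>

// Pick n_lists so the average inverted-list length (n_vectors / n_lists)
// lands in the recommended 1,000-2,000 band; aim for the middle (1,500)
// and clamp to at least one list for tiny datasets.
uint32_t choose_n_lists(uint64_t n_vectors, uint64_t target_list_len = 1500) {
    return (uint32_t)std::max<uint64_t>(1, n_vectors / target_list_len);
}
```

For the 25‑million‑vector workload discussed later this yields 16,666 lists, i.e. an average list length of about 1,500 — inside the recommended band.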

3.3 Online Retrieval

The online pipeline batches incoming queries, copies the batch to GPU memory, runs IVF_INT8 search, and copies results back. Distance computation is parallelized using interleaved_scan_kernel, and the top‑K results are selected with matrix::detail::select_k. Batch size is dynamically adjusted based on recent traffic using a pre‑trained estimator to avoid excessive queuing latency.
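A minimal sketch of the batching idea, with a plain size cap standing in for the traffic‑based estimator described above (all names are illustrative, and deadline handling is reduced to a flag the caller sets):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Accumulate query ids until either the current batch-size cap is reached
// or the caller's queuing deadline expires, then hand off one GPU batch.
struct Batcher {
    size_t cap;                 // dynamic cap, e.g. from a traffic estimator
    std::vector<int> pending;   // query ids awaiting a GPU batch

    // Returns a full batch when the cap is hit or the deadline expired;
    // otherwise returns an empty vector meaning "keep waiting".
    std::vector<int> push(int query_id, bool deadline_expired) {
        pending.push_back(query_id);
        if (pending.size() >= cap || deadline_expired) {
            std::vector<int> batch;
            batch.swap(pending);  // hand off and reset in O(1)
            return batch;
        }
        return {};
    }
};
```

The deadline path matters: without it, a lull in traffic would strand queries in the queue, which is exactly the queuing‑latency problem the dynamic batch size is meant to avoid.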

Figure: online retrieval pipeline.

3.4 Achieved Results

On a 25‑million‑vector, 256‑dimensional dataset, the GPU solution built the index in under 10 minutes using less than 10 GB of GPU memory. A single service instance served roughly 20,000 queries per second (top‑30 results) with millisecond‑level latency, delivering 30–50% cost savings compared with a CPU‑only baseline.
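The reported footprint is consistent with back‑of‑the‑envelope arithmetic: 25 M vectors × 256 dimensions × 1 byte (int8) is 6.4 GB of raw vector data, versus 25.6 GB in float32, leaving headroom under the 10 GB figure for centroids and list metadata:

```cpp
#include <cassert>
#include <cstdint>

// Raw vector storage for an IVF-style index: bytes_per_dim is 1 for int8
// and 4 for float32; centroids and list metadata add overhead on top.
uint64_t raw_bytes(uint64_t n_vectors, uint64_t dims, uint64_t bytes_per_dim) {
    return n_vectors * dims * bytes_per_dim;
}
```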

Discussion

GPU acceleration is advantageous when the workload has large computational scale, high recall requirements, and algorithms that can be expressed with minimal data dependencies. It is less suitable for low‑traffic or highly sequential tasks, or when GPU memory limits cannot be mitigated.

Future Work

The team plans to further optimize the RAFT IVF_INT8 implementation with NVIDIA, and to explore additional GPU‑centric ANN algorithms to broaden applicability.


Written by Baidu Tech Salon

Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.