Boosting ANN Search with GPU: Inside RAFT’s IVF_INT8 Implementation

This article examines how Baidu and NVIDIA leveraged the open‑source RAFT library to build a GPU‑accelerated approximate nearest neighbor (ANN) retrieval system, detailing algorithm choices, offline indexing, online batch processing, performance results, and practical guidelines for deploying ANN on GPUs.


This article presents a GPU‑accelerated approximate nearest neighbor (ANN) online retrieval system built on NVIDIA's RAFT library and integrated with the Milvus vector database. It targets scenarios requiring millisecond‑level latency, recall above 90%, and billion‑scale indexes.

ANN Algorithm Primer

IVF_FLAT: Partitions the vector space with a coarse quantizer (k‑means centroids), builds an inverted list per centroid, and searches only the lists closest to the query, sharply reducing the number of distance computations.
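
To make the mechanism concrete, here is a minimal CPU‑side sketch of the IVF search idea (illustrative code only, not RAFT's implementation; all names are invented):

// Minimal IVF search sketch: probe only the lists nearest the query.
#include <algorithm>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

float l2(const float* a, const float* b, size_t dim) {
  float s = 0.f;
  for (size_t i = 0; i < dim; ++i) { float d = a[i] - b[i]; s += d * d; }
  return s;  // squared L2 distance
}

int ivf_search(const std::vector<std::vector<float>>& centroids,
               const std::vector<std::vector<int>>& lists,  // vector ids per centroid
               const std::vector<std::vector<float>>& data,
               const std::vector<float>& query, size_t n_probes) {
  const size_t dim = query.size();
  // Coarse quantizer step: rank centroids by distance to the query.
  std::vector<std::pair<float, size_t>> order;
  for (size_t c = 0; c < centroids.size(); ++c)
    order.emplace_back(l2(centroids[c].data(), query.data(), dim), c);
  std::partial_sort(order.begin(), order.begin() + n_probes, order.end());
  // Exhaustively scan only the n_probes selected lists.
  int best = -1;
  float best_d = std::numeric_limits<float>::max();
  for (size_t p = 0; p < n_probes; ++p)
    for (int id : lists[order[p].second]) {
      float d = l2(data[id].data(), query.data(), dim);
      if (d < best_d) { best_d = d; best = id; }
    }
  return best;  // id of the best candidate found in the probed lists
}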

IVF_PQ: Extends IVF_FLAT with product quantization (PQ) of residuals, storing compact PQ codes instead of full vectors to save memory at the cost of some precision.
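
The memory savings are straightforward to quantify. Assuming, for illustration, 64 sub‑quantizers with 8‑bit codes on 256‑dimensional float vectors (the article does not specify a PQ configuration):

// Back-of-the-envelope PQ compression with assumed parameters.
constexpr int dim = 256;
constexpr int bytes_flat = dim * sizeof(float);  // 1024 B per raw float32 vector
constexpr int m = 64;                            // assumed sub-quantizer count
constexpr int bytes_pq = m;                      // one 8-bit code each: 64 B per vector
static_assert(bytes_flat / bytes_pq == 16, "16x compression in this configuration");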

CAGRA: A GPU‑optimized graph‑based ANN algorithm that combines NN‑descent graph construction with NGT‑style edge pruning and merges forward and reverse edges for fast search.
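
Search on such a graph amounts to a greedy walk: start from an entry point and repeatedly hop to whichever neighbor is closer to the query. A minimal CPU sketch of that idea (illustrative only; CAGRA's GPU kernels are considerably more sophisticated):

// Greedy best-neighbor descent on a prebuilt kNN graph; reuses l2() from the
// IVF sketch above. graph[v] holds v's out-neighbors.
#include <vector>

int greedy_search(const std::vector<std::vector<int>>& graph,
                  const std::vector<std::vector<float>>& data,
                  const std::vector<float>& query, int entry) {
  int cur = entry;
  float cur_d = l2(data[cur].data(), query.data(), query.size());
  for (bool improved = true; improved;) {
    improved = false;
    for (int nb : graph[cur]) {  // hop to any strictly closer neighbor
      float d = l2(data[nb].data(), query.data(), query.size());
      if (d < cur_d) { cur = nb; cur_d = d; improved = true; }
    }
  }
  return cur;  // a local minimum of distance-to-query on the graph
}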

Business Scenario Selection

Constraints: millisecond‑level latency, >90% recall, billion‑scale index.

Scenario: ultra‑high‑throughput search.

Goal: lower retrieval cost.

Given the massive throughput requirement, a GPU‑centric solution was chosen.

Design and Implementation

3.1 Index Algorithm Choice

RAFT provides four GPU ANN algorithms: BruteForce, IVF_FLAT, IVF_PQ, and CAGRA. Benchmarking on the target workload showed that IVF_FLAT combined with int8 quantization (referred to throughout as IVF_INT8) offers the best balance of memory usage, recall, and throughput.
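
For orientation, building such an index through RAFT's public IVF‑Flat API with an int8 element type looks roughly like the following sketch (parameter values are illustrative; the article does not publish its exact configuration):

// Hedged sketch: int8 IVF-Flat index construction with RAFT's mdspan API.
#include <cstdint>
#include <raft/core/device_mdspan.hpp>
#include <raft/core/device_resources.hpp>
#include <raft/neighbors/ivf_flat.cuh>

auto build_index(raft::device_resources const& res,
                 const int8_t* d_data,  // [n_rows x dim], device memory
                 int64_t n_rows, int64_t dim) {
  raft::neighbors::ivf_flat::index_params params;
  params.n_lists = 16384;  // illustrative; see the list-length guidance in 3.2
  params.metric  = raft::distance::DistanceType::L2Expanded;
  auto dataset = raft::make_device_matrix_view<const int8_t, int64_t>(d_data, n_rows, dim);
  return raft::neighbors::ivf_flat::build(res, params, dataset);
}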

3.2 Offline Index Construction

Use int8 quantization to compress vectors to a quarter of their float32 size, and CUDA managed memory ("super‑memory") so index data can spill safely to host RAM when GPU memory is exhausted.
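
The article doesn't spell out the quantization scheme; one common choice consistent with a 4× reduction is symmetric max‑abs scalar quantization, sketched here with invented names:

// Assumed scheme: map floats to int8 with one shared scale (1 B vs 4 B per dim).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

std::vector<int8_t> quantize_int8(const std::vector<float>& v, float max_abs) {
  const float scale = 127.0f / max_abs;  // max_abs: largest |value| in the dataset
  std::vector<int8_t> q(v.size());
  for (size_t i = 0; i < v.size(); ++i)
    q[i] = static_cast<int8_t>(std::lround(std::clamp(v[i] * scale, -127.0f, 127.0f)));
  return q;
}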

Parameter tuning:

Control the average chain length (the average number of vectors per inverted list) via kmeans_trainset_fraction and n_lists; empirically, keeping the average between 1,000 and 2,000 gives the best performance (see the sizing sketch after the code below).

Balance list lengths with RAFT’s kmeans_balanced component to avoid long‑tail computation.

// Example: route allocations through CUDA managed memory ("super-memory") so
// index data can spill to host RAM instead of failing on device OOM.
#include <raft/core/device_resources.hpp>
#include <rmm/mr/device/managed_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

std::shared_ptr<rmm::mr::device_memory_resource> resource_ptr =
    std::make_shared<rmm::mr::managed_memory_resource>();
rmm::mr::set_current_device_resource(resource_ptr.get());
// {nullptr}: no stream pool; resource_ptr also backs RAFT's workspace allocations.
raft::device_resources dev_resources(rmm::cuda_stream_per_thread, {nullptr}, resource_ptr);

// Balanced k-means declaration (raft::cluster::kmeans_balanced::fit), used to
// keep inverted-list sizes even and avoid long-tail lists:
template <typename DataT, typename MathT, typename IndexT, typename MappingOpT = raft::identity_op>
void fit(const raft::resources& handle,
         const kmeans_balanced_params& params,
         raft::device_matrix_view<const DataT, IndexT> X,
         raft::device_matrix_view<MathT, IndexT> centroids,
         MappingOpT mapping_op = raft::identity_op());
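
Tying the sizing rule above to concrete parameters, a hypothetical helper (the 1,500 midpoint is illustrative, picked from the 1,000‑2,000 band):

// Choose n_lists so the average inverted-list length lands in the target band.
#include <cstdint>

uint32_t pick_n_lists(uint64_t n_rows, uint64_t target_avg_list_len = 1500) {
  return static_cast<uint32_t>(n_rows / target_avg_list_len);
}
// e.g. 25,000,000 vectors / 1,500 ≈ 16,666 lists. A larger
// kmeans_trainset_fraction improves centroid quality at higher build cost.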

3.3 Online Retrieval

Incoming queries are batched, copied to GPU memory, processed with the IVF_INT8 search kernel, and results are copied back. Batch size is adjusted in real time based on traffic forecasts to keep latency within the millisecond budget while maximizing throughput.

// RAFT IVF-Flat search API (raw-pointer overload, raft::neighbors::ivf_flat):
template <typename T, typename IdxT>
void search(const raft::resources& handle,
            const search_params& params,
            const index<T, IdxT>& idx,
            const T* queries,    // [n_queries x dim], device memory
            uint32_t n_queries,
            uint32_t k,          // neighbors returned per query
            IdxT* neighbors,     // [n_queries x k] output ids
            float* distances,    // [n_queries x k] output distances
            rmm::mr::device_memory_resource* mr = nullptr);
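
A hedged usage sketch of the overload above (n_probes, k, and buffer names are assumptions, not the article's production settings):

// Run one batch of queries against an int8 IVF-Flat index.
void search_batch(raft::device_resources const& res,
                  const raft::neighbors::ivf_flat::index<int8_t, int64_t>& idx,
                  const int8_t* d_queries,  // [n_queries x dim], device memory
                  uint32_t n_queries,
                  int64_t* d_neighbors,     // [n_queries x 30] output ids
                  float* d_distances) {     // [n_queries x 30] output distances
  raft::neighbors::ivf_flat::search_params sp;
  sp.n_probes = 64;  // lists scanned per query: the recall/latency knob
  raft::neighbors::ivf_flat::search(res, sp, idx, d_queries, n_queries,
                                    /*k=*/30, d_neighbors, d_distances);
  res.sync_stream();  // results are valid only after the stream completes
}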

3.4 Achieved Results

For a 25 million × 256‑dimensional dataset, the IVF_INT8 index occupies less than 10 GB of GPU memory and is built in under 10 minutes. A single GPU instance serves approximately 20 k queries per second (top‑30 results) with millisecond‑level latency, delivering a 30‑50% cost reduction compared with a CPU‑only baseline.
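
These figures are internally consistent: 25 million vectors × 256 dimensions × 1 byte (int8) ≈ 6.4 GB of raw vector data, leaving headroom for centroids and list metadata within the sub‑10 GB footprint; the same vectors in float32 would occupy roughly 25.6 GB.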

Discussion and Outlook

4.1 When to Use GPU for ANN

Suitable for high‑throughput, high‑recall workloads where computation dominates and the index fits (or can be made to fit) GPU memory.

Unsuitable for low‑traffic or highly sequential workloads, or when the dataset exceeds GPU memory limits without feasible compression.

4.2 Future Work

Planned improvements include deeper optimization of the IVF_INT8 implementation in collaboration with NVIDIA and exploration of emerging GPU‑friendly ANN algorithms.

References

RAFT library (GitHub): https://github.com/rapidsai/raft

Milvus vector database (GitHub): https://github.com/milvus-io/milvus
