GPU-Accelerated Mixed Vector-Scalar Retrieval System for Meituan Takeaway Search
Meituan Waimai’s search team built a GPU‑accelerated, mixed vector‑and‑scalar retrieval engine that supports billions of items, achieving over 99% recall and up to 89% latency reduction by combining pre‑filtering, optimized data layouts, multi‑GPU parallelism, and FP16 precision.
Background
With the rise of big data and AI, vector search has become essential for recommendation, Q&A, and large‑language‑model applications. Traditional brute‑force search is infeasible at large scale, so Approximate Nearest Neighbor (ANN) methods such as HNSW, IVF, IVF‑PQ, and IVF‑PQ+Refine are adopted.
Evolution of Meituan Waimai Vector Indexes
HNSW
Hierarchical Navigable Small World builds a layered graph for efficient high-dimensional search, but its performance degrades under high filter ratios, which is problematic given the strong LBS (location-based) filtering in Meituan Waimai.
IVF
Inverted File partitions the vector space into clusters, reducing the number of vectors examined per query, but it requires keeping the full vector set in memory.
IVF‑PQ
Product Quantization splits each vector into sub-vectors and quantizes them against small codebooks, saving memory at the cost of some recall loss.
IVF‑PQ+Refine
Stores raw vectors on SSD, retrieves a larger candidate set with IVF-PQ, then re-ranks it against the original vectors to regain accuracy.
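As a rough, self-contained sketch of that refine step (illustrative only, not Meituan's code): approximate distances come from precomputed PQ codes via a per-query lookup table, an enlarged candidate set of k·expand items is kept, and those survivors are re-scored against the raw vectors. All names and shapes here (pq_codes, codebooks, raw_vectors) are assumptions.

```cpp
// Illustrative IVF-PQ+Refine sketch (assumed shapes, not the production code).
#include <algorithm>
#include <cstdint>
#include <vector>

// dim = M * sub_dim; each vector is stored as M uint8 codes.
// codebooks: [M][256][sub_dim] centroids learned offline.
std::vector<int> pq_refine_search(const std::vector<float>& query,          // [dim]
                                  const std::vector<uint8_t>& pq_codes,     // [n][M]
                                  const std::vector<float>& codebooks,      // [M][256][sub_dim]
                                  const std::vector<float>& raw_vectors,    // [n][dim], e.g. paged in from SSD
                                  int n, int M, int sub_dim, int k, int expand) {
    // 1. Per-query lookup table: distance from each query sub-vector to every codeword.
    std::vector<float> lut(M * 256);
    for (int m = 0; m < M; ++m)
        for (int c = 0; c < 256; ++c) {
            float d = 0.f;
            for (int j = 0; j < sub_dim; ++j) {
                float diff = query[m * sub_dim + j] - codebooks[(m * 256 + c) * (size_t)sub_dim + j];
                d += diff * diff;
            }
            lut[m * 256 + c] = d;
        }
    // 2. Approximate distance per item = sum of table lookups; keep the top (k * expand).
    std::vector<std::pair<float, int>> approx(n);
    for (int i = 0; i < n; ++i) {
        float d = 0.f;
        for (int m = 0; m < M; ++m) d += lut[m * 256 + pq_codes[i * (size_t)M + m]];
        approx[i] = {d, i};
    }
    int keep = std::min(n, k * expand);
    std::partial_sort(approx.begin(), approx.begin() + keep, approx.end());
    // 3. Refine: exact distances on the enlarged candidate set using raw vectors.
    std::vector<std::pair<float, int>> exact;
    for (int c = 0; c < keep; ++c) {
        int i = approx[c].second;
        float d = 0.f;
        for (int j = 0; j < M * sub_dim; ++j) {
            float diff = query[j] - raw_vectors[i * (size_t)M * sub_dim + j];
            d += diff * diff;
        }
        exact.push_back({d, i});
    }
    std::sort(exact.begin(), exact.end());
    std::vector<int> topk;
    for (int c = 0; c < std::min(k, (int)exact.size()); ++c) topk.push_back(exact[c].second);
    return topk;
}
```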
Geo‑aware Retrieval
Geohash encoding of merchant locations makes it possible to fold geographic distance into the similarity score, improving recall for location-based services.
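A minimal sketch of the geo side, assuming a standard geohash encoding and a simple linear blend of distance into the score (the actual weighting used by the team is not described here); geo_weight is a hypothetical tuning parameter.

```cpp
// Illustrative geohash encoding and distance-blended scoring (assumed formula).
#include <cmath>
#include <string>

std::string geohash_encode(double lat, double lng, int precision = 7) {
    static const char* base32 = "0123456789bcdefghjkmnpqrstuvwxyz";
    double lat_lo = -90.0, lat_hi = 90.0, lng_lo = -180.0, lng_hi = 180.0;
    std::string hash;
    int bit = 0, ch = 0;
    bool even = true;                          // even bit positions encode longitude
    while ((int)hash.size() < precision) {
        if (even) {
            double mid = (lng_lo + lng_hi) / 2;
            if (lng > mid) { ch = (ch << 1) | 1; lng_lo = mid; }
            else           { ch = (ch << 1);     lng_hi = mid; }
        } else {
            double mid = (lat_lo + lat_hi) / 2;
            if (lat > mid) { ch = (ch << 1) | 1; lat_lo = mid; }
            else           { ch = (ch << 1);     lat_hi = mid; }
        }
        even = !even;
        if (++bit == 5) { hash += base32[ch]; bit = 0; ch = 0; }
    }
    return hash;
}

// Blend geographic distance (km) into the vector similarity so nearby merchants rank higher.
// geo_weight is an assumed tuning parameter, not a value from the article.
double blended_score(double cosine_sim, double distance_km, double geo_weight = 0.1) {
    return cosine_sim - geo_weight * distance_km;
}
```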
Goals and Challenges
Support mixed vector + scalar retrieval with high filter ratios (>99%).
Maintain recall ≥95% while handling candidate sets of >1 billion vectors.
Keep 99th‑percentile latency (Tp99) under 20 ms.
Scale to >100 million candidates.
Existing GPU‑based solutions (Faiss, Milvus) excel at pure vector search but lack mixed retrieval capabilities, prompting a custom design.
Solution Exploration
Pre‑filter vs. Post‑filter
Pre‑filter applies scalar filters first, then vector search; post‑filter does the opposite, requiring an expansion factor N to compensate for filtered‑out results.
Post-filter was initially favored for its low development risk, but benchmarks showed insufficient recall for Flat, IVF, and IVF-PQ indexes without large expansion factors.
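To make the contrast concrete, here is a toy host-side sketch; the helper callbacks score and passes_filter are assumed, and a real system would use an ANN index rather than exhaustive scoring. Pre-filtering ranks only items that pass the scalar predicate, while post-filtering ranks everything, fetches an expanded top-(K·N), and may still come back short of K under high filter ratios.

```cpp
// Toy contrast between pre-filter and post-filter retrieval (illustrative only).
#include <algorithm>
#include <functional>
#include <vector>

using Scored = std::pair<float, int>;  // (similarity, item id)

// Assumed helpers: score(id) returns similarity to the query; passes_filter(id) is the scalar predicate.
std::vector<int> pre_filter_search(int n, int k,
                                   const std::function<float(int)>& score,
                                   const std::function<bool(int)>& passes_filter) {
    std::vector<Scored> cands;
    for (int id = 0; id < n; ++id)
        if (passes_filter(id)) cands.push_back({score(id), id});       // filter first
    std::sort(cands.rbegin(), cands.rend());                           // then rank survivors
    std::vector<int> out;
    for (int i = 0; i < std::min<int>(k, (int)cands.size()); ++i) out.push_back(cands[i].second);
    return out;
}

std::vector<int> post_filter_search(int n, int k, int expand,
                                    const std::function<float(int)>& score,
                                    const std::function<bool(int)>& passes_filter) {
    std::vector<Scored> all(n);
    for (int id = 0; id < n; ++id) all[id] = {score(id), id};          // rank everything
    int fetch = std::min(n, k * expand);                               // expansion factor N
    std::partial_sort(all.begin(), all.begin() + fetch, all.end(), std::greater<Scored>());
    std::vector<int> out;
    for (int i = 0; i < fetch && (int)out.size() < k; ++i)
        if (passes_filter(all[i].second)) out.push_back(all[i].second); // may still fall short of k
    return out;
}
```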
Pre‑filter Implementation Options
Store all raw data in GPU memory and perform both scalar filtering and vector computation on GPU – rejected due to high memory consumption and poor filter performance.
Keep all data in CPU memory, perform scalar filtering on CPU, then copy filtered vectors to GPU – rejected because data transfer latency (PCIe Gen4 64 GB/s) could not meet the latency target.
Store vectors in GPU memory, scalar attributes in CPU memory; perform scalar filtering on CPU, then pass index lists to GPU for vector computation – selected.
GPU Vector Retrieval System
Data Layout
Scalar fields are column‑stored in CPU memory with inverted indexes for fast filtering. Vectors reside in GPU VRAM, linked to CPU positions via index tables.
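A skeletal view of that layout, with assumed type and field names (ScalarColumn, GpuVectorStore, id_to_slot): scalar columns and their inverted indexes stay in host memory, the embedding matrix stays resident in device memory, and an id-to-slot table joins the two.

```cpp
// Skeletal data layout (assumed names; not the production schema).
#include <cuda_runtime.h>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// CPU side: column-stored scalar fields with inverted (value -> posting list) indexes.
struct ScalarColumn {
    std::vector<int64_t> values;                                   // row-aligned column store
    std::unordered_map<int64_t, std::vector<uint32_t>> inverted;   // value -> item ids
};

// GPU side: one dense row-major matrix of embeddings kept resident in VRAM.
struct GpuVectorStore {
    float* d_vectors = nullptr;   // [num_items x dim], device memory
    int dim = 0;
    size_t num_items = 0;

    void load(const std::vector<float>& host_vectors, int dim_) {
        dim = dim_;
        num_items = host_vectors.size() / dim_;
        cudaMalloc((void**)&d_vectors, host_vectors.size() * sizeof(float));
        cudaMemcpy(d_vectors, host_vectors.data(),
                   host_vectors.size() * sizeof(float), cudaMemcpyHostToDevice);
    }
};

// Index table linking CPU item ids to GPU row slots (identity if rows are loaded in id order).
struct HybridIndex {
    std::unordered_map<std::string, ScalarColumn> columns;   // CPU: filtering
    GpuVectorStore vectors;                                   // GPU: similarity
    std::vector<uint32_t> id_to_slot;                         // CPU id -> GPU row
};
```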
Retrieval Flow (Pre‑filter)
Scalar Filtering: CPU uses inverted indexes to produce a list of candidate IDs.
Similarity Computation: GPU reads the corresponding vectors and computes distances, returning the top-K IDs.
Result Assembly: CPU fetches full records for the top-K IDs.
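A minimal CUDA sketch of the GPU half of this flow, under assumptions the article does not spell out (inner-product similarity, one thread per candidate, top-K selection finished on the host): the CPU-produced candidate slot list is copied to the device, the kernel gathers each candidate's vector and scores it against the query, and the scores come back for final ranking.

```cpp
// Illustrative pre-filter scoring path (assumed similarity = inner product).
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// One thread scores one candidate: gather its row from the resident vector matrix
// and compute the dot product against the query vector held in device memory.
__global__ void score_candidates(const float* __restrict__ vectors, int dim,
                                 const uint32_t* __restrict__ cand_slots, int num_cands,
                                 const float* __restrict__ query,
                                 float* __restrict__ scores) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_cands) return;
    const float* v = vectors + (size_t)cand_slots[i] * dim;
    float s = 0.f;
    for (int j = 0; j < dim; ++j) s += v[j] * query[j];
    scores[i] = s;
}

// Host side: upload the CPU-filtered candidate list, score on GPU, take top-K on CPU.
std::vector<uint32_t> gpu_topk(const float* d_vectors, int dim,
                               const std::vector<uint32_t>& cand_slots,
                               const std::vector<float>& query, int k) {
    int n = (int)cand_slots.size();
    uint32_t* d_cands; float* d_query; float* d_scores;
    cudaMalloc((void**)&d_cands, n * sizeof(uint32_t));
    cudaMalloc((void**)&d_query, dim * sizeof(float));
    cudaMalloc((void**)&d_scores, n * sizeof(float));
    cudaMemcpy(d_cands, cand_slots.data(), n * sizeof(uint32_t), cudaMemcpyHostToDevice);
    cudaMemcpy(d_query, query.data(), dim * sizeof(float), cudaMemcpyHostToDevice);

    int block = 256, grid = (n + block - 1) / block;
    score_candidates<<<grid, block>>>(d_vectors, dim, d_cands, n, d_query, d_scores);

    std::vector<float> scores(n);
    cudaMemcpy(scores.data(), d_scores, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_cands); cudaFree(d_query); cudaFree(d_scores);

    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    int keep = std::min(k, n);
    std::partial_sort(order.begin(), order.begin() + keep, order.end(),
                      [&](int a, int b) { return scores[a] > scores[b]; });
    std::vector<uint32_t> top;
    for (int i = 0; i < keep; ++i) top.push_back(cand_slots[order[i]]);
    return top;
}
```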
IVF Approximate Search
For scenarios that can tolerate lower recall, IVF clusters the vectors, builds per-cluster inverted indexes, and limits similarity computation to the nearest N clusters, reducing the FLOPs per query.
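A host-side sketch of that cluster pruning with assumed structures (offline-learned centroids, one posting list per cluster): the query is compared only against centroids, and the candidate set is the union of the nearest nprobe clusters' posting lists, optionally intersected with the scalar filter before GPU scoring.

```cpp
// Illustrative IVF probing: rank centroids, keep the nearest nprobe clusters' members.
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

std::vector<uint32_t> ivf_candidates(const std::vector<float>& query,                  // [dim]
                                     const std::vector<std::vector<float>>& centroids,  // [nlist][dim]
                                     const std::vector<std::vector<uint32_t>>& lists,   // posting list per cluster
                                     int nprobe,
                                     const std::function<bool(uint32_t)>& passes_filter) {
    // Rank clusters by distance from the query to their centroid.
    std::vector<std::pair<float, int>> order;
    for (int c = 0; c < (int)centroids.size(); ++c) {
        float d = 0.f;
        for (size_t j = 0; j < query.size(); ++j) {
            float diff = query[j] - centroids[c][j];
            d += diff * diff;
        }
        order.push_back({d, c});
    }
    int keep = std::min<int>(nprobe, (int)order.size());
    std::partial_sort(order.begin(), order.begin() + keep, order.end());

    // Union the nearest clusters' posting lists, applying the scalar filter up front.
    std::vector<uint32_t> candidates;
    for (int i = 0; i < keep; ++i)
        for (uint32_t id : lists[order[i].second])
            if (passes_filter(id)) candidates.push_back(id);
    return candidates;   // these slots are what the GPU kernel actually scores
}
```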
Performance Optimizations
High concurrency via CUDA streams: GPU utilization reaches 100% under load.
Partial scalar filtering on GPU: scalar predicates are executed in the same kernel as the vector distance computation, exploiting GPU parallelism.
Handle-based resource pool: pre-allocated GPU buffers and streams are reused across requests, preventing resource exhaustion.
Multi-GPU parallelism: data is sharded across GPUs within a single server, avoiding the network overhead of multi-node setups.
FP16 precision: switching from FP32 to FP16 halves VRAM usage; recall drops only from 100% to 99.4% while enabling ~10 billion vectors on a single machine. (A simplified sketch combining several of these ideas follows this list.)
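The sketch below is an assumption-laden illustration, not the production kernel: vectors are stored as __half, the scalar predicate is evaluated inside the distance kernel so filtered items never produce a valid score, and launches go out on CUDA streams taken from a small pre-allocated pool so concurrent requests overlap on the device.

```cpp
// Simplified sketch: FP16 storage, fused scalar predicate + distance, pooled streams.
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

// Each thread checks the item's scalar attribute first; only survivors compute a score,
// the rest write a large negative value so they can never enter the top-K.
__global__ void fused_filter_score(const __half* __restrict__ vectors, int dim,
                                   const int32_t* __restrict__ attr, int32_t wanted_attr,
                                   const __half* __restrict__ query,
                                   float* __restrict__ scores, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (attr[i] != wanted_attr) { scores[i] = -3.4e38f; return; }   // scalar filter in-kernel
    const __half* v = vectors + (size_t)i * dim;
    float s = 0.f;
    for (int j = 0; j < dim; ++j) s += __half2float(v[j]) * __half2float(query[j]);
    scores[i] = s;
}

// Minimal handle-style stream pool: streams are created once and reused round-robin,
// so per-request allocation cannot exhaust GPU resources under high concurrency.
struct StreamPool {
    std::vector<cudaStream_t> streams;
    size_t next = 0;
    explicit StreamPool(int n) : streams(n) {
        for (auto& s : streams) cudaStreamCreate(&s);
    }
    cudaStream_t acquire() { return streams[next++ % streams.size()]; }
    ~StreamPool() {
        for (auto& s : streams) cudaStreamDestroy(s);
    }
};

void launch_request(StreamPool& pool, const __half* d_vectors, int dim, int n,
                    const int32_t* d_attr, int32_t wanted_attr,
                    const __half* d_query, float* d_scores) {
    cudaStream_t s = pool.acquire();                 // reuse a pre-allocated stream
    int block = 256, grid = (n + block - 1) / block;
    fused_filter_score<<<grid, block, 0, s>>>(d_vectors, dim, d_attr, wanted_attr,
                                              d_query, d_scores, n);
    // Top-K selection and result copy-back would also be enqueued on the same stream.
}
```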
Engineering Deployment
The system consists of online serving and offline data pipelines; the architecture comprises CPU-side scalar indexing, GPU-side vector storage, and handle management.
Results
After deployment on >100 million items, the GPU system achieved:
Recall improvement from 85% to 99.4%.
Latency reduction: Tp99 down 89%, Tp999 down 88%.
Benchmarks on a single A30 GPU confirmed that pre‑filtering outperforms post‑filtering for the same recall level.
Future Work
Support real‑time incremental indexing (currently T+1 batch builds).
Add HNSW support for low‑filter scenarios to boost performance.
Explore NPU and other emerging hardware accelerators.
Meituan Technology Team
Over 10,000 engineers power China's leading lifestyle-services e-commerce platform, supporting hundreds of millions of consumers and millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
