GPU-Accelerated Mixed Vector-Scalar Retrieval System for Meituan Takeaway Search
Meituan Waimai’s search team built a GPU‑accelerated, mixed vector‑and‑scalar retrieval engine that supports billions of items, achieving over 99% recall and up to 89% latency reduction by combining pre‑filtering, optimized data layouts, multi‑GPU parallelism, and FP16 precision.
Background
With the rise of big data and AI, vector search has become essential for recommendation, Q&A, and large‑language‑model applications. Traditional brute‑force search is infeasible at large scale, so Approximate Nearest Neighbor (ANN) methods such as HNSW, IVF, IVF‑PQ, and IVF‑PQ+Refine are adopted.
Evolution of Meituan Waimai Vector Indexes
HNSW
Hierarchical Navigable Small World builds a layered graph for efficient high-dimensional search, but its performance degrades under high filter ratios, which is problematic given the strong LBS (location-based) filtering in Meituan Waimai.
IVF
Inverted File partitions the vector space into clusters, reducing the number of vectors examined per query, but it requires keeping the full vector set in memory.
IVF‑PQ
Product Quantization splits each vector into sub-vectors and quantizes them against small codebooks, saving memory at the cost of some recall loss.
IVF‑PQ+Refine
Stores raw vectors on SSD, retrieves a larger candidate set with IVF-PQ, then re-ranks it against the original vectors to regain accuracy.
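As a rough, self-contained sketch of that refine step (illustrative only, not Meituan's code): approximate distances come from precomputed PQ codes via a per-query lookup table, an enlarged candidate set of k·expand items is kept, and those survivors are re-scored against the raw vectors. All names and shapes here (pq_codes, codebooks, raw_vectors) are assumptions.

```cpp
// Illustrative IVF-PQ+Refine sketch (assumed shapes, not the production code).
#include <algorithm>
#include <cstdint>
#include <vector>

// dim = M * sub_dim; each vector is stored as M uint8 codes.
// codebooks: [M][256][sub_dim] centroids learned offline.
std::vector<int> pq_refine_search(const std::vector<float>& query,          // [dim]
                                  const std::vector<uint8_t>& pq_codes,     // [n][M]
                                  const std::vector<float>& codebooks,      // [M][256][sub_dim]
                                  const std::vector<float>& raw_vectors,    // [n][dim], e.g. paged in from SSD
                                  int n, int M, int sub_dim, int k, int expand) {
    // 1. Per-query lookup table: distance from each query sub-vector to every codeword.
    std::vector<float> lut(M * 256);
    for (int m = 0; m < M; ++m)
        for (int c = 0; c < 256; ++c) {
            float d = 0.f;
            for (int j = 0; j < sub_dim; ++j) {
                float diff = query[m * sub_dim + j] - codebooks[(m * 256 + c) * (size_t)sub_dim + j];
                d += diff * diff;
            }
            lut[m * 256 + c] = d;
        }
    // 2. Approximate distance per item = sum of table lookups; keep the top (k * expand).
    std::vector<std::pair<float, int>> approx(n);
    for (int i = 0; i < n; ++i) {
        float d = 0.f;
        for (int m = 0; m < M; ++m) d += lut[m * 256 + pq_codes[i * (size_t)M + m]];
        approx[i] = {d, i};
    }
    int keep = std::min(n, k * expand);
    std::partial_sort(approx.begin(), approx.begin() + keep, approx.end());
    // 3. Refine: exact distances on the enlarged candidate set using raw vectors.
    std::vector<std::pair<float, int>> exact;
    for (int c = 0; c < keep; ++c) {
        int i = approx[c].second;
        float d = 0.f;
        for (int j = 0; j < M * sub_dim; ++j) {
            float diff = query[j] - raw_vectors[i * (size_t)M * sub_dim + j];
            d += diff * diff;
        }
        exact.push_back({d, i});
    }
    std::sort(exact.begin(), exact.end());
    std::vector<int> topk;
    for (int c = 0; c < std::min(k, (int)exact.size()); ++c) topk.push_back(exact[c].second);
    return topk;
}
```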
Geo‑aware Retrieval
Geohash encoding of merchant locations makes it possible to fold geographic distance into the similarity score, improving recall for location-based services.
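A minimal sketch of the geo side, assuming a standard geohash encoding and a simple linear blend of distance into the score (the actual weighting used by the team is not described here); geo_weight is a hypothetical tuning parameter.

```cpp
// Illustrative geohash encoding and distance-blended scoring (assumed formula).
#include <cmath>
#include <string>

std::string geohash_encode(double lat, double lng, int precision = 7) {
    static const char* base32 = "0123456789bcdefghjkmnpqrstuvwxyz";
    double lat_lo = -90.0, lat_hi = 90.0, lng_lo = -180.0, lng_hi = 180.0;
    std::string hash;
    int bit = 0, ch = 0;
    bool even = true;                          // even bit positions encode longitude
    while ((int)hash.size() < precision) {
        if (even) {
            double mid = (lng_lo + lng_hi) / 2;
            if (lng > mid) { ch = (ch << 1) | 1; lng_lo = mid; }
            else           { ch = (ch << 1);     lng_hi = mid; }
        } else {
            double mid = (lat_lo + lat_hi) / 2;
            if (lat > mid) { ch = (ch << 1) | 1; lat_lo = mid; }
            else           { ch = (ch << 1);     lat_hi = mid; }
        }
        even = !even;
        if (++bit == 5) { hash += base32[ch]; bit = 0; ch = 0; }
    }
    return hash;
}

// Blend geographic distance (km) into the vector similarity so nearby merchants rank higher.
// geo_weight is an assumed tuning parameter, not a value from the article.
double blended_score(double cosine_sim, double distance_km, double geo_weight = 0.1) {
    return cosine_sim - geo_weight * distance_km;
}
```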
Goals and Challenges
Support mixed vector + scalar retrieval with high filter ratios (>99%).
Maintain recall ≥95% while handling candidate sets of >1 billion vectors.
Keep 99th‑percentile latency (Tp99) under 20 ms.
Scale to >100 million candidates.
Existing GPU‑based solutions (Faiss, Milvus) excel at pure vector search but lack mixed retrieval capabilities, prompting a custom design.
Solution Exploration
Pre‑filter vs. Post‑filter
Pre‑filter applies scalar filters first, then vector search; post‑filter does the opposite, requiring an expansion factor N to compensate for filtered‑out results.
Post-filter was initially favored for its low development risk, but benchmarks showed insufficient recall for Flat, IVF, and IVF-PQ indexes without large expansion factors.
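To make the contrast concrete, here is a toy host-side sketch; the helper callbacks score and passes_filter are assumed, and a real system would use an ANN index rather than exhaustive scoring. Pre-filtering ranks only items that pass the scalar predicate, while post-filtering ranks everything, fetches an expanded top-(K·N), and may still come back short of K under high filter ratios.

```cpp
// Toy contrast between pre-filter and post-filter retrieval (illustrative only).
#include <algorithm>
#include <functional>
#include <vector>

using Scored = std::pair<float, int>;  // (similarity, item id)

// Assumed helpers: score(id) returns similarity to the query; passes_filter(id) is the scalar predicate.
std::vector<int> pre_filter_search(int n, int k,
                                   const std::function<float(int)>& score,
                                   const std::function<bool(int)>& passes_filter) {
    std::vector<Scored> cands;
    for (int id = 0; id < n; ++id)
        if (passes_filter(id)) cands.push_back({score(id), id});       // filter first
    std::sort(cands.rbegin(), cands.rend());                           // then rank survivors
    std::vector<int> out;
    for (int i = 0; i < std::min<int>(k, (int)cands.size()); ++i) out.push_back(cands[i].second);
    return out;
}

std::vector<int> post_filter_search(int n, int k, int expand,
                                    const std::function<float(int)>& score,
                                    const std::function<bool(int)>& passes_filter) {
    std::vector<Scored> all(n);
    for (int id = 0; id < n; ++id) all[id] = {score(id), id};          // rank everything
    int fetch = std::min(n, k * expand);                               // expansion factor N
    std::partial_sort(all.begin(), all.begin() + fetch, all.end(), std::greater<Scored>());
    std::vector<int> out;
    for (int i = 0; i < fetch && (int)out.size() < k; ++i)
        if (passes_filter(all[i].second)) out.push_back(all[i].second); // may still fall short of k
    return out;
}
```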
Pre‑filter Implementation Options
Store all raw data in GPU memory and perform both scalar filtering and vector computation on GPU – rejected due to high memory consumption and poor filter performance.
Keep all data in CPU memory, perform scalar filtering on CPU, then copy filtered vectors to GPU – rejected because data transfer latency (PCIe Gen4 64 GB/s) could not meet the latency target.
Store vectors in GPU memory, scalar attributes in CPU memory; perform scalar filtering on CPU, then pass index lists to GPU for vector computation – selected.
GPU Vector Retrieval System
Data Layout
Scalar fields are column‑stored in CPU memory with inverted indexes for fast filtering. Vectors reside in GPU VRAM, linked to CPU positions via index tables.
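A skeletal view of that layout, with assumed type and field names (ScalarColumn, GpuVectorStore, id_to_slot): scalar columns and their inverted indexes stay in host memory, the embedding matrix stays resident in device memory, and an id-to-slot table joins the two.

```cpp
// Skeletal data layout (assumed names; not the production schema).
#include <cuda_runtime.h>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// CPU side: column-stored scalar fields with inverted (value -> posting list) indexes.
struct ScalarColumn {
    std::vector<int64_t> values;                                   // row-aligned column store
    std::unordered_map<int64_t, std::vector<uint32_t>> inverted;   // value -> item ids
};

// GPU side: one dense row-major matrix of embeddings kept resident in VRAM.
struct GpuVectorStore {
    float* d_vectors = nullptr;   // [num_items x dim], device memory
    int dim = 0;
    size_t num_items = 0;

    void load(const std::vector<float>& host_vectors, int dim_) {
        dim = dim_;
        num_items = host_vectors.size() / dim_;
        cudaMalloc((void**)&d_vectors, host_vectors.size() * sizeof(float));
        cudaMemcpy(d_vectors, host_vectors.data(),
                   host_vectors.size() * sizeof(float), cudaMemcpyHostToDevice);
    }
};

// Index table linking CPU item ids to GPU row slots (identity if rows are loaded in id order).
struct HybridIndex {
    std::unordered_map<std::string, ScalarColumn> columns;   // CPU: filtering
    GpuVectorStore vectors;                                   // GPU: similarity
    std::vector<uint32_t> id_to_slot;                         // CPU id -> GPU row
};
```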
Retrieval Flow (Pre‑filter)
Scalar Filtering: CPU uses inverted indexes to produce a list of candidate IDs.
Similarity Computation: GPU reads the corresponding vectors and computes distances, returning the top-K IDs.
Result Assembly: CPU fetches full records for the top-K IDs.
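A minimal CUDA sketch of the GPU half of this flow, under assumptions the article does not spell out (inner-product similarity, one thread per candidate, top-K selection finished on the host): the CPU-produced candidate slot list is copied to the device, the kernel gathers each candidate's vector and scores it against the query, and the scores come back for final ranking.

```cpp
// Illustrative pre-filter scoring path (assumed similarity = inner product).
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// One thread scores one candidate: gather its row from the resident vector matrix
// and compute the dot product against the query vector held in device memory.
__global__ void score_candidates(const float* __restrict__ vectors, int dim,
                                 const uint32_t* __restrict__ cand_slots, int num_cands,
                                 const float* __restrict__ query,
                                 float* __restrict__ scores) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_cands) return;
    const float* v = vectors + (size_t)cand_slots[i] * dim;
    float s = 0.f;
    for (int j = 0; j < dim; ++j) s += v[j] * query[j];
    scores[i] = s;
}

// Host side: upload the CPU-filtered candidate list, score on GPU, take top-K on CPU.
std::vector<uint32_t> gpu_topk(const float* d_vectors, int dim,
                               const std::vector<uint32_t>& cand_slots,
                               const std::vector<float>& query, int k) {
    int n = (int)cand_slots.size();
    uint32_t* d_cands; float* d_query; float* d_scores;
    cudaMalloc((void**)&d_cands, n * sizeof(uint32_t));
    cudaMalloc((void**)&d_query, dim * sizeof(float));
    cudaMalloc((void**)&d_scores, n * sizeof(float));
    cudaMemcpy(d_cands, cand_slots.data(), n * sizeof(uint32_t), cudaMemcpyHostToDevice);
    cudaMemcpy(d_query, query.data(), dim * sizeof(float), cudaMemcpyHostToDevice);

    int block = 256, grid = (n + block - 1) / block;
    score_candidates<<<grid, block>>>(d_vectors, dim, d_cands, n, d_query, d_scores);

    std::vector<float> scores(n);
    cudaMemcpy(scores.data(), d_scores, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_cands); cudaFree(d_query); cudaFree(d_scores);

    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    int keep = std::min(k, n);
    std::partial_sort(order.begin(), order.begin() + keep, order.end(),
                      [&](int a, int b) { return scores[a] > scores[b]; });
    std::vector<uint32_t> top;
    for (int i = 0; i < keep; ++i) top.push_back(cand_slots[order[i]]);
    return top;
}
```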
IVF Approximate Search
For scenarios that can tolerate lower recall, IVF clusters the vectors, builds per-cluster inverted indexes, and limits similarity computation to the nearest N clusters, reducing the FLOPs per query.
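A host-side sketch of that cluster pruning with assumed structures (offline-learned centroids, one posting list per cluster): the query is compared only against centroids, and the candidate set is the union of the nearest nprobe clusters' posting lists, optionally intersected with the scalar filter before GPU scoring.

```cpp
// Illustrative IVF probing: rank centroids, keep the nearest nprobe clusters' members.
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

std::vector<uint32_t> ivf_candidates(const std::vector<float>& query,                  // [dim]
                                     const std::vector<std::vector<float>>& centroids,  // [nlist][dim]
                                     const std::vector<std::vector<uint32_t>>& lists,   // posting list per cluster
                                     int nprobe,
                                     const std::function<bool(uint32_t)>& passes_filter) {
    // Rank clusters by distance from the query to their centroid.
    std::vector<std::pair<float, int>> order;
    for (int c = 0; c < (int)centroids.size(); ++c) {
        float d = 0.f;
        for (size_t j = 0; j < query.size(); ++j) {
            float diff = query[j] - centroids[c][j];
            d += diff * diff;
        }
        order.push_back({d, c});
    }
    int keep = std::min<int>(nprobe, (int)order.size());
    std::partial_sort(order.begin(), order.begin() + keep, order.end());

    // Union the nearest clusters' posting lists, applying the scalar filter up front.
    std::vector<uint32_t> candidates;
    for (int i = 0; i < keep; ++i)
        for (uint32_t id : lists[order[i].second])
            if (passes_filter(id)) candidates.push_back(id);
    return candidates;   // these slots are what the GPU kernel actually scores
}
```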
Performance Optimizations
High concurrency via CUDA streams: GPU utilization reaches 100% under load.
Partial scalar filtering on GPU: scalar predicates are executed in the same kernel as the vector distance computation, exploiting GPU parallelism.
Handle-based resource pool: pre-allocated GPU buffers and streams are reused across requests, preventing resource exhaustion.
Multi-GPU parallelism: data is sharded across GPUs within a single server, avoiding the network overhead of multi-node setups.
FP16 precision: switching from FP32 to FP16 halves VRAM usage; recall drops only from 100% to 99.4% while enabling ~10 billion vectors on a single machine. (A simplified sketch combining several of these ideas follows this list.)
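The sketch below is an assumption-laden illustration, not the production kernel: vectors are stored as __half, the scalar predicate is evaluated inside the distance kernel so filtered items never produce a valid score, and launches go out on CUDA streams taken from a small pre-allocated pool so concurrent requests overlap on the device.

```cpp
// Simplified sketch: FP16 storage, fused scalar predicate + distance, pooled streams.
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

// Each thread checks the item's scalar attribute first; only survivors compute a score,
// the rest write a large negative value so they can never enter the top-K.
__global__ void fused_filter_score(const __half* __restrict__ vectors, int dim,
                                   const int32_t* __restrict__ attr, int32_t wanted_attr,
                                   const __half* __restrict__ query,
                                   float* __restrict__ scores, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (attr[i] != wanted_attr) { scores[i] = -3.4e38f; return; }   // scalar filter in-kernel
    const __half* v = vectors + (size_t)i * dim;
    float s = 0.f;
    for (int j = 0; j < dim; ++j) s += __half2float(v[j]) * __half2float(query[j]);
    scores[i] = s;
}

// Minimal handle-style stream pool: streams are created once and reused round-robin,
// so per-request allocation cannot exhaust GPU resources under high concurrency.
struct StreamPool {
    std::vector<cudaStream_t> streams;
    size_t next = 0;
    explicit StreamPool(int n) : streams(n) {
        for (auto& s : streams) cudaStreamCreate(&s);
    }
    cudaStream_t acquire() { return streams[next++ % streams.size()]; }
    ~StreamPool() {
        for (auto& s : streams) cudaStreamDestroy(s);
    }
};

void launch_request(StreamPool& pool, const __half* d_vectors, int dim, int n,
                    const int32_t* d_attr, int32_t wanted_attr,
                    const __half* d_query, float* d_scores) {
    cudaStream_t s = pool.acquire();                 // reuse a pre-allocated stream
    int block = 256, grid = (n + block - 1) / block;
    fused_filter_score<<<grid, block, 0, s>>>(d_vectors, dim, d_attr, wanted_attr,
                                              d_query, d_scores, n);
    // Top-K selection and result copy-back would also be enqueued on the same stream.
}
```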
Engineering Deployment
The system consists of online serving and offline data pipelines; the architecture comprises CPU-side scalar indexing, GPU-side vector storage, and handle management.
Results
After deployment on >100 million items, the GPU system achieved:
Recall improvement from 85% to 99.4%.
Latency reduction: Tp99 down 89%, Tp999 down 88%.
Benchmarks on a single A30 GPU confirmed that pre‑filtering outperforms post‑filtering for the same recall level.
Future Work
Support real‑time incremental indexing (currently T+1 batch builds).
Add HNSW support for low‑filter scenarios to boost performance.
Explore NPU and other emerging hardware accelerators.
Meituan Technology Team
Over 10,000 engineers power China's leading lifestyle-services e-commerce platform, supporting hundreds of millions of consumers and millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
