How Baidu’s GNOIMI Powers Billion‑Scale Rich Media Retrieval
Baidu’s rich‑media retrieval system combines CNN‑based feature extraction with an Approximate Nearest Neighbor (ANN) engine called GNOIMI. Hierarchical clustering, product quantization, and optimized indexing give it millisecond‑level search over billions of images, videos and audio clips, supporting anti‑spam, recommendation and risk control across dozens of services.
Background
Rich media (images, video, audio) has become the dominant carrier of online information. CNN‑based feature extractors map multimedia content to high‑dimensional vectors, so that similar items can be found by comparing vectors under Euclidean or cosine distance. Because the data volume reaches billions of items, exact brute‑force search is infeasible, so Approximate Nearest Neighbor (ANN) techniques are employed.
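To make the two distance measures and the cost of exact search concrete, here is a minimal Python sketch. The function names and the toy vectors are illustrative assumptions, not part of Baidu's system; real CNN features have hundreds of dimensions rather than two.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def brute_force_top_k(query, vectors, k, distance=l2_distance):
    """Exact search: scores every vector, so cost grows linearly with the dataset."""
    scored = sorted((distance(query, v), i) for i, v in enumerate(vectors))
    return [(i, d) for d, i in scored[:k]]

# Toy 2-D "features" (hypothetical data).
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.05]
nearest = brute_force_top_k(query, vectors, k=2)
```

Every query touches every stored vector, which is the linear cost that ANN methods are designed to avoid.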
System Architecture
The Baidu rich‑media retrieval system consists of offline ANN training & indexing and online feature extraction & search. Core services are:
cnn-service : extracts CNN features for images, videos and audio.
feature-service : provides a unified feature‑extraction API and caches heterogeneous CNN outputs.
vs-image : obtains query features from feature‑service, calls the ANN service, and performs video/audio‑level verification.
bs : ANN index service that returns top‑k candidates and performs visual re‑ranking; supports automatic shard updates and scaling.
as : merges results from multiple shards and heterogeneous indexes.
finger-builder : ingestion entry that extracts CNN features and writes them to storage.
index-builder : periodically builds ANN indexes.
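The merging role described for the as service can be illustrated with a k-way merge of per-shard candidate lists. This is only a sketch of the general idea: the function name, the (distance, item_id) tuple layout, and the de-duplication rule are assumptions, since the source does not describe the actual interface.

```python
import heapq

def merge_shard_results(shard_results, k):
    """Merge per-shard candidate lists into one global top-k.

    Each shard list holds (distance, item_id) tuples sorted by ascending distance.
    """
    merged = heapq.merge(*shard_results)  # lazy k-way merge of sorted inputs
    top_k = []
    seen = set()
    for distance, item_id in merged:
        if item_id in seen:  # the same item may be returned by more than one index
            continue
        seen.add(item_id)
        top_k.append((item_id, distance))
        if len(top_k) == k:
            break
    return top_k

# Hypothetical results from two shards.
shard_a = [(0.10, "video_17"), (0.42, "video_03")]
shard_b = [(0.05, "video_88"), (0.10, "video_17"), (0.90, "video_51")]
merged = merge_shard_results([shard_a, shard_b], k=3)
```

Because each shard already returns its candidates sorted, the merge stops as soon as k distinct items have been collected.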
Key Technologies
Approximate Nearest Neighbor (ANN)
ANN reduces search complexity by partitioning the vector space into many sub‑spaces and limiting traversal to a few of them, achieving sub‑linear time. Common families include:
Tree‑based: KD‑Tree, Annoy
Hash‑based: LSH, PCA‑H
Vector‑quantization: PQ, OPQ
Inverted‑index: IVF, IMI, GNO‑IMI
Graph‑based: NSW, HNSW, NSG
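The partition-and-probe idea behind the inverted-index family can be sketched in a few lines of Python. The centroids are assumed to be pre-trained, and the names (build_inverted_index, ann_search, probes) are hypothetical; this is a toy IVF-style illustration, not GNOIMI itself.

```python
import math
from collections import defaultdict

def nearest_cells(vector, centroids, count):
    """Return the indices of the `count` centroids closest to `vector`."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(vector, centroids[c]))
    return order[:count]

def build_inverted_index(vectors, centroids):
    """Map each cell (centroid index) to the ids of the vectors that fall in it."""
    index = defaultdict(list)
    for vec_id, vec in enumerate(vectors):
        index[nearest_cells(vec, centroids, 1)[0]].append(vec_id)
    return index

def ann_search(query, vectors, centroids, index, probes, k):
    """Scan only the `probes` closest cells instead of the whole dataset."""
    candidates = [vid for cell in nearest_cells(query, centroids, probes)
                  for vid in index[cell]]
    candidates.sort(key=lambda vid: math.dist(query, vectors[vid]))
    return candidates[:k], len(candidates)

centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]]  # assumed pre-trained
vectors = [[0.5, 0.2], [9.5, 0.3], [0.1, 9.8], [9.9, 9.7], [1.0, 1.0], [8.8, 1.1]]
index = build_inverted_index(vectors, centroids)
top, scanned = ann_search([9.0, 0.5], vectors, centroids, index, probes=1, k=2)
```

With one probe, only two of the six vectors are scored. Raising probes trades speed for recall, because a true neighbor may sit just across a cell boundary.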
GNOIMI (Generalized Non‑Orthogonal Inverted Multi‑Index)
GNOIMI is Baidu’s internally developed ANN engine. It performs a two‑level hierarchical clustering: first‑level centroids define coarse cells, and a shared set of second‑level centroids (codebooks) refines each cell. The non‑orthogonal design yields flexible cell shapes that adapt to data density, dramatically reducing memory while preserving recall.
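A rough back-of-the-envelope calculation shows why sharing the second-level centroids matters for memory. The sizes below (4096 centroids per level, 128-dimensional float32 vectors) are assumed purely for illustration and are not figures from the source.

```python
def multi_index_footprint(first_level, second_level, dim, bytes_per_float=4):
    """Compare shared two-level codebooks with one explicit centroid per cell."""
    cells = first_level * second_level
    # Shared design: store the two centroid tables once.
    shared_bytes = (first_level + second_level) * dim * bytes_per_float
    # Naive alternative: store a full centroid for every cell.
    explicit_bytes = cells * dim * bytes_per_float
    return cells, shared_bytes, explicit_bytes

# Assumed sizes, for illustration only.
cells, shared_bytes, explicit_bytes = multi_index_footprint(4096, 4096, 128)
```

Under these assumptions the index addresses about 16.8 million cells while storing only 4 MiB of centroids, whereas one explicit centroid per cell would take 8 GiB.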
Training Pipeline
Randomly sample up to 5 million distinct vectors from the raw dataset.
Run K‑means to obtain first‑level centroids.
Compute residuals of the samples with respect to their first‑level centroids and cluster the residuals again to obtain second‑level centroids (codebooks).
Assign each sample to a cell defined by its first‑ and second‑level centroids.
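The four training steps can be sketched with a plain k-means in pure Python. This is a simplified illustration under stated assumptions: the function names, the toy data, and the use of basic Lloyd iterations are all hypothetical, and the real GNOIMI training is more involved than clustering raw residuals once.

```python
import math
import random

def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

def kmeans(points, k, iterations=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def train_two_level(dataset, sample_size, k1, k2, seed=0):
    rng = random.Random(seed)
    # Step 1: sample distinct vectors.
    unique = sorted({tuple(v) for v in dataset})
    sample = rng.sample(unique, min(sample_size, len(unique)))
    # Step 2: first-level centroids.
    coarse = kmeans(sample, k1, seed=seed)
    # Step 3: residuals against the assigned first-level centroid, clustered again.
    residuals = []
    for v in sample:
        c = coarse[nearest(v, coarse)]
        residuals.append(tuple(x - y for x, y in zip(v, c)))
    fine = kmeans(residuals, k2, seed=seed)
    # Step 4: each sample's cell is its (first-level, second-level) pair.
    cells = [(nearest(v, coarse), nearest(r, fine)) for v, r in zip(sample, residuals)]
    return coarse, fine, cells

# Hypothetical 2-D data drawn around three centers.
rng = random.Random(1)
data = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
        for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(40)]
coarse, fine, cells = train_two_level(data, sample_size=100, k1=3, k2=4)
```

Clustering residuals rather than raw vectors is what lets a single second-level codebook be reused under every first-level centroid.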
Index Construction
For every raw feature vector, compute its cell assignment (first‑ and second‑level centroids).
Quantize residual vectors using Product Quantization (PQ): split each residual into sub‑vectors, map each sub‑vector to the nearest entry in that sub‑space's codebook, and store only the entry IDs (byte‑level) instead of full floats.
Persist only these IDs; the original high‑dimensional vectors are never kept in memory.
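The PQ step can be illustrated as follows. The codebooks here are tiny hand-written tables assumed to be pre-trained, and the function names are hypothetical; a production system would typically use up to 256 entries per sub-space so that each ID fits in one byte.

```python
import math

def pq_encode(vector, codebooks):
    """Encode `vector` as one codebook-entry id per sub-space."""
    sub_dim = len(vector) // len(codebooks)
    codes = []
    for s, codebook in enumerate(codebooks):
        sub = vector[s * sub_dim:(s + 1) * sub_dim]
        codes.append(min(range(len(codebook)),
                         key=lambda j: math.dist(sub, codebook[j])))
    return bytes(codes)  # one byte per sub-space, so at most 256 entries each

def pq_decode(codes, codebooks):
    """Rebuild an approximation of the vector from its ids."""
    out = []
    for code, codebook in zip(codes, codebooks):
        out.extend(codebook[code])
    return out

# Two sub-spaces of dimension 2, three entries each (assumed pre-trained).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.5, -0.5], [2.0, 2.0]],
]
residual = [0.9, 1.2, 0.4, -0.6]
codes = pq_encode(residual, codebooks)   # 2 bytes instead of 4 floats
approx = pq_decode(codes, codebooks)
```

The decoded vector is only an approximation of the residual, which is the price paid for shrinking 16 bytes of float32 data to 2 bytes in this toy case.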
Search Procedure
Normalize the query feature.
Compute distances between the query and all first‑level centroids; sort them.
For the top gnoimi_search_cells first‑level centroids, compute distances to their associated second‑level centroids (total gnoimi_search_cells * gnoimi_fine_cells_count candidates) and sort.
Use a priority queue to traverse the second‑level centroids in order of increasing distance, retrieve the samples belonging to each centroid, and compute their distances to the query from the stored codes until neighbors_count results are collected.
Return the top‑K samples with their distances.
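The five search steps can be sketched as below. Several simplifications are assumptions made for brevity: a cell center is taken to be the first-level centroid plus the second-level centroid, the inverted lists hold plain vectors rather than PQ codes, and the data is a hand-built toy. Only gnoimi_search_cells and neighbors_count are names from the text; everything else is hypothetical.

```python
import heapq
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def search(query, coarse, fine, cells, gnoimi_search_cells, neighbors_count, top_k):
    # Step 1: normalize the query.
    q = normalize(query)
    # Step 2: rank all first-level centroids by distance to the query.
    ranked = sorted(range(len(coarse)), key=lambda i: math.dist(q, coarse[i]))
    # Step 3: score every second-level centroid under the best first-level ones.
    queue = []
    for i in ranked[:gnoimi_search_cells]:
        for j, f in enumerate(fine):
            center = [c + r for c, r in zip(coarse[i], f)]
            heapq.heappush(queue, (math.dist(q, center), i, j))
    # Step 4: visit cells closest-first until enough candidates are gathered.
    found = []
    while queue and len(found) < neighbors_count:
        _, i, j = heapq.heappop(queue)
        for item_id, vec in cells.get((i, j), []):
            found.append((math.dist(q, vec), item_id))
    # Step 5: return the best top_k candidates with their distances.
    found.sort()
    return [(item_id, d) for d, item_id in found[:top_k]]

coarse = [[1.0, 0.0], [0.0, 1.0]]
fine = [[0.0, 0.1], [0.0, -0.1]]
cells = {  # inverted lists keyed by (first-level, second-level) ids
    (0, 0): [("a", [0.99, 0.12])],
    (0, 1): [("b", [0.99, -0.12])],
    (1, 0): [("c", [0.05, 1.0])],
}
results = search([2.0, 0.3], coarse, fine, cells,
                 gnoimi_search_cells=1, neighbors_count=2, top_k=2)
```

The heap ensures the most promising cells are scanned first, so the loop can stop early once neighbors_count candidates have been seen.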
Implementation Optimizations
Redesigned training pipeline that reorganizes first‑ and second‑level centroids, boosting training speed by ~10× with a slight recall gain.
Exploited the triangle inequality in L2 and cosine spaces to prune distance calculations during indexing, cutting computation by >95% and accelerating index building by >5.5×.
Reduced memory consumption to ~10% of Faiss‑IVF* and nmslib‑HNSW.
Optimized cell‑level distance computation and PQ quantization, keeping per‑query latency under 2 ms for millions of centroids and improving overall throughput by >30%.
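The source does not say exactly how the triangle inequality is applied, so the following is one well-known way to use it when assigning a vector to its nearest centroid, offered as an assumption rather than Baidu's method. If the current best centroid is at distance d from the vector, any centroid at least 2d away from that best centroid cannot be closer, so its distance need not be computed.

```python
import math

def assign_with_pruning(vector, centroids, centroid_dists):
    """Nearest-centroid assignment that skips provably worse centroids.

    centroid_dists[i][j] is the precomputed distance between centroids i and j.
    Returns (best centroid index, number of vector-to-centroid distances computed).
    """
    best = 0
    best_dist = math.dist(vector, centroids[0])
    computed = 1
    for j in range(1, len(centroids)):
        # Triangle inequality: d(x, c_j) >= d(c_best, c_j) - d(x, c_best).
        # If d(c_best, c_j) >= 2 * d(x, c_best), then c_j cannot beat c_best.
        if centroid_dists[best][j] >= 2.0 * best_dist:
            continue
        d = math.dist(vector, centroids[j])
        computed += 1
        if d < best_dist:
            best, best_dist = j, d
    return best, computed

# Hypothetical centroids; the pairwise table is computed once and reused.
centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0], [0.5, 0.5]]
centroid_dists = [[math.dist(a, b) for b in centroids] for a in centroids]
cell, computed = assign_with_pruning([0.4, 0.3], centroids, centroid_dists)
```

In this toy case only two of the five distances are computed. The pairwise centroid table costs extra memory, but it is built once and amortized over every vector being indexed.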
HNSW Comparison
HNSW (Hierarchical Navigable Small World) is a graph‑based ANN algorithm. Baidu’s optimized HNSW implementation, built on the open‑source version, is reported to achieve ~3.6× higher performance. Benchmark results under the same query load:
Baseline open‑source HNSW: 900 QPS, 25 CPU cores, 66 GB memory.
Optimized version: 900 QPS, 1.6 CPU cores, 80 GB memory.
Application Scenarios
B‑side anti‑spam : full coverage of uploaded and crawled videos, filtering >60% duplicate videos daily with high accuracy.
C‑side deduplication : improves user experience and protects original content.
Related recommendation : links short clips to full‑length videos, enriching recommendation pools.
Risk control : identifies and blocks politically sensitive or pornographic content.
Summary
The system combines CNN feature extraction, hierarchical ANN via GNOIMI, product quantization, and extensive engineering optimizations to deliver high recall, millisecond‑level latency, and low memory usage at a scale of over a trillion feature vectors. It powers anti‑spam, deduplication, recommendation, and risk‑control services across Baidu’s product ecosystem.