How Baidu’s GNOIMI Powers Billion‑Scale Rich Media Retrieval

Baidu’s rich‑media retrieval system pairs CNN‑based feature extraction with GNOIMI, an Approximate Nearest Neighbor (ANN) engine. GNOIMI uses hierarchical clustering, product quantization, and optimized indexing to deliver millisecond‑level search over billions of images, videos, and audio clips, supporting anti‑spam, recommendation, and risk control across dozens of services.

Baidu Geek Talk

May 10, 2021

Background

Rich media (images, video, audio) has become the dominant carrier of online information. CNN‑based feature extraction transforms multimedia content into high‑dimensional vectors, so that similar items can be found by comparing vectors with Euclidean or cosine distance. Because the data volume reaches billions of items, exact brute‑force search is infeasible, so Approximate Nearest Neighbor (ANN) techniques are employed.
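The two distance measures mentioned above are closely related once vectors are normalized to unit length. The following minimal sketch shows the relationship; the function names and sample vectors are illustrative and do not come from Baidu's code.

```python
import math

def l2_normalize(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance between a and b."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Made-up 4-d "features"; real CNN features have hundreds of dimensions.
query = [0.2, 0.9, 0.1, 0.4]
item = [0.1, 0.8, 0.3, 0.5]

q, x = l2_normalize(query), l2_normalize(item)
# For unit-length vectors: ||q - x||^2 = 2 - 2 * cos(q, x),
# so ranking by L2 distance and ranking by cosine similarity agree.
identity_gap = abs(squared_l2(q, x) - (2 - 2 * cosine_similarity(query, item)))
```

This is why an engine that normalizes its inputs can serve both metrics with a single L2 code path.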

System Architecture

The Baidu rich‑media retrieval system has an offline path (ANN training and indexing) and an online path (feature extraction and search). Core services are:

cnn-service : extracts CNN features for images, videos and audio.

feature-service : provides a unified feature‑extraction API and caches heterogeneous CNN outputs.

vs-image : obtains query features from feature‑service, calls the ANN service, and performs video/audio‑level verification.

bs : ANN index service that returns top‑k candidates and performs visual re‑ranking; supports automatic shard updates and scaling.

as : merges results from multiple shards and heterogeneous indexes.

finger-builder : ingestion entry that extracts CNN features and writes them to storage.

index-builder : periodically builds ANN indexes.
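As an illustration of the merge step performed by the as service, here is a minimal sketch of a k‑way merge of per‑shard top‑k lists. The function name, the (distance, item_id) tuple layout, and the de‑duplication of items returned by more than one index are assumptions made for this example, not details given in the article.

```python
import heapq

def merge_shard_results(shard_results, k):
    """Merge per-shard candidate lists into one global top-k.

    shard_results: list of lists of (distance, item_id) tuples,
                   each list already sorted by ascending distance.
    """
    top, seen = [], set()
    # heapq.merge lazily yields items from all sorted inputs in global order.
    for dist, item_id in heapq.merge(*shard_results):
        if item_id in seen:      # same item reported by two indexes
            continue
        seen.add(item_id)
        top.append((dist, item_id))
        if len(top) == k:
            break
    return top

# Hypothetical results from two shards.
shard_a = [(0.10, "vid-7"), (0.42, "vid-3")]
shard_b = [(0.05, "vid-9"), (0.10, "vid-7"), (0.77, "vid-1")]
merged = merge_shard_results([shard_a, shard_b], k=3)
# merged == [(0.05, "vid-9"), (0.10, "vid-7"), (0.42, "vid-3")]
```

Because each shard returns a sorted list, the merge stops as soon as k results are collected and never needs to sort the full union.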

Key Technologies

Approximate Nearest Neighbor (ANN)

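ANN reduces search cost by organizing the vectors so that a query examines only a small fraction of them, for example by partitioning the vector space into many cells and visiting only the few cells nearest the query. This gives sub‑linear search time at the price of occasionally missing a true neighbor. Common families include: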

Tree‑based: KD‑Tree, Annoy

Hash‑based: LSH, PCA‑H

Vector‑quantization: PQ, OPQ

Inverted‑index: IVF, IMI, GNO‑IMI

Graph‑based: NSW, HNSW, NSG

GNOIMI (Generalized Non‑Orthogonal Inverted Multi‑Index)

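GNOIMI is Baidu’s internally developed ANN engine. It performs a two‑level hierarchical clustering: first‑level centroids define coarse cells, and a single set of second‑level centroids (a codebook shared by all coarse cells) refines each one. Because the second‑level set is shared, K first‑level and L second‑level centroids describe K × L cells while only K + L centroids are stored. The non‑orthogonal design yields flexible cell shapes that adapt to data density, which reduces memory substantially while preserving recall.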

Training Pipeline

De‑duplicate the raw dataset and randomly sample up to 5 million vectors from it.

Run K‑means to obtain first‑level centroids.

Compute residuals of the samples with respect to their first‑level centroids and cluster the residuals again to obtain second‑level centroids (codebooks).

Assign each sample to a cell defined by its first‑ and second‑level centroids.
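The four training steps above can be sketched with a plain k‑means run twice, once on the samples and once on their residuals. This is a toy illustration under stated assumptions: the helper names, the tiny 2‑d dataset, and the fixed iteration count are invented here, and the sketch does not reproduce any joint optimization or reorganization of centroids that the production engine may perform.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(v, centroids):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda i: sq_dist(v, centroids[i]))

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster went empty
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

def train_two_level(vectors, n_coarse, n_fine, max_samples=5_000_000, seed=0):
    rng = random.Random(seed)
    # Step 1: de-duplicate, then sample at most max_samples vectors.
    unique = sorted({tuple(v) for v in vectors})
    sample = rng.sample(unique, min(max_samples, len(unique)))
    # Step 2: first-level centroids.
    coarse = kmeans(sample, n_coarse, seed=seed)
    assign1 = [nearest(v, coarse) for v in sample]
    # Step 3: cluster the residuals into one shared second-level codebook.
    residuals = [tuple(x - c for x, c in zip(v, coarse[a]))
                 for v, a in zip(sample, assign1)]
    fine = kmeans(residuals, n_fine, seed=seed)
    # Step 4: each sample's cell is (first-level id, second-level id).
    cells = [(a, nearest(r, fine)) for a, r in zip(assign1, residuals)]
    return coarse, fine, cells

# Toy data: three Gaussian blobs in 2-d.
rng = random.Random(1)
data = [[rng.gauss(cx, 0.1), rng.gauss(cy, 0.1)]
        for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(50)]
coarse, fine, cells = train_two_level(data, n_coarse=3, n_fine=4)
```

Here 3 + 4 stored centroids address up to 12 cells, which is the property that lets the real index describe a very large number of cells from a small set of centroids.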

Index Construction

For every raw feature vector, compute its cell assignment (first‑ and second‑level centroids).

Quantize residual vectors using Product Quantization (PQ): split each residual into sub‑vectors, one per sub‑space, map each sub‑vector to the nearest entry in that sub‑space’s codebook, and store only the entry IDs (byte‑level) instead of full floats.

Persist only the centroid IDs; the original high‑dimensional vectors are never kept in memory.
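To make the PQ step concrete, here is a minimal encoder and decoder. The codebooks are hand‑written for the example (in practice they would be trained, for instance with k‑means per sub‑space), and the function names are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(v, m):
    """Split v into m equal-length sub-vectors (len(v) must divide by m)."""
    d = len(v) // m
    return [v[i * d:(i + 1) * d] for i in range(m)]

def pq_encode(residual, codebooks):
    """Return one byte per sub-space: the ID of the nearest codebook entry.

    codebooks[s] is the list of centroids (at most 256) for sub-space s.
    """
    codes = bytearray()
    for sub, book in zip(split(residual, len(codebooks)), codebooks):
        codes.append(min(range(len(book)), key=lambda i: sq_dist(sub, book[i])))
    return bytes(codes)

def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the selected centroids."""
    out = []
    for c, book in zip(codes, codebooks):
        out.extend(book[c])
    return out

# Hand-made codebooks for a 4-d residual split into 2 sub-spaces.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],
]
residual = [0.9, 1.1, 0.4, 0.6]
codes = pq_encode(residual, codebooks)      # b'\x01\x02'
approx = pq_decode(codes, codebooks)        # [1.0, 1.0, 0.5, 0.5]
```

As a rough illustration of the saving (the numbers are examples, not Baidu's configuration): a 128‑dimensional float32 vector takes 512 bytes, while 16 one‑byte codes take 16 bytes, a 32× reduction before counting the shared codebooks.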

Search Procedure

Normalize the query feature.

Compute distances between the query and all first‑level centroids; sort them.

For the top gnoimi_search_cells first‑level centroids, compute distances to their associated second‑level centroids (total gnoimi_search_cells * gnoimi_fine_cells_count candidates) and sort.

Use a priority queue to traverse the second‑level centroids in order of increasing distance, retrieve the samples belonging to each one, and compute each sample’s distance to the query until neighbors_count results are collected.

Return the top‑K samples with their distances.
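The search steps above can be sketched end to end on a toy index. The parameter names gnoimi_search_cells and neighbors_count come from the article; everything else (function names, the dict‑based inverted lists, the tiny 2‑d data) is assumed for illustration. For brevity the sketch scores candidates against plain vectors held in a dict, whereas the index described in the article stores PQ codes instead.

```python
import heapq
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def search(query, coarse, fine, inverted_lists, vectors,
           gnoimi_search_cells, neighbors_count):
    q = normalize(query)                                    # step 1
    order = sorted(range(len(coarse)),                      # step 2
                   key=lambda i: sq_dist(q, coarse[i]))
    heap = []                                               # step 3
    for i in order[:gnoimi_search_cells]:
        residual = [a - b for a, b in zip(q, coarse[i])]
        for j, f in enumerate(fine):
            heapq.heappush(heap, (sq_dist(residual, f), i, j))
    candidates = []                                         # step 4
    while heap and len(candidates) < neighbors_count:
        _, i, j = heapq.heappop(heap)                       # nearest cell first
        for item_id in inverted_lists.get((i, j), []):
            candidates.append((sq_dist(q, vectors[item_id]), item_id))
    return sorted(candidates)[:neighbors_count]             # step 5

# Toy index: 2 first-level centroids, 3 shared second-level centroids.
coarse = [[1.0, 0.0], [0.0, 1.0]]
fine = [[0.0, 0.0], [-0.1, 0.3], [0.3, -0.1]]
vectors = {name: normalize(v) for name, v in {
    "a": [1.0, 0.05], "b": [0.9, 0.3], "c": [0.1, 1.0], "d": [0.3, 0.9]}.items()}

inverted_lists = {}
for item_id, v in vectors.items():
    i = min(range(len(coarse)), key=lambda c: sq_dist(v, coarse[c]))
    r = [a - b for a, b in zip(v, coarse[i])]
    j = min(range(len(fine)), key=lambda f: sq_dist(r, fine[f]))
    inverted_lists.setdefault((i, j), []).append(item_id)

result = search([1.0, 0.1], coarse, fine, inverted_lists, vectors,
                gnoimi_search_cells=1, neighbors_count=2)
result_ids = [item_id for _, item_id in result]             # ["a", "b"]
```

Raising gnoimi_search_cells widens the search to more coarse cells, trading latency for recall.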

Implementation Optimizations

Redesigned training pipeline that reorganizes first‑ and second‑level centroids, boosting training speed by ~10× with a slight recall gain.

Exploited the triangle inequality in L2 and cosine spaces to skip distance calculations that cannot change the result during indexing, cutting computation by >95% and accelerating index building by >5.5×.

Reduced memory consumption to ~10% of Faiss‑IVF* and nmslib‑HNSW.

Optimized cell‑level distance computation and PQ quantization, keeping per‑query latency under 2 ms for millions of centroids and improving overall throughput by >30%.
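One well‑known way to use the triangle inequality when assigning a vector to its nearest centroid is shown below. The article does not say which bound Baidu uses, so this is only a sketch of the general idea with hypothetical function names: if the current best centroid b satisfies d(b, j) >= 2 * d(x, b), then centroid j cannot be closer to x than b, and its distance need not be computed.

```python
import math

def dist(a, b):
    """True Euclidean distance (the bound needs a metric, not squared L2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid_distances(centroids):
    """Pairwise centroid distances, computed once and reused for every vector."""
    return [[dist(a, b) for b in centroids] for a in centroids]

def assign_with_pruning(x, centroids, cc):
    """Return (nearest centroid index, number of distances actually computed)."""
    best, best_d, computed = 0, dist(x, centroids[0]), 1
    for j in range(1, len(centroids)):
        # d(x, j) >= d(best, j) - d(x, best) >= best_d, so j cannot win.
        if cc[best][j] >= 2 * best_d:
            continue
        d = dist(x, centroids[j])
        computed += 1
        if d < best_d:
            best, best_d = j, d
    return best, computed

centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]]
cc = centroid_distances(centroids)

near_first = assign_with_pruning([0.5, 0.2], centroids, cc)   # (0, 1)
far_corner = assign_with_pruning([9.0, 9.0], centroids, cc)   # (3, 4)
```

In the first call three of the four distance computations are skipped; in the second none are, because the initial guess is poor. For unit‑length vectors, cosine ranking matches L2 ranking, so the same bound can be applied after normalization.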

HNSW Comparison

HNSW (Hierarchical Navigable Small World) is a graph‑based ANN algorithm. Baidu’s optimized HNSW implementation, built on the open‑source version, achieves ~3.6× higher performance. Benchmark results (same query load) show:

Baseline open‑source HNSW: 900 QPS, 25 CPU cores, 66 GB memory.

Optimized version: 900 QPS, 1.6 CPU cores, 80 GB memory.

Application Scenarios

B‑side anti‑spam : full coverage of uploaded and crawled videos, filtering >60% duplicate videos daily with high accuracy.

C‑side deduplication : improves user experience and protects original content.

Related recommendation : links short clips to full‑length videos, enriching recommendation pools.

Risk control : identifies and blocks politically sensitive or pornographic content.

Summary

The system combines CNN feature extraction, hierarchical ANN via GNOIMI, product quantization, and extensive engineering optimizations to deliver high recall, millisecond‑level latency, and low memory usage at a scale of over a trillion feature vectors. It powers anti‑spam, deduplication, recommendation, and risk‑control services across Baidu’s product ecosystem.
