Evolution of Recall Indexes in Alibaba Advertising: From Quantization to Graph-based HNSW
Alibaba’s advertising pipeline progressed from low‑dimensional quantization partitions to hierarchical tree indexes, then to graph‑based HNSW structures—including multi‑category, multi‑level graphs and a BlazeOp‑driven scoring service—dramatically boosting recall efficiency, scalability and maintainability while meeting strict latency constraints.
Recall is the first stage of search/recommendation/advertising pipelines, aiming to select a fixed‑size subset from a massive candidate pool so that the most relevant items are passed to downstream ranking.
In Alibaba’s advertising scenario the candidate set can reach tens of millions or even hundreds of millions, making exhaustive search infeasible under strict latency constraints. Therefore, indexing structures are built to prune the search space while tolerating a modest loss in recall.
This article describes the evolution of Alimama's recall indexes and the engineering solutions adopted at each step.
1. Partition (Quantization) Index – The simplest approach quantizes each dimension, dividing the vector space into a grid of partitions. In a 2‑D example with 5‑level quantization per axis, the space is split into 25 cells; only vectors in the cells selected for the query are examined, which in the example cuts the number of distance calculations by a factor of about 2.6.
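The grid idea can be sketched in a few lines. This is an illustrative toy, not Alibaba's implementation; the names (`cell_id`, `build_index`, `search`) and the assumed [0, 1] coordinate range are our own.

```python
# Sketch of a 2-D partition (quantization) index with 5 levels per axis,
# i.e. 25 cells. Only vectors in the query's cell are distance-checked.
import math

LEVELS = 5          # quantization levels per dimension
LO, HI = 0.0, 1.0   # assumed coordinate range (illustrative)

def cell_id(vec):
    """Map a 2-D vector to its grid cell (row-major id in [0, 24])."""
    ids = []
    for x in vec:
        q = min(LEVELS - 1, int((x - LO) / (HI - LO) * LEVELS))
        ids.append(q)
    return ids[0] * LEVELS + ids[1]

def build_index(vectors):
    index = {}
    for i, v in enumerate(vectors):
        index.setdefault(cell_id(v), []).append(i)
    return index

def search(index, vectors, query, k=2):
    """Examine only the query's cell, then rank candidates by distance."""
    cand = index.get(cell_id(query), [])
    return sorted(cand, key=lambda i: math.dist(vectors[i], query))[:k]
```

A production variant would also probe neighboring cells near boundaries; even this toy shows why the scheme breaks down in high dimensions, since the cell count is LEVELS raised to the dimensionality.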
Although effective for low‑dimensional, uniformly distributed data, quantization suffers from exponential growth of partitions in high dimensions, non‑uniform data distributions, and difficulty handling complex scoring functions.
2. Tree‑based Index – To overcome quantization limits, a hierarchical clustering (tree) index is used. Early versions employed a binary tree (≈6‑7 levels) stored on the inference engine, incurring many RPC calls. Later, a multi‑way tree expressed with TensorFlow splits was embedded in the model, eliminating RPC overhead. However, tree partitions are mutually exclusive, which restricts the ability to represent multiple overlapping item attributes.
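A tree retrieval of this kind is essentially a per-level beam search: score the children of the surviving nodes, keep the top few, and descend. The following is a minimal sketch under assumed names (`tree_beam_search`, a generic `score` callable standing in for the model); the production version expresses this with TensorFlow ops inside the model graph.

```python
# Illustrative beam search over a multi-way tree index.
# `children[n]` lists the child node ids of n (empty list for leaves);
# `score(n)` is the model's relevance score for node n.

def tree_beam_search(children, score, root=0, beam=2):
    frontier = [root]
    leaves = []
    while frontier:
        next_level = []
        for n in frontier:
            if children[n]:
                next_level.extend(children[n])
            else:
                leaves.append(n)          # reached a candidate item
        # keep only the top-`beam` scored nodes at each level
        frontier = sorted(next_level, key=score, reverse=True)[:beam]
    return sorted(leaves, key=score, reverse=True)
```

Note how pruning relies on each item living in exactly one subtree; that mutual exclusivity is precisely the limitation the article raises.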
3. Graph Index (HNSW) – A graph‑based index removes the exclusivity constraint. Each node represents a vector and edges encode similarity. Retrieval is performed by a beam‑search that starts from randomly chosen nodes. In TensorFlow the graph is represented with a ragged_tensor adjacency list, e.g.:
ragged_tensor = [[1,5,7], [0,2,4,6], [1,3,4], [2,5,6,7], [1,2], [0,3], [1,3,7], [0,3,6]]
Expanding from nodes {6,7} is implemented as:
unique(gather(ragged_tensor, [6,7])) = {0,1,3,6,7}
To avoid revisiting nodes, a bitmap stored in a TensorFlow Variable tracks visited vertices, turning set‑difference operations into fast bitwise checks.
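The gather/unique expansion and the bitmap check above can be reproduced in plain Python. This is a sketch only: the production version uses `tf.ragged` tensors and a TensorFlow `Variable` for the bitmap, whereas here a Python integer serves as the bitmap.

```python
# The adjacency list from the article's ragged_tensor example.
adjacency = [[1, 5, 7], [0, 2, 4, 6], [1, 3, 4], [2, 5, 6, 7],
             [1, 2], [0, 3], [1, 3, 7], [0, 3, 6]]

def expand(frontier):
    """gather the neighbor lists of `frontier`, then deduplicate (unique)."""
    return sorted({n for v in frontier for n in adjacency[v]})

def expand_unvisited(frontier, visited_bitmap):
    """Same expansion, but skip visited nodes via a bitwise check."""
    out = []
    for n in expand(frontier):
        if not (visited_bitmap >> n) & 1:   # bit n set => already visited
            out.append(n)
            visited_bitmap |= 1 << n        # mark as visited
    return out, visited_bitmap

# expand([6, 7]) reproduces the gather/unique example: [0, 1, 3, 6, 7]
```

Turning the set difference into bitwise tests keeps the per-step cost constant regardless of how many nodes have been visited so far.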
4. Multi‑Category Multi‑Level HNSW – When multiple product categories with hierarchical relationships need to be searched simultaneously, separate HNSW graphs are merged into a single large graph composed of isolated sub‑graphs. Nodes may map to the same ad, and beam‑search starts from the sub‑graphs of the selected categories, preserving per‑category quotas while enabling parallel traversal.
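The merged-graph search can be sketched as below. All names (`multi_category_search`, `entry_points`, the `quota` and `steps` parameters) are our assumptions for illustration; the real system runs the per-category traversals in parallel inside the model graph.

```python
# Hedged sketch: one merged graph whose isolated sub-graphs each cover a
# category. Beam search starts only from the selected categories' entry
# points, and each category keeps its own result quota.

def multi_category_search(adjacency, entry_points, score, categories,
                          quota=2, steps=2):
    results = {}
    for cat in categories:
        frontier = list(entry_points[cat])   # entry nodes of this sub-graph
        seen = set(frontier)
        for _ in range(steps):
            nxt = {n for v in frontier for n in adjacency[v]} - seen
            seen |= nxt
            frontier = sorted(nxt, key=score, reverse=True)[:quota]
        results[cat] = sorted(seen, key=score, reverse=True)[:quota]
    return results
```

Because the sub-graphs are isolated, a traversal that starts inside one category can never leak into another, which is what preserves the per-category quotas.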
5. Graph‑enabled Model Service (BlazeOp) – The scoring part of the model is encapsulated in a BlazeOp, separating it from the indexing logic. This modularity simplifies algorithm updates, allows independent optimization of the scoring graph, and circumvents hardware constraints such as the single‑engine‑op limit on NPU systems.
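The essence of that separation can be shown with a traversal routine that takes the scorer as a pluggable callable. This is a minimal sketch, not the BlazeOp API: `greedy_search` and `score_fn` are illustrative names, with the callable standing in for the encapsulated scoring sub-graph.

```python
# Traversal and scoring are decoupled: the index walk below never inspects
# the model, so a new model is deployed by swapping `score_fn` alone.

def greedy_search(adjacency, start, score_fn, steps=3):
    """Walk the graph, always moving to the best-scoring neighbor."""
    node = start
    for _ in range(steps):
        neighbors = adjacency[node]
        if not neighbors:
            break
        best = max(neighbors, key=score_fn)
        if score_fn(best) <= score_fn(node):
            break                      # local optimum reached
        node = best
    return node
```

The same index then serves arbitrarily many models by passing a different `score_fn`, which is the maintainability win the article attributes to this design.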
The combined approach improves recall efficiency, scalability, and maintainability of large‑scale advertising retrieval systems.
References: [1] Zhu et al., “Learning Tree‑based Deep Model for Recommender Systems”, KDD 2018. [2] Chen et al., “Approximate Nearest Neighbor Search under Neural Similarity Metric for Large‑Scale Recommendation”, CIKM 2022.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.