
Dual Vector Foil (DVF): Decoupled Index and Model Retrieval for Large-Scale Recall

The Dual Vector Foil (DVF) system decouples index construction from model training: it builds an HNSW graph after training, so arbitrarily complex models can score candidates. This yields a 5.7% absolute recall gain, cuts latency from ~40 ms to 6.5 ms, raises QPS more than tenfold, and simplifies maintenance.

Alimama Tech

With the rapid growth of internet services, companies have accumulated massive high‑quality content. In such a scenario, recall modules—positioned at the front of the recommendation pipeline—are critical because they determine the upper bound of overall service quality.

The core problem of recall is to select a high‑quality, limited‑size subset from an enormous candidate pool. Historically, recall has evolved from heuristic rules to collaborative filtering, and finally to model‑based approaches. Two mainstream model‑based solutions exist: a two‑stage (two‑tower) architecture that relies on vector inner‑product search, and a one‑stage architecture that jointly learns the index and the model (e.g., the TDM series).
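For reference, the two‑stage (two‑tower) baseline reduces scoring to an inner product between a user embedding and precomputed item embeddings. A minimal NumPy sketch (dimensions and names are illustrative; production systems replace the exhaustive scan with an approximate nearest‑neighbor index):

```python
import numpy as np

def two_tower_retrieve(user_vec, item_matrix, k):
    """Two-stage recall sketch: score every item by inner product with
    the user embedding, then keep the top-k highest-scoring items."""
    scores = item_matrix @ user_vec           # (n_items,) inner products
    topk = np.argpartition(-scores, k)[:k]    # unordered top-k ids
    return topk[np.argsort(-scores[topk])]    # sorted descending by score

rng = np.random.default_rng(0)
user = rng.normal(size=64)                    # user-tower output
items = rng.normal(size=(10_000, 64))         # item-tower outputs
result = two_tower_retrieve(user, items, k=10)
```

The constraint the article criticizes is visible here: relevance must be expressible as a single dot product, which caps the model's expressive power.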

Two‑stage solutions suffer from a mismatch between training and retrieval objectives and are constrained by the inner‑product model structure, limiting their expressive power. One‑stage solutions alleviate these issues but introduce heavy coupling between index construction and model training, making maintenance and rapid iteration difficult.

To address these challenges, the Dual Vector Foil (DVF) algorithm system was proposed. DVF decouples index learning from model training while retaining the ability to use arbitrarily complex models. The name comes from the sci‑fi concept of compressing a three‑dimensional structure into two dimensions, reflecting the goal of keeping model flexibility while simplifying the index.

DVF builds the index post‑training using a Hierarchical Navigable Small World (HNSW) graph, which imposes no constraints on the model’s embedding space. Retrieval proceeds layer‑by‑layer: starting from a set of seed nodes, the HNSW graph is traversed, each visited node is scored by the model, and the top‑K candidates are passed to the next layer. The final layer’s top‑K items constitute the recall result.
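The layer‑by‑layer traversal described above can be sketched as a beam search over the graph layers, with an arbitrary model as the scorer. This is a simplified illustration, not the production implementation; the graph representation, `score_fn`, and the toy example below are all hypothetical stand‑ins:

```python
import heapq

def dvf_layer_retrieve(graph_layers, seeds, score_fn, k):
    """Layer-by-layer retrieval over an HNSW-style graph (sketch).

    graph_layers: list of adjacency dicts, coarsest layer first.
    score_fn:     any model scoring a candidate id (higher = better);
                  nothing constrains it to an inner product.
    At each layer, the current frontier and its neighbors are scored
    and the top-k survivors seed the next, finer layer. The last
    layer's top-k is the recall result.
    """
    frontier = list(seeds)
    for layer in graph_layers:
        candidates = set(frontier)
        for node in frontier:
            candidates.update(layer.get(node, ()))
        frontier = heapq.nlargest(k, candidates, key=score_fn)
    return frontier

# Toy two-layer graph; the "model" prefers ids close to 5.
layers = [{0: [1, 2]}, {1: [5], 2: [4, 6]}]
found = dvf_layer_retrieve(layers, seeds=[0],
                           score_fn=lambda i: -abs(i - 5), k=2)
```

Because scoring happens only on visited nodes, an expensive model touches a small fraction of the corpus, which is what makes the "1.9% of items scored" result in the experiments possible.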

The scoring model consists of four components: (1) user‑level aggregated features extracted by a Transformer, (2) user behavior sequence features combined with the target via target‑attention, (3) an MLP that extracts target (candidate‑item) features, and (4) a final MLP that merges the three feature streams to produce a relevance score.
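The four components can be sketched as follows. All weights here are random placeholders and every name is hypothetical; the point is only the data flow: three feature streams, merged by a final MLP into a scalar score.

```python
import numpy as np

def mlp(x, dims, rng):
    """Toy MLP with random weights; ReLU on hidden layers only."""
    for i, d in enumerate(dims):
        w = rng.normal(size=(x.shape[-1], d)) / np.sqrt(x.shape[-1])
        x = x @ w
        if i < len(dims) - 1:
            x = np.maximum(x, 0.0)
    return x

def dvf_score(user_agg, behavior_seq, target, rng):
    """Sketch of the four-part scoring model described above.

    user_agg:     (d,)   user-level features (stand-in for the
                         Transformer-aggregated representation)
    behavior_seq: (T, d) user behavior sequence
    target:       (d,)   candidate-item features
    """
    # (1) user-level stream (Transformer output stands in as user_agg)
    u = mlp(user_agg, [32], rng)
    # (2) target-attention: weight behavior steps by similarity to target
    att = behavior_seq @ target
    att = np.exp(att - att.max()); att /= att.sum()   # softmax weights
    seq = mlp(att @ behavior_seq, [32], rng)
    # (3) target feature extraction via an MLP
    t = mlp(target, [32], rng)
    # (4) merge the three streams into a scalar relevance score
    merged = np.concatenate([u, seq, t])
    return float(mlp(merged, [16, 1], rng)[0])

rng = np.random.default_rng(0)
s = dvf_score(rng.normal(size=16), rng.normal(size=(20, 16)),
              rng.normal(size=16), rng)
```

Note that nothing in this structure is an inner product of two independent towers, which is exactly the flexibility DVF's post‑training index is meant to preserve.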

From an engineering perspective, DVF integrates both the index and the model into a unified inference module, reducing request latency by ~12 ms. Online retrieval runs on CPU, while scoring runs on GPU; custom TensorFlow ops (set‑difference, set‑union, bitmap) and linear‑attention kernels further accelerate the pipeline. XLA auto‑padding is employed to handle dynamic batch sizes without triggering JIT recompilation.
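The auto‑padding idea can be illustrated with a simple bucketing scheme: pad each dynamic batch up to the nearest fixed bucket size so the compiler sees only a handful of static shapes instead of recompiling per batch size. The bucket values and function names below are illustrative, not DVF's actual configuration:

```python
import numpy as np

def pad_to_bucket(batch_size, buckets=(64, 128, 256, 512)):
    """Pick the smallest bucket >= batch_size so XLA compiles at most
    len(buckets) shapes rather than one per observed batch size."""
    for b in buckets:
        if batch_size <= b:
            return b
    return buckets[-1]  # oversized batches would be split upstream

def pad_batch(x, buckets=(64, 128, 256, 512)):
    """Zero-pad a (batch, ...) array up to its bucket size and return
    a boolean mask marking the real (unpadded) rows."""
    n = x.shape[0]
    b = pad_to_bucket(n, buckets)
    padded = np.zeros((b,) + x.shape[1:], dtype=x.dtype)
    padded[:n] = x
    mask = np.zeros(b, dtype=bool)
    mask[:n] = True
    return padded, mask

padded, mask = pad_batch(np.ones((70, 8)))
```

Downstream code uses the mask to ignore the padded rows when aggregating scores, so the padding changes shapes but not results.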

Offline experiments show that removing the inner‑product restriction yields a 5.71 % absolute recall gain, and DVF achieves comparable recall with only 1.9 % of the items scored. Online benchmarks on a T4 GPU demonstrate latency reductions from 39.8 ms to 6.5 ms and QPS improvements from 68 to over 600 after successive optimizations.

In summary, DVF provides a lightweight, high‑performance, and model‑agnostic solution for large‑scale recall, with clear advantages in both accuracy and system efficiency. Future work includes further model upgrades, NPU acceleration, and exploration of alternative graph construction methods.

Tags: recommendation, deep learning, indexing, large-scale retrieval, dual vector foil, online inference