How Baidu’s Generative Recall System (COBRA) Revolutionizes Ad Recommendations
This article details Baidu's generative recommendation ad recall framework, introducing the COBRA system and its three development stages—dense representation compression, sparse quantization with ID generation, and dense‑sparse cascading—highlighting coarse‑to‑fine inference, performance gains, long‑sequence extensions, online deployment, and future research directions.
Introduction
With the rapid growth of big data, generative recommendation ad recall has become increasingly important. Ji Zhi, senior algorithm engineer at Baidu and head of the information‑flow ad recall direction, presented Baidu's generative ad recall technology, including the core COBRA system accepted at NeurIPS 2025.
Outline of the Talk
Background of generative recommendation ad recall
Stage 1: Dense representation compression and generative contrastive learning
Stage 2: Sparse quantization and Sparse ID generation
Stage 3: Sparse‑Dense cascading representation and integrated generative‑metric modeling (COBRA)
Core breakthroughs: Coarse‑to‑Fine inference and integrated metric analysis
Long‑sequence extension and online inference
Future planning
Q&A
Background
Traditional retrieval systems use a cascading funnel structure, where recall determines downstream throughput. Scaling laws show that merely increasing model size yields diminishing returns, while user behavior sequences contain valuable information that Transformers can model efficiently. Combining generative methods with behavior sequences offers significant potential.
Stage 1 – Dense Representation Compression & Generative Contrastive Learning
Items (ads) contain multi‑modal information such as industry, title, promotion points, landing pages, and media. The goal is to model the sequence of items generatively to predict the next item a user may be interested in. Directly applying large language models faces two main challenges: (1) extremely long sequences (e.g., 30 items × 300 tokens = 9,000 tokens) causing high computational cost, and (2) information loss when representing items with short text.
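The cost argument can be checked with a few lines of arithmetic. The sketch below assumes full self-attention, whose pairwise interactions grow with the square of the sequence length, and assumes the compression target is one vector per item; `attention_cost` is an illustrative helper, not something from the talk.

```python
def attention_cost(seq_len: int) -> int:
    """Pairwise token interactions in full self-attention (seq_len squared)."""
    return seq_len * seq_len

items, tokens_per_item = 30, 300            # figures quoted in the text
token_level_len = items * tokens_per_item   # every token of every item
item_level_len = items                      # assumption: one vector per item

ratio = attention_cost(token_level_len) // attention_cost(item_level_len)
print(token_level_len, item_level_len, ratio)
```

Under these assumptions, compressing each item to a single vector cuts the attention cost by a factor equal to the square of the tokens per item.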
Core Work 1 – Item Representation Optimization: Learn token‑level representations for each item to preserve complete information.
Core Work 2 – Sequence Modeling: Different representations require different sequence‑modeling designs.
The dense stage uses a Transformer encoder to embed each item. A [CLS] token is inserted and interacts with all of the item's other tokens, and its output serves as the item representation. The resulting item embeddings, in behavior order, form an embedding sequence.
Generative contrastive learning feeds the embedding sequence into a causal decoder, predicting the next item as a positive sample while using in‑batch negative sampling to construct a contrastive loss. The encoder and decoder are jointly trained to capture both semantic and collaborative relationships.
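A minimal sketch of an in-batch contrastive (InfoNCE-style) loss may make the training signal concrete. It assumes dot-product similarity and a temperature of 0.1, neither of which is stated in the talk, and it uses plain Python lists instead of a tensor library; `in_batch_contrastive_loss` is a hypothetical name.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_contrastive_loss(predicted, targets, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    Row i of `predicted` (decoder output) should match row i of `targets`
    (embedding of the true next item); all other rows of `targets` act as
    in-batch negatives. Temperature is an assumed hyper-parameter.
    """
    total = 0.0
    for i, p in enumerate(predicted):
        logits = [dot(p, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]  # negative log-softmax of the positive
    return total / len(predicted)

targets = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], targets)
swapped = in_batch_contrastive_loss([[0.0, 1.0], [1.0, 0.0]], targets)
print(aligned < swapped)
```

The loss is small when each prediction points at its own next item and large when it points at another user's item in the batch.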
One special consideration is the time embedding: user actions arrive at irregular intervals, unlike tokens in standard NLP, where position alone describes the spacing between elements.
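The talk does not say how the time embedding is built. One common option, shown here purely as an assumption, is to bucket the gap between consecutive actions on a log scale and look up one learned vector per bucket; `time_gap_bucket` and the bucket count are hypothetical.

```python
import math

def time_gap_bucket(delta_seconds: float, num_buckets: int = 16) -> int:
    """Map the gap between two actions to a bucket index on a log2 scale.

    Hypothetical scheme: gaps under one second share bucket 0, and very long
    gaps are clipped into the last bucket.
    """
    if delta_seconds < 1:
        return 0
    return min(int(math.log2(delta_seconds)) + 1, num_buckets - 1)

timestamps = [0, 5, 65, 3665, 90065]                 # seconds, toy data
gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
buckets = [time_gap_bucket(g) for g in gaps]
print(gaps, buckets)
```

Each bucket index would then select a row of an embedding table that is added to the corresponding item embedding, so a gap of seconds and a gap of a day look different to the decoder.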
Stage 2 – Sparse Quantization & Sparse ID Generation
The second stage draws inspiration from the TIGER framework. Items are first embedded, for example with ERNIE for semantic vectors or with dual‑tower/graph methods for collaborative vectors. These vectors are then quantized via techniques like RQ‑VAE into hierarchical ID tuples (e.g., (L1=6, L2=1, L3=4)). The quantized IDs form a compressed ID sequence.
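The encoding half of residual quantization can be sketched as follows. RQ‑VAE learns its codebooks jointly with an autoencoder; here the codebooks are fixed toy values, so the example only illustrates how a vector turns into a hierarchical ID tuple. All names and numbers are illustrative.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    def dist(code):
        return sum((a - b) ** 2 for a, b in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def residual_quantize(vec, codebooks):
    """Quantize level by level: each level encodes what the previous ones missed."""
    ids, residual = [], list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        ids.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(ids), residual

# Toy codebooks (assumed values): coarse, medium, and fine resolution.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]],
    [[0.0, 0.0], [0.05, 0.0], [0.0, 0.05], [0.05, 0.05]],
]
ids, residual = residual_quantize([1.2, 0.25], codebooks)
print(ids)
```

The first ID carries the coarsest information and later IDs refine it, which is why items sharing a prefix tend to be similar.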
During modeling, the ID sequence is fed into a causal decoder with next‑ID prediction, learning user interest transitions in the compressed space. In inference, given a user’s preceding ID sequence, the model autoregressively generates the next likely ID tuple.
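Autoregressive generation of an ID tuple can be illustrated with a small beam search. In the real system the next-level distribution comes from the causal decoder conditioned on the user's history; in this sketch a hard-coded table, `toy_model`, stands in for it, and the probabilities are invented for illustration.

```python
import math

def beam_search(next_log_probs, num_levels, beam_width):
    """Generate ID tuples level by level, keeping the best `beam_width` prefixes.

    next_log_probs(prefix) returns {code: log_prob} for the next level.
    """
    beams = [((), 0.0)]
    for _ in range(num_levels):
        candidates = []
        for prefix, score in beams:
            for code, log_prob in next_log_probs(prefix).items():
                candidates.append((prefix + (code,), score + log_prob))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

def toy_model(prefix):
    # Hypothetical stand-in for the decoder's next-ID distribution.
    table = {
        (): {0: 0.6, 1: 0.4},
        (0,): {0: 0.55, 1: 0.45},
        (1,): {0: 0.9, 1: 0.1},
    }
    return {code: math.log(p) for code, p in table[prefix].items()}

beams = beam_search(toy_model, num_levels=2, beam_width=2)
print(beams[0][0])
```

With these toy numbers, a greedy decoder would commit to first-level code 0 and end at a tuple with probability 0.33, while the beam keeps code 1 alive and finds the tuple (1, 0) with probability 0.36.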
Advantages include a compact sequence (a few IDs per item, each derived from the item's full information) and reduced computational load, while challenges involve hyper‑parameter sensitivity (e.g., codebook size) and information loss introduced by quantization.
Stage 3 – Sparse‑Dense Cascading Representation (COBRA)
COBRA integrates sparse and dense representations. Sparse IDs capture high‑level categorical information, while dense vectors encode fine‑grained user preferences. The sequence alternates between sparse and dense tokens, enabling the model to predict both at each timestep.
Key innovations:
Joint sparse‑dense cascading representation learned via codebooks.
Alternating sequence learning where the model predicts the next sparse ID and dense vector, allowing each to complement the other.
End‑to‑end dense vector learning to capture high‑level semantics and collaborative signals.
During training, the model learns to predict both representations, and during inference the generated sparse ID is appended to the sequence, followed by dense vector refinement.
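The alternating layout can be sketched as a plain data structure. The example assumes one sparse ID and one dense vector per item and a standard causal setup in which each position predicts the token that follows it; `interleave` and `prediction_pairs` are hypothetical helper names.

```python
def interleave(items):
    """Flatten [(sparse_id, dense_vec), ...] into an alternating token sequence."""
    sequence = []
    for sparse_id, dense_vec in items:
        sequence.append(("sparse", sparse_id))
        sequence.append(("dense", dense_vec))
    return sequence

def prediction_pairs(sequence):
    """With causal next-token prediction, position t is trained to predict t+1."""
    return [(sequence[t][0], sequence[t + 1][0]) for t in range(len(sequence) - 1)]

history = [(3, [0.1, 0.9]), (7, [0.4, 0.2])]   # toy items
sequence = interleave(history)
pairs = prediction_pairs(sequence)
print(pairs)
```

A dense position is trained to predict the next item's sparse ID, and a sparse position is trained to predict the dense vector of the same item, which is the order inference follows when it generates an ID first and refines it afterwards.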
Core Breakthrough – Coarse‑to‑Fine Inference
Inference mirrors training: first, a coarse high‑level interest (Sparse ID) is generated using beam search; second, the generated ID is appended to the original sequence, and a fine‑grained dense vector is obtained via another forward pass. Two scores are computed, a BeamScore for the coarse interest and an NNScore for the fine‑grained representation, and the BeamFusion framework fuses them to balance recall precision and diversity.
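The summary does not give the fusion formula, so the following is only one plausible rule: normalize each score with a temperature-scaled softmax and multiply the results. The coefficients `tau` and `psi`, the function `beam_fusion`, and the candidate values are all assumptions for illustration.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def beam_fusion(candidates, tau=1.0, psi=1.0):
    """Rank candidates by the product of normalized coarse and fine scores.

    candidates: list of (item_id, beam_score, nn_score), where beam_score is
    the log-probability of the item's sparse ID and nn_score is the similarity
    between the generated dense vector and the item's vector.
    """
    coarse = softmax([tau * c[1] for c in candidates])
    fine = softmax([psi * c[2] for c in candidates])
    fused = [(c[0], b * n) for c, b, n in zip(candidates, coarse, fine)]
    return sorted(fused, key=lambda pair: pair[1], reverse=True)

candidates = [("a", -1.0, 0.90), ("b", -1.0, 0.50), ("c", -3.0, 0.95)]
ranked = beam_fusion(candidates)
print([item for item, _ in ranked])
```

In this toy case item "c" has the best dense match but sits under an unlikely sparse ID, so it ranks last. Raising `psi` relative to `tau` would shift the balance toward the fine-grained score.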
Effect Analysis
COBRA achieves a 12‑point absolute (36% relative) improvement in recall@800 over previous methods and surpasses Google's TIGER, the prior state of the art, on both production data and public benchmarks. Visualization shows that incorporating Sparse IDs enhances intra‑cluster cohesion and inter‑cluster separation of dense vectors.
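For readers unfamiliar with the metric, recall@K is the fraction of a user's relevant items that appear among the top K retrieved candidates. A minimal sketch with toy data (the exact evaluation protocol used in the talk is not described):

```python
def recall_at_k(retrieved, relevant, k):
    """Share of relevant items found in the first k retrieved items."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = set(retrieved[:k]) & relevant_set
    return len(hits) / len(relevant_set)

score = recall_at_k(retrieved=[1, 2, 3, 4], relevant=[2, 4, 9], k=3)
print(score)
```

Here only item 2 of the three relevant items appears in the top 3, so the score is one third. Recall@800 applies the same definition with K = 800.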
Long‑Sequence Extension & Online Inference
To handle longer sequences, Baidu optimizes the input layer by treating proper nouns as single tokens, reduces padding via packing and Flash‑Attention, employs recomputation to trade time for memory, and uses All‑Gather across GPUs for efficient negative sampling. These optimizations allow sequences of several thousand items.
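Packing is the easiest of these optimizations to show in isolation. The sketch below uses a first-fit-decreasing heuristic to place several short sequences into one fixed-length row; the talk does not say which packing strategy Baidu uses, so the heuristic, the function name, and the lengths are assumptions.

```python
def pack_sequences(lengths, max_len):
    """Group sequence indices into rows whose total length fits in max_len.

    First-fit decreasing: place the longest sequences first, each into the
    first row that still has room.
    """
    rows = []  # each row is [remaining_capacity, [sequence indices]]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in order:
        n = lengths[idx]
        if n > max_len:
            raise ValueError(f"sequence {idx} is longer than max_len")
        for row in rows:
            if row[0] >= n:
                row[0] -= n
                row[1].append(idx)
                break
        else:
            rows.append([max_len - n, [idx]])
    return [row[1] for row in rows]

lengths = [700, 300, 512, 200, 100, 900]
packs = pack_sequences(lengths, max_len=1024)
padding_unpacked = len(lengths) * 1024 - sum(lengths)
padding_packed = len(packs) * 1024 - sum(lengths)
print(packs, padding_unpacked, padding_packed)
```

A packed row needs an attention mask that stops tokens from attending across sequence boundaries, which is where variable-length attention kernels come in.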
Online inference leverages INT8 quantization, TensorRT, cache TTL tuning, and hierarchical ANN indexing to meet latency and resource constraints.
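As a rough illustration of what INT8 quantization does to a tensor, here is a symmetric per-tensor scheme in plain Python. Production toolchains such as TensorRT use calibrated and often per-channel scales, so this is a simplified sketch under assumed settings, not the deployed method.

```python
def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

weights = [0.5, -1.0, 0.25, 0.0]            # toy values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, max_error)
```

The rounding error per value is bounded by half the scale, which is the accuracy given up in exchange for smaller weights and faster integer arithmetic.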
Future Planning
Future work will continue extending sequence length, scaling model parameters, exploring sparse attention techniques (e.g., NSA and MoBA), and pursuing an end‑to‑end generative system that unifies recall, creative generation, and ranking.
Q&A Highlights
Q1: Retrieval after COBRA inference uses hierarchical indexing: first locate the generated sparse ID, then use ANN search to fetch the matching dense vectors under that ID.
Q2: Multi‑modal features (e.g., video) are under investigation, with plans to embed them into the ID space.
Q3: Sparse representations add discriminative power even for small item vocabularies.
Q4: COBRA can be adapted for ranking by adjusting the downstream objective and incorporating discrete IDs.
Q5: Joint recall‑ranking modeling yields gains but introduces coupling challenges.
Q6: Sparse IDs enable modeling of high‑level interest transitions by generating categorical IDs before fine‑grained dense refinement.
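The hierarchical lookup mentioned in Q1 can be sketched as a two-step search: select the bucket for the generated sparse ID, then rank the items in that bucket by similarity to the generated dense vector. Brute-force cosine similarity stands in for a real ANN index here, and the catalog, names, and vectors are invented for illustration.

```python
import math

def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def build_index(items):
    """Group (item_id, sparse_id, dense_vec) entries by sparse ID."""
    index = {}
    for item_id, sparse_id, vec in items:
        index.setdefault(sparse_id, []).append((item_id, vec))
    return index

def retrieve(index, sparse_id, query_vec, top_k=2):
    """Step 1: pick the bucket by sparse ID. Step 2: rank inside it by cosine."""
    bucket = index.get(sparse_id, [])
    ranked = sorted(bucket, key=lambda e: cosine(query_vec, e[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:top_k]]

catalog = [
    ("ad_1", (6, 1, 4), [1.0, 0.0]),
    ("ad_2", (6, 1, 4), [0.6, 0.8]),
    ("ad_3", (2, 0, 5), [1.0, 0.1]),
]
index = build_index(catalog)
results = retrieve(index, (6, 1, 4), [0.9, 0.1], top_k=1)
print(results)
```

Note that "ad_3" is never scored even though its vector is close to the query, because it lives under a different sparse ID; restricting the search this way is what keeps the dense lookup cheap.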
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.