How Baidu’s GRAB Model Uses Scaling Laws to Transform Ad Ranking
This article explains Baidu's generative ranking model GRAB, detailing how scaling laws from large language models inspire a new recommendation paradigm, the model's architecture, custom attention mechanisms, training strategies, deployment optimizations, and the resulting business gains in CTR and revenue.
Introduction
Recent breakthroughs in generative AI, especially large language models (LLMs), have demonstrated the "Scaling Law" phenomenon, where model performance grows predictably with parameters, data, and compute. In the demanding ad recommendation scenario, traditional deep learning ranking models (DLRMs) face performance bottlenecks.
The Baidu Commercial Technology team designed and fully deployed a generative ranking model called GRAB (Generative Ranking for Ads at Baidu) to overcome these limits, covering problem diagnosis, paradigm exploration, framework design, technical challenges, and business outcomes.
1. Trend of Large‑Model Recommendation
Before GRAB, Baidu's ad recommendation relied on classic DLRMs that combine massive discrete features with MLPs. While effective, this paradigm hits a ceiling due to diminishing returns from feature engineering, lossy compression of sequential representations, weak reasoning, and low activation rates for dynamic ad scenarios.
2. "Scaling Law": A Breakthrough Insight
LLM research shows that loss falls as a power law of model size, data, and compute (a straight line on log‑log axes), suggesting that scaling up recommendation models could yield continuous gains. This insight motivated the exploration of large‑model approaches for recommendation.
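Scaling laws in the LLM literature are usually written as a power law, loss = a * size^(-b), which becomes a straight line once both axes are log-transformed. The sketch below fits a and b by least squares in log space. The data points are synthetic and generated inside the example purely for illustration; they are not measurements from GRAB or any Baidu system, and `fit_power_law` is a hypothetical helper name.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-b) via least squares on log(size), log(loss)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # equals -b in the power law
    intercept = mean_y - slope * mean_x    # equals log(a)
    return math.exp(intercept), -slope

# Synthetic points (assumption): generated from a = 5.0, b = 0.1.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.1 for n in sizes]
a, b = fit_power_law(sizes, losses)
```

On real measurements the fit is only as good as the range of sizes covered, so extrapolating far beyond the largest trained model should be treated with caution.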
3. Three Paths Explored for Large‑Model Recommendation
Path 1 – Direct LLM Recommendation: Directly applying a generic LLM to ad data failed, with performance dropping by more than a percentage point.
Path 2 – LLM‑Enhanced Representations: Using LLMs to generate high‑quality feature embeddings improved generalization but offered limited short‑term gains.
Path 3 – Generative Sequential Modeling: Adapting LLM techniques (Transformer, long‑context) to model user behavior sequences end‑to‑end proved effective and led to the GRAB framework.
GRAB Overall Design
1. Core Design Philosophy
From "Separate" to "Unified": Model history behavior and target ad in a shared representation space, similar to LLM token modeling.
From "Flat" to "Structured": Transform user behavior into structured sequences handling variable length and hierarchy.
From "Manual" to "Adaptive": Feed raw user sequences directly, letting the model learn without handcrafted features.
From "Sequence Retrieval" to "Efficient Attention": Replace traditional hard‑search with causal Transformer attention for full‑sequence modeling.
2. Framework
GRAB treats the concatenated user history and candidate ad as a unified event sequence. Each event is tokenized via a GATE + MLP layer, then processed by a causal attention Transformer. The Transformer output passes through an MLP and Sigmoid to predict click‑through rate (CTR) for each ad slot.
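The flow described above (event tokens, causal attention, then a Sigmoid head) can be sketched in a few lines of plain Python. This is a minimal illustration under several assumptions, not GRAB's implementation: the GATE + MLP tokenizer and the learned query/key/value projections are omitted, there is a single attention head and layer, the final MLP is reduced to one linear layer, and the candidate ad is assumed to be the last event in the sequence. The names `causal_attention` and `predict_ctr` are hypothetical.

```python
import math

def causal_attention(queries, keys, values):
    """Single-head causal attention: position i attends only to positions <= i."""
    dim = len(queries[0])
    outputs = []
    for i, query in enumerate(queries):
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ])
    return outputs

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_ctr(event_tokens, head_weights, head_bias=0.0):
    """Score the last token (assumed to be the candidate ad) given the history."""
    hidden = causal_attention(event_tokens, event_tokens, event_tokens)
    logit = sum(w * h for w, h in zip(head_weights, hidden[-1])) + head_bias
    return sigmoid(logit)

# Toy usage: two history events followed by one candidate ad, 2-dim tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ctr = predict_ctr(tokens, head_weights=[0.5, -0.5])
```

Because the mask is causal, changing the candidate token cannot alter the representations of earlier history events, which is the property that later makes caching the history practical.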
3. Comparison with LLM and DLRM
GRAB shares the Transformer backbone with LLMs but focuses on user‑behavior tokens and discriminative learning rather than generative language objectives. Compared to DLRM, GRAB replaces handcrafted feature tables with end‑to‑end sequence modeling, so the change spans the whole ranking pipeline rather than a single component.
Challenges and Solutions
1. Customized Attention Mechanism
Standard Transformer attention cannot directly handle recommendation’s complex interaction and temporal signals. The solution is Q‑Aware RAB (Query‑aware Relative Attention Bias) which combines causal masking, dual sliding windows (time and length), and query‑dependent relative biases.
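One way to picture the masking part of this design is as a visibility matrix built from three conditions: causality, a window on position distance, and a window on elapsed time. The sketch below shows only that mask. The query‑dependent relative bias is not sketched because the text does not specify its form; in a full attention layer it would presumably be added to the attention scores wherever the mask is true. The function name, parameter names, and the exact window conventions (strict versus inclusive bounds) are assumptions.

```python
def build_attention_mask(timestamps, max_positions, max_age_seconds):
    """Return mask[i][j] == True when query event i may attend to key event j.

    Three conditions are combined (the window conventions are assumptions):
      * causal: j <= i
      * length window: fewer than max_positions steps back
      * time window: at most max_age_seconds older than the query event
    """
    n = len(timestamps)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: never look at future events
            within_length = (i - j) < max_positions
            within_time = (timestamps[i] - timestamps[j]) <= max_age_seconds
            mask[i][j] = within_length and within_time
    return mask

# Toy usage: the last event happens long after the others.
mask = build_attention_mask([0, 10, 20, 1000], max_positions=2, max_age_seconds=100)
```

In this toy case the last event sees only itself: its nearest neighbor is inside the length window but falls outside the time window.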
2. Training Efficiency & Over‑fitting
Variable‑Length Zero‑Redundancy Packing: Pack multiple user sequences together with masks to improve GPU utilization.
Two‑Stage Training (STS): First stage learns end‑to‑end sequence autoregression; second stage trains sparse discrete representations, mitigating over‑fitting caused by user‑interest locality.
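The packing idea can be illustrated with segment IDs: several variable‑length user sequences are concatenated into one buffer with no padding, and the attention mask stops tokens from attending across sequence boundaries. This is a generic sketch of sequence packing, not Baidu's implementation; `pack_sequences` and `packed_causal_mask` are hypothetical names.

```python
def pack_sequences(sequences):
    """Concatenate variable-length sequences and record which one each token came from."""
    tokens, segment_ids = [], []
    for segment, sequence in enumerate(sequences):
        tokens.extend(sequence)
        segment_ids.extend([segment] * len(sequence))
    return tokens, segment_ids

def packed_causal_mask(segment_ids):
    """Causal mask that is block-diagonal: no attention across packed sequences."""
    n = len(segment_ids)
    return [
        [j <= i and segment_ids[i] == segment_ids[j] for j in range(n)]
        for i in range(n)
    ]

# Toy usage: one user with two events and another with one event.
tokens, segment_ids = pack_sequences([["click_a", "click_b"], ["click_c"]])
mask = packed_causal_mask(segment_ids)
```

A dense boolean mask is used here for readability. Production kernels typically avoid materializing it and work from sequence boundary offsets instead.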
3. Inheriting the "Old Soup" Model
To warm‑start GRAB from the existing, long‑trained production model (the "old soup"), static user attributes are encoded as heterogeneous tokens and combined only when needed, reducing redundancy. A dual‑loss training scheme (original DLRM loss + GRAB sequence loss) enables smooth migration.
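A dual‑loss setup of this kind can be sketched as a weighted sum of two binary cross‑entropy terms, one per prediction head. The weighted‑sum form, the `grab_weight` parameter, and the use of binary cross‑entropy for both terms are assumptions for illustration; the text does not say how the two losses are actually combined.

```python
import math

def binary_cross_entropy(prob, label, eps=1e-7):
    """Log loss for one example; prob is clamped to avoid log(0)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def dual_loss(dlrm_prob, grab_prob, label, grab_weight=0.5):
    """Weighted sum of the legacy DLRM loss and the GRAB sequence loss (assumed form)."""
    dlrm_term = binary_cross_entropy(dlrm_prob, label)
    grab_term = binary_cross_entropy(grab_prob, label)
    return (1.0 - grab_weight) * dlrm_term + grab_weight * grab_term

# Toy usage: a clicked impression scored by both heads.
loss = dual_loss(dlrm_prob=0.5, grab_prob=0.5, label=1)
```

Under this assumed form, raising `grab_weight` over the course of training would be one way to shift emphasis from the legacy objective to the new one.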
4. Efficient Online Inference
KV‑Cache: Cache key/value vectors of user history for fast per‑request inference.
System & Algorithm Optimizations: Use M‑Falcon packing, operator fusion, low‑precision computation, and cache‑aware serving to keep inference cost comparable to traditional models.
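The KV‑Cache idea can be sketched as a per‑user store of key and value vectors: the history is encoded once, and each candidate ad only needs a single attention pass over the cached entries. This is a simplified single‑head, single‑layer illustration. The class name, its methods, and the choice to have the candidate attend to history only (without its own key and value) are assumptions, not details from the talk.

```python
import math

class UserKVCache:
    """Caches per-user key/value vectors so history is not re-encoded per request."""

    def __init__(self):
        self._store = {}

    def append(self, user_id, key, value):
        """Add one history event's key and value vectors for a user."""
        keys, values = self._store.setdefault(user_id, ([], []))
        keys.append(key)
        values.append(value)

    def attend(self, user_id, query):
        """Softmax attention of one candidate query over the cached history."""
        keys, values = self._store[user_id]
        dim = len(query)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        return [
            sum(w / total * value[d] for w, value in zip(weights, values))
            for d in range(len(values[0]))
        ]

# Toy usage: two cached history events, then one candidate query.
cache = UserKVCache()
cache.append("user_1", key=[1.0, 0.0], value=[2.0, 3.0])
cache.append("user_1", key=[1.0, 0.0], value=[4.0, 5.0])
context = cache.attend("user_1", query=[1.0, 0.0])
```

Scoring many candidates for the same user then reuses the same cached entries, so the per‑request cost grows with the number of candidates rather than with re‑encoding the full history each time.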
Business Impact
GRAB was fully deployed in Baidu’s ad ranking, delivering:
~0.003 % AUC lift.
~4 % revenue increase.
~5 % click‑through‑rate improvement.
Experiments also confirmed the scaling law: extending user sequence length from 64 to over 1024 yields near‑linear AUC growth, validating the long‑term potential of generative recommendation.
Future Outlook
The next generation of recommendation systems should combine broader knowledge, multimodal inputs, and rapid adaptation. Baidu envisions a path from rule‑based to hybrid to fully generative systems, where "recommendation large‑modelization" and "large‑model recommendation" converge.
Q&A Highlights
Key takeaways include the distinction between emergent phenomena and capabilities, the applicability of scaling laws to recommendation, practical training pipeline changes, handling heterogeneous tokens, and deployment strategies.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
