From Tokens to Revenue: Kuaishou’s GR4AD Pioneers Full‑Stack Generative Recommendation for Ads

GR4AD, Kuaishou’s generative recommendation system, redesigns the entire ad pipeline—from tokenizing multimodal ad material to value‑aware learning, lazy decoding, and dynamic beam search—delivering over 4 % revenue lift, higher eCPM, and sub‑100 ms latency for more than 400 million users.

Machine Heart

1. Introduction: A New Paradigm for Recommendation

Deep‑learning recommendation models have dominated the industry for a decade, but the rise of large language models (LLMs) raises a question: can we "generate" recommendation results directly, the way we generate text? This is the core idea of generative recommendation. Prior works such as TIGER and OneRec demonstrated feasibility in organic (non‑advertising) recommendation scenarios, yet applying the paradigm to large‑scale advertising—where latency, revenue, and commercial value are critical—poses additional challenges.

Kuaishou’s paper introduces GR4AD (Generative Recommendation for ADvertising), a system that spans the representation, learning, and serving layers, fully deployed on its ad platform and serving over 400 million users.

2. Problems and Challenges in Advertising

The authors identify three unique challenges that make a direct LLM transfer infeasible.

Challenge 1: Tokenizing Ad Material – Ads combine video creatives, product details, and business‑side (advertiser) metadata, plus conversion types and account signals that carry high commercial value but little semantic content. A unified token system must capture both semantics and business signals.

Challenge 2: Learning Paradigm – The objective is to maximize list‑level metrics such as eCPM and NDCG, not just click‑through prediction. Existing generative methods follow LLM‑style staged training and lack list‑level, value‑aware optimization.

Challenge 3: Real‑time Service – Advertising requires sub‑100 ms latency and extremely high QPS, demanding generation of many high‑quality candidates via Beam Search, a problem distinct from LLM decoding.

3. Methodology: End‑to‑End Collaborative Design

3.1 Unified Advertising Semantic ID (UA‑SID)

The system first creates a unified embedding for each ad using an instruction‑tuned multimodal LLM (MLLM). Six prompt templates guide the model to understand different ad formats (live, product, influencer, etc.). Co‑occurrence learning with the Swing method and an InfoNCE contrastive objective injects collaborative signals into the representation.
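As a rough illustration of the contrastive step, the sketch below shows a standard in‑batch InfoNCE loss in which each ad’s positive is its Swing co‑occurrence partner and all other ads in the batch act as negatives; the tensor names and temperature are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def infonce_loss(anchor_emb, partner_emb, temperature=0.07):
    """In-batch InfoNCE: row i of `partner_emb` is a Swing co-occurrence
    partner of row i of `anchor_emb`; every other row in the batch serves
    as a negative."""
    a = F.normalize(anchor_emb, dim=-1)              # [B, D]
    p = F.normalize(partner_emb, dim=-1)             # [B, D]
    logits = a @ p.t() / temperature                 # [B, B] similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)           # diagonal = positives
```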

Next, a Multi‑Granularity‑Multi‑Resolution (MG‑MR) RQ‑Kmeans quantization converts the embedding into a discrete Semantic ID. The multi‑resolution codebook uses larger codebooks at lower levels to capture dominant semantics and smaller ones at higher levels for residuals. Multi‑granularity replaces the final vector quantization with a hash of non‑semantic features (conversion type, account ID), reducing SID collisions caused by identical content with different delivery strategies.

Each ad material is thus mapped to a discrete UA‑SID sequence.
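The encoding side of this idea can be sketched as follows, under stated assumptions: `codebooks` is a list of pre‑trained k‑means centroid matrices with decreasing sizes, and the hypothetical `uasid_encode` helper swaps the final semantic level for a hash bucket over business features. The paper’s actual codebook sizes and training procedure are not reproduced here.

```python
import numpy as np

def rq_kmeans_encode(x, codebooks):
    """Residual quantization: at each level, snap the residual to its nearest
    centroid and pass the remainder down. Multi-resolution means codebooks[0]
    is the largest (dominant semantics) and later levels shrink."""
    sid, residual = [], x
    for cb in codebooks:                             # cb: [K_l, D], K_0 > K_1 > ...
        idx = int(np.linalg.norm(residual[None, :] - cb, axis=1).argmin())
        sid.append(idx)
        residual = residual - cb[idx]
    return sid

def uasid_encode(x, codebooks, conversion_type, account_id, n_buckets=4096):
    """Multi-granularity: the final quantization level is replaced by a hash
    bucket of non-semantic business features, so two ads with identical
    content but different delivery strategies no longer collide."""
    sid = rq_kmeans_encode(x, codebooks[:-1])
    sid.append(hash((conversion_type, account_id)) % n_buckets)
    return sid
```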

3.2 LazyAR: Lazy Decoder

Standard autoregressive decoding requires each token to depend on the previous one, which becomes a bottleneck when Beam width is large. The authors observe that the first SID layer is hardest to learn yet uses a Beam of 1, while later layers are easier but have exponentially growing Beam sizes.

LazyAR postpones the dependency on the previous token to a later layer K. The first K layers compute in parallel using only positional encodings and context X, allowing all Beam candidates to share these computations. The remaining L‑K layers perform standard autoregressive decoding after injecting the previous SID embedding.

An auxiliary multi‑token prediction (MTP) loss forces the parallel layers to learn useful representations. Experiments show that, at the same recommendation quality, inference throughput doubles.
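A minimal decoding sketch, assuming a model that exposes two hypothetical methods: `parallel_levels`, which scores the first K SID levels from the user context alone (so all beams share one forward pass), and `ar_step`, which scores one autoregressive level given a SID prefix. This is an illustration of the two‑phase structure, not the paper’s implementation.

```python
import torch

def expand(beams, logp, width):
    """Top-`width` expansion when every beam shares the same next-token logits."""
    top = logp.topk(width)
    cands = [(prefix + [t], score + lp)
             for prefix, score in beams
             for t, lp in zip(top.indices.tolist(), top.values.tolist())]
    return sorted(cands, key=lambda b: -b[1])[:width]

def lazyar_beam_decode(model, context, K, L, width):
    """LazyAR sketch: SID levels 0..K-1 condition only on the user context
    and position, so their logits come from one shared forward pass; levels
    K..L-1 inject the chosen SID prefix and decode autoregressively."""
    shared = model.parallel_levels(context)          # hypothetical: [K, vocab]
    beams = [([], 0.0)]
    for level in range(K):                           # parallel phase, logits shared
        beams = expand(beams, shared[level].log_softmax(-1), width)
    for level in range(K, L):                        # standard AR phase
        cands = []
        for prefix, score in beams:
            logp = model.ar_step(context, prefix, level).log_softmax(-1)
            top = logp.topk(width)
            cands += [(prefix + [t], score + lp)
                      for t, lp in zip(top.indices.tolist(), top.values.tolist())]
        beams = sorted(cands, key=lambda b: -b[1])[:width]
    return beams                                     # (SID sequence, log-prob) pairs
```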

This design is recommendation‑native and does not apply to standard LLM decoding, which typically does not use Beam Search.

3.3 Value‑Sensitive Supervision (VSL)

Because ad samples vary widely in commercial value, VSL adds three components:

SID + eCPM joint prediction – discretizes eCPM into buckets and predicts them as additional tokens.

Value‑aware sample weighting – assigns higher weights to high‑value users and deep interaction behaviors (e.g., purchases).

MTP auxiliary loss – works with LazyAR to ensure high‑quality representations in the parallel layers.

The final VSL objective combines these signals.
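A minimal sketch of how such a combined objective could look; the weighting scheme, the `alpha`/`beta` coefficients, and the tensor shapes are illustrative assumptions rather than the paper’s exact formulation.

```python
import torch.nn.functional as F

def vsl_loss(sid_logits, sid_targets, ecpm_logits, ecpm_buckets,
             value_weight, mtp_logits=None, mtp_targets=None,
             alpha=1.0, beta=1.0):
    """Value-sensitive supervision sketch: per-sample next-SID cross-entropy
    plus an eCPM-bucket token loss, both scaled by a value weight (higher for
    high-value users and deep conversions such as purchases), plus an optional
    MTP auxiliary term that supervises LazyAR's parallel levels."""
    ce_sid = F.cross_entropy(sid_logits, sid_targets, reduction="none")    # [B]
    ce_ecpm = F.cross_entropy(ecpm_logits, ecpm_buckets, reduction="none")  # [B]
    loss = (value_weight * (ce_sid + alpha * ce_ecpm)).mean()
    if mtp_logits is not None:
        loss = loss + beta * F.cross_entropy(mtp_logits, mtp_targets)
    return loss
```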

3.4 Ranking‑Guided Softmax Preference Optimization (RSPO)

VSL fits historical data but neither directly optimizes downstream ranking metrics nor explores beyond the logged label distribution. RSPO introduces a list‑level, NDCG‑oriented reinforcement learning algorithm based on the Lambda framework. The authors prove that the RSPO loss upper‑bounds the NDCG‑based cost, guaranteeing that minimizing it directly optimizes the ranking metric.
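The paper’s exact RSPO objective is not reproduced here, but a LambdaRank‑flavored reading of the idea might look like the sketch below: pairwise preferences weighted by the |ΔNDCG| of swapping two list positions, with eCPM as the gain (ideal‑DCG normalization omitted for brevity). All function names are hypothetical.

```python
import math
import torch
import torch.nn.functional as F

def delta_ndcg(ecpm, i, j):
    """|ΔNDCG| of swapping list positions i and j, with eCPM as the gain
    (ideal-DCG normalization omitted for brevity)."""
    gain = lambda r: 2.0 ** r - 1.0
    disc = lambda pos: 1.0 / math.log2(pos + 2)
    return abs((gain(ecpm[i]) - gain(ecpm[j])) * (disc(i) - disc(j)))

def rspo_like_loss(scores, ecpm):
    """Hypothetical lambda-weighted preference loss: for every pair where
    item i carries higher eCPM than item j, push the model to prefer i,
    scaled by the NDCG change a swap of the two would cause."""
    n, loss = scores.size(0), scores.new_zeros(())
    for i in range(n):
        for j in range(n):
            if ecpm[i] > ecpm[j]:
                w = delta_ndcg(ecpm, i, j)
                loss = loss - w * F.logsigmoid(scores[i] - scores[j])
    return loss / max(n * (n - 1) // 2, 1)
```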

4. Online Deployment: Full Closed‑Loop System

GR4AD (0.16 B parameters) is fully deployed with a closed loop: reward estimation → online learning → real‑time indexing → real‑time serving.

4.1 Core Modules

Reward System – trains an independent reward model to score candidate sets with eCPM, allowing larger Beam exploration under relaxed latency constraints.

Online Learning Module – continuously builds VSL and RL signals, performs mini‑batch updates, and pushes parameters to the inference service.

Real‑time Indexing Module – replaces traditional embedding indexes with SID. New material only needs UA‑SID computation and a bidirectional index update, achieving second‑level freshness and improving cold‑start coverage (a minimal sketch of such an index follows this list).

Real‑time Service Engine – handles user requests and returns ranked ad lists.
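A minimal sketch of what such a bidirectional SID index could look like; the class and method names are hypothetical, and production concerns (sharding, TTLs, concurrency) are omitted. The point is that onboarding a new creative is two dictionary updates rather than an ANN index rebuild.

```python
from collections import defaultdict

class SidIndex:
    """Bidirectional SID index sketch: decoded SID sequences resolve to
    concrete ad IDs, and a new creative becomes servable as soon as its
    UA-SID is computed."""
    def __init__(self):
        self.sid_to_ads = defaultdict(set)   # SID -> ad IDs sharing it
        self.ad_to_sid = {}                  # ad ID -> its SID

    def add(self, ad_id, sid):
        sid = tuple(sid)
        old = self.ad_to_sid.get(ad_id)
        if old is not None:                  # creative re-encoded: drop stale entry
            self.sid_to_ads[old].discard(ad_id)
        self.ad_to_sid[ad_id] = sid
        self.sid_to_ads[sid].add(ad_id)      # second-level freshness: one map update

    def lookup(self, sid):
        """Resolve a SID produced by beam search to candidate ad IDs."""
        return self.sid_to_ads.get(tuple(sid), set())
```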

4.2 Inference Efficiency Optimizations

Dynamic Beam Service (DBS) introduces two mechanisms:

Dynamic Beam Width (DBW) – schedules increasing Beam sizes (e.g., 128→256→512) instead of a fixed width, cutting intermediate computation without hurting final candidate quality.

Traffic‑Aware Adaptive Beam Search (TABS) – automatically adjusts Beam size based on real‑time QPS: larger Beam during low traffic to improve quality, smaller Beam during peaks to meet latency and throughput.
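The two mechanisms can be sketched as simple scheduling functions; the 128→256→512 schedule matches the example quoted above, while every QPS threshold and TABS width below is an illustrative assumption, not a value from the paper.

```python
def beam_schedule(level, widths=(128, 256, 512)):
    """Dynamic Beam Width: narrow beams at early SID levels, widening toward
    the final level, instead of paying a fixed maximum width everywhere."""
    return widths[min(level, len(widths) - 1)]

def tabs_width(current_qps, low_qps=300, high_qps=450,
               wide=1024, base=512, narrow=256):
    """Traffic-Aware Adaptive Beam Search: widen the beam under light traffic
    to improve candidate quality, shrink it at peak load to hold the latency
    SLO. The thresholds here are illustrative, not the paper's values."""
    if current_qps < low_qps:
        return wide
    if current_qps > high_qps:
        return narrow
    return base
```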

Additional engineering tricks include Beam‑shared KV Cache (+212.5 % QPS), Top‑K pre‑pruning (+184.8 % QPS), FP8 low‑precision inference (+50.3 % QPS), and short TTL result caching (+27.8 % QPS). The system achieves <100 ms latency with 500+ QPS per L20 GPU.

5. Experimental Results: Revenue and Performance Gains

5.1 Overall Performance and Ablation

RSPO provides the largest single‑component gain, outperforming DPO and GRPO, confirming the indispensability of list‑level RL in advertising.

LazyAR doubles throughput with negligible accuracy loss, outperforming DeepSeek‑MTP.

DBS further improves efficiency without sacrificing revenue; TABS can even increase revenue during low‑traffic periods.

5.2 Scaling Law

Model size scaling from 0.03 B to 0.32 B yields revenue lifts from +2.13 % to +4.43 % and continuously decreasing training loss, establishing a scaling law for generative ad recommendation.

Increasing Beam width from 128 to 1024 raises revenue from +2.33 % to +4.21 %, showing that stronger inference search unlocks additional model potential, echoing test‑time scaling trends in LLMs.

5.3 UA‑SID Quality

After instruction tuning and co‑occurrence learning, the unified ad embedding (UAE) achieves R@1 = 0.896, far surpassing the QARM baseline (0.541) and the original Qwen3‑VL‑7B (0.769). MG‑MR quantization reduces the SID collision rate from 85.44 % to 18.26 % and improves codebook utilization more than threefold.

5.4 Business Metrics

Commercial ad revenue: +4.2 %

Small‑and‑medium advertiser spend: +17.5 %

Ad conversion rate: +10.17 %

Low‑activity user conversion rate: +7.28 %

The content‑based SID also provides stronger generalization and real‑time indexing, supporting cold‑start materials and achieving a three‑way win for the platform, advertisers, and users.

6. Conclusion and Reflections

The paper’s value lies not only in the 4.2 % revenue uplift but also in answering the critical question: how should generative recommendation be designed for the hardest industrial scenario—advertising?

The answer is: do not copy LLMs verbatim; design recommendation‑native solutions.

Tokenization must encode business signals, not just semantic content (UA‑SID + MG‑MR).

Training must move beyond pointwise probability generation to value‑aware list‑level optimization (VSL + RSPO).

Inference must be tailored to short sequences, many candidates, and Beam Search (LazyAR + DBS).

The system must be fully online, with real‑time indexing, online learning, and closed‑loop feedback.

GR4AD marks a milestone for generative recommendation in core advertising scenarios, validated on real traffic from over 400 million users, and is likely to inspire further adoption across ad platforms.

Tags: Advertising · Online Learning · Real-time Inference · Generative Recommendation · Ranking Optimization