Kuaishou’s Recommendation-as-Generation Shatters Content Limits, Boosts Ads

RaG (Recommendation-as-Generation) redefines short‑video recommendation: it predicts user interests as discrete semantic IDs and generates personalized video ads through a multi‑agent pipeline. Deployed at industrial scale for over 400 million daily users, it delivers a 1.87 % lift in ad revenue.

Data Party THU

Problem

Traditional short‑video recommendation follows a retrieve‑and‑rank pipeline: user profile → interest modeling → retrieve existing videos → rank & serve. This pipeline cannot serve a user when the desired video does not exist in the content pool.

Recommendation‑as‑Generation (RaG)

RaG replaces the “find video” step with “produce video”. It first predicts a user’s latent interest as discrete semantic identifiers (D‑SIDs) and then generates a personalized video that matches those identifiers. Deployed in Kuaishou’s ad system, it serves >400 million daily active users and yields a +1.870 % ad‑revenue lift over a strong Generative Recommendation Model (GRM) baseline.

Real‑world example

A young male fitness enthusiast receives a customized ad for “beauty‑endorsed protein powder” that aligns with his interests in beauty, fitness and low‑fat diet, demonstrating fine‑grained personalization.

From retrieval to generation

Traditional pipeline: profile → interest → retrieve video → rank → serve.

RaG pipeline: profile → interest semantic IDs (D‑SIDs) → production instructions → video generation → feedback loop.

Core challenges

Semantic bridging: unify discrete recommendation signals with continuous multimodal generation signals.

Industrial‑scale generation: generate high‑quality personalized videos for billions of requests within strict latency constraints.

Disentangled Semantic IDs (D‑SIDs)

D‑SIDs split a video representation into Content SIDs (what the video is about) and Creative SIDs (how it is presented). Videos are encoded with Qwen2.5‑VL‑7B‑Instruct and captioned; each SID type is then quantized with a two‑layer codebook (8192 codes per layer, four layers in total across both types). The collision rate drops from 18.24 % (QARM) to 2.62 %.
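As a rough illustration of how a two‑layer codebook turns an embedding into a pair of discrete codes, the sketch below uses residual quantization over tiny toy codebooks. Residual quantization, the codebook values, the function names, and the collision‑rate definition (share of items whose SID is also held by another item) are all assumptions made for illustration; the article only states that each SID type uses two codebook layers of 8192 codes.

```python
from collections import Counter

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(code):
        return sum((a - b) ** 2 for a, b in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def quantize_two_layer(vec, layer1, layer2):
    """Return (code1, code2); layer 2 quantizes the residual left by layer 1."""
    c1 = nearest(layer1, vec)
    residual = [a - b for a, b in zip(vec, layer1[c1])]
    c2 = nearest(layer2, residual)
    return (c1, c2)

def collision_rate(sids):
    """Fraction of items whose SID tuple is shared with at least one other item."""
    counts = Counter(sids)
    collided = sum(n for n in counts.values() if n > 1)
    return collided / len(sids)

# Toy 2-D codebooks (hypothetical values; real ones would hold 8192 codes each).
LAYER1 = [(0.0, 0.0), (1.0, 1.0)]
LAYER2 = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2)]

items = [(0.15, 0.0), (1.0, 1.2), (0.9, 1.0), (1.0, 1.0)]
sids = [quantize_two_layer(v, LAYER1, LAYER2) for v in items]
print(sids, collision_rate(sids))
```

In this toy run the last two items land on the same code pair, so half of the items collide. A lower collision rate means more videos receive a unique identifier.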

Generative Recommendation Model (GRM)

GRM autoregressively predicts D‑SIDs from user profile, behavior sequences and item features. The output is a set of interest‑semantic tokens consumable by downstream generation.
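Autoregressive prediction means each D‑SID token is chosen conditioned on the tokens already emitted. The sketch below shows that loop with greedy selection and a toy scoring function standing in for the model. The greedy strategy, the four‑token output length, and `toy_scores` are assumptions for illustration, not details from the article.

```python
from typing import Callable, Dict, List, Sequence

ScoreFn = Callable[[Sequence[str], List[int]], Dict[int, float]]

def greedy_decode(context: Sequence[str], score_fn: ScoreFn, num_tokens: int) -> List[int]:
    """Emit num_tokens codes, each picked given the context and the prefix so far."""
    prefix: List[int] = []
    for _ in range(num_tokens):
        scores = score_fn(context, prefix)
        prefix.append(max(scores, key=scores.get))
    return prefix

def toy_scores(context: Sequence[str], prefix: List[int]) -> Dict[int, float]:
    """Hypothetical stand-in for the model: favours (len(context) + sum(prefix)) % 4."""
    target = (len(context) + sum(prefix)) % 4
    return {code: (1.0 if code == target else 0.0) for code in range(4)}

# Two user-behaviour events as context; four tokens could cover two content
# layers plus two creative layers (the ordering is an assumption).
d_sid = greedy_decode(["watched:gym", "liked:skincare"], toy_scores, num_tokens=4)
print(d_sid)
```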

Instruction Model (IM)

IM translates D‑SIDs and ad metadata into shot‑level production instructions (camera actions, voice‑over, music, subtitles, CTA timing). Gemini 2.5 Pro provides the supervision, and Qwen3‑8B is fine‑tuned in three stages: (1) freeze LLM projector, (2) joint fine‑tune projector and LLM, (3) reward‑optimized fine‑tuning. Default configuration: 8 B parameters, 1 M training samples.
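To make "shot‑level production instructions" concrete, here is one way such an output could be structured and serialized for downstream agents. The class names, field names, and sample values are hypothetical; only the instruction categories (camera actions, voice‑over, music, subtitles, CTA timing) come from the article.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass
class ShotInstruction:
    camera_action: str
    voice_over: str
    subtitle: str
    duration_s: float

@dataclass
class ProductionPlan:
    content_sid: List[int]
    creative_sid: List[int]
    music: str
    cta_time_s: float  # when the call-to-action appears, in seconds
    shots: List[ShotInstruction] = field(default_factory=list)

    def total_duration(self) -> float:
        return sum(shot.duration_s for shot in self.shots)

plan = ProductionPlan(
    content_sid=[12, 7],
    creative_sid=[3, 41],
    music="upbeat",
    cta_time_s=5.0,
    shots=[
        ShotInstruction("close-up", "Fuel your workout.", "Fuel your workout", 3.0),
        ShotInstruction("pan", "Order today.", "Order today", 3.0),
    ],
)
payload = json.dumps(asdict(plan))  # what a downstream agent might receive
```

A plain serializable structure like this keeps the instruction model and the generation agents loosely coupled, and lets simple checks (for example, that the CTA falls inside the video duration) run before any expensive generation starts.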

Video Generation Agents (VGAs)

VGAs decompose video creation into three agents (visual, audio, effects). Each agent selects actions such as text‑to‑video, image‑to‑video, TTS, BGM, subtitles, or special effects. Reasoning (cross‑modal consistency) and reflection (self‑correction) are limited to at most two rounds to keep latency low. Experiments show VGAs outperform a fixed‑pipeline baseline.
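The two‑round cap on reflection can be pictured as a bounded generate‑critique loop. The sketch below is a minimal version under that reading; `generate`, `critique`, and the toy agent are hypothetical placeholders, and the real agents' interfaces are not described in the article.

```python
from typing import Callable, Dict, List, Tuple

MAX_REFLECTION_ROUNDS = 2  # the article caps reasoning/reflection at two rounds

Draft = Dict[str, object]

def generate_with_reflection(
    generate: Callable[[str, List[str]], Draft],
    critique: Callable[[Draft], List[str]],
    instruction: str,
) -> Tuple[Draft, int]:
    """Generate a draft, then regenerate at most MAX_REFLECTION_ROUNDS times."""
    draft = generate(instruction, [])
    rounds = 0
    while rounds < MAX_REFLECTION_ROUNDS:
        issues = critique(draft)
        if not issues:
            break  # draft passed the check; stop early to save latency
        draft = generate(instruction, issues)
        rounds += 1
    return draft, rounds

def toy_generate(instruction: str, issues: List[str]) -> Draft:
    # Hypothetical agent: fixes whatever the critic reported last round.
    return {"instruction": instruction, "has_subtitles": "missing subtitles" in issues}

def toy_critique(draft: Draft) -> List[str]:
    return [] if draft["has_subtitles"] else ["missing subtitles"]

video, rounds = generate_with_reflection(toy_generate, toy_critique, "protein powder ad")
```

If the critic never passes a draft, the loop still terminates after two regenerations and returns the latest draft, which bounds the latency.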

Synergistic Cross‑Domain Reward Learning (SCRL)

SCRL unifies three reward signals:

User Feedback Reward: clicks, likes, purchases, dense engagement estimates.

Interest Alignment Reward: consistency between generated video and predicted D‑SIDs.

Video Quality Reward: visual fidelity, audio‑visual sync, subtitle/CTA alignment.

User feedback is the primary objective; the other two act as constraints. GDPO normalizes reward scales, and PID‑controlled Lagrangian multipliers dynamically adjust constraint weights. Ablation studies confirm each reward contributes positively.
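A PID‑controlled Lagrangian multiplier raises the weight of a constraint while it is violated and relaxes it once the constraint is met. The sketch below shows one common form of that update; the gains, the target threshold, and the class name are assumptions, and GDPO's reward normalization is left out because the article does not describe it.

```python
from typing import List, Tuple

class PIDMultiplier:
    """Lagrange multiplier for one constraint, adjusted by a PID rule."""

    def __init__(self, target: float, kp: float, ki: float, kd: float) -> None:
        self.target = target
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0
        self.value = 0.0

    def update(self, constraint_value: float) -> float:
        error = self.target - constraint_value  # > 0 means the constraint is violated
        self.integral = max(0.0, self.integral + error)
        derivative = error - self.prev_error
        self.prev_error = error
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        self.value = max(0.0, raw)  # multipliers stay non-negative
        return self.value

def lagrangian(user_reward: float, constraints: List[Tuple[float, PIDMultiplier]]) -> float:
    """Primary reward plus multiplier-weighted constraint slack."""
    return user_reward + sum(m.value * (c - m.target) for c, m in constraints)

# Hypothetical interest-alignment constraint: score should reach at least 0.75.
alignment = PIDMultiplier(target=0.75, kp=1.0, ki=0.5, kd=0.5)
lam_violated = alignment.update(0.5)             # below target, multiplier rises
objective = lagrangian(1.0, [(0.5, alignment)])  # user reward is penalized
lam_satisfied = alignment.update(1.0)            # above target, multiplier falls to 0
```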

Industrial deployment

RaG uses an “online interest modeling + near‑line generation + latency‑aware service” architecture. GRM predicts D‑SIDs in milliseconds; IM and VGAs generate videos near‑line and cache results. At serving time:

Both content‑ and creative‑SIDs hit → return cached video.

Content‑SID hit, creative‑SID miss → serve content video and asynchronously generate creative variant.

Content‑SID miss → fallback to nearest‑neighbor video and enqueue missing SID for future generation.

This design avoids per‑request generation cost while continuously expanding the personalized video supply.
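The three serving cases above can be sketched as a single lookup function. The cache layout (content SID mapping to creative SID mapping to video ID), the generation queue, the choice of which cached variant to serve on a creative miss, and the `nearest_neighbor` callback are hypothetical; the article specifies only the hit and miss behaviour.

```python
from typing import Callable, Dict, List, Tuple

Sid = Tuple[int, ...]
Cache = Dict[Sid, Dict[Sid, str]]  # content SID -> creative SID -> video id

def serve(
    content_sid: Sid,
    creative_sid: Sid,
    cache: Cache,
    queue: List[Tuple[Sid, Sid]],
    nearest_neighbor: Callable[[Sid], str],
) -> str:
    variants = cache.get(content_sid)
    if not variants:
        # Content-SID miss: fall back and enqueue the missing SID for generation.
        queue.append((content_sid, creative_sid))
        return nearest_neighbor(content_sid)
    if creative_sid in variants:
        # Both SIDs hit: return the cached video.
        return variants[creative_sid]
    # Content hit, creative miss: serve an existing video for this content and
    # request the creative variant asynchronously (modeled here as a queue).
    queue.append((content_sid, creative_sid))
    return next(iter(variants.values()))

cache: Cache = {(1, 2): {(0, 0): "vid_a"}}
queue: List[Tuple[Sid, Sid]] = []
fallback = lambda sid: "vid_fallback"

full_hit = serve((1, 2), (0, 0), cache, queue, fallback)
creative_miss = serve((1, 2), (5, 5), cache, queue, fallback)
content_miss = serve((9, 9), (0, 0), cache, queue, fallback)
```

No branch generates a video on the request path; misses only add work to the queue, which is what keeps serving latency independent of generation time.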

Online results

RaG achieves a +1.870 % revenue lift over a strong GRM baseline, which itself improves over DLRM by +3.526 %. The gain demonstrates that generative recommendation can translate AIGC capabilities into commercial value at industrial scale.
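The two reported lifts are relative to different baselines, so they are not additive. As a back‑of‑the‑envelope check, the snippet below compounds them, assuming both were measured on the same revenue metric and combine multiplicatively. That assumption is not stated in the article, and the resulting RaG‑over‑DLRM figure is an estimate rather than a reported result.

```python
def compound_lift(*lifts_pct: float) -> float:
    """Combine successive relative lifts (in percent) into one overall lift."""
    total = 1.0
    for pct in lifts_pct:
        total *= 1.0 + pct / 100.0
    return (total - 1.0) * 100.0

# GRM over DLRM (+3.526 %), then RaG over GRM (+1.870 %).
rag_over_dlrm = compound_lift(3.526, 1.870)
print(f"{rag_over_dlrm:.3f} %")  # roughly 5.46 % under the stated assumptions
```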

References

Project page: https://recommendation-as-generation.github.io/

arXiv paper: https://arxiv.org/abs/2606.25496

RaG overall architecture

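Source: 机器之心 (Synced)
The original article is about 4,000 characters long, with a suggested reading time of 8 minutes.
It introduces Kuaishou's new RaG paradigm, which reinvents short‑video recommendation and delivers revenue growth at scale.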

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Data Party THU is the official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
