How OneRec Redefines Recommendation with End‑to‑End Generative Modeling and RL Alignment
The OneRec system from Kuaishou replaces the traditional cascade recommendation pipeline with a single encoder‑decoder model and aligns it to user preferences through reward‑based reinforcement learning. The result: a ten‑fold gain in FLOPs efficiency, roughly 90% lower operational costs, and significant user‑engagement improvements across short‑video and local‑service scenarios.
Recently, Kuaishou's recommendation model team introduced OneRec, an end‑to‑end generative recommendation system that adopts an encoder‑decoder framework and incorporates a reward‑driven preference alignment method based on reinforcement learning. The system directly generates video recommendations that match user preferences, achieving a ten‑fold increase in FLOPs efficiency while reducing communication and storage costs by nearly 90%.
Main Contributions
Single‑stage encoder‑decoder generation framework: The encoder compresses the full lifecycle of user behavior sequences for precise interest modeling, while a Mixture‑of‑Experts (MoE) decoder provides massive parameter scalability for short‑video recommendation.
Reward‑based preference alignment: A multi‑dimensional reward system (preference, format, industrial) guides the model via reinforcement learning, enabling fine‑grained capture of user preferences.
First industrial‑grade end‑to‑end generative recommendation deployment: OneRec is deployed on both the main Kuaishou app and Kuaishou Lite. In a 5% traffic A/B test, the pure generative model matched the performance of the complex traditional pipeline; with reward‑model selection it further increased app stay time (+0.54% on the main app, +1.24% on Lite) and 7‑day user lifetime value (+0.05% / +0.08%).
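The MoE decoder mentioned among the contributions can be illustrated with a toy top‑k‑routed feed‑forward layer. Everything here (dimensions, expert count, routing rule) is an illustrative assumption, not OneRec's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MoELayer:
    """Toy Mixture-of-Experts feed-forward layer with top-k routing.

    Each token is sent to its top_k highest-scoring experts; their
    outputs are mixed with renormalised gate weights. Hypothetical
    sizes, not OneRec's real decoder.
    """
    def __init__(self, d_model=16, d_ff=32, n_experts=4, top_k=2):
        self.top_k = top_k
        self.gate = rng.standard_normal((d_model, n_experts)) * 0.02
        self.w1 = rng.standard_normal((n_experts, d_model, d_ff)) * 0.02
        self.w2 = rng.standard_normal((n_experts, d_ff, d_model)) * 0.02

    def __call__(self, x):  # x: (tokens, d_model)
        scores = softmax(x @ self.gate)               # (tokens, n_experts)
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            top = np.argsort(scores[t])[-self.top_k:]  # chosen experts
            w = scores[t, top] / scores[t, top].sum()  # renormalised gates
            for e, g in zip(top, w):
                h = np.maximum(x[t] @ self.w1[e], 0.0)  # ReLU FFN
                out[t] += g * (h @ self.w2[e])
        return out

layer = MoELayer()
tokens = rng.standard_normal((3, 16))
print(layer(tokens).shape)  # (3, 16)
```

Only top_k of the n_experts run per token, which is how MoE decoders add parameters without a proportional increase in per‑token compute.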
The system also explores scaling laws for recommendation models, demonstrating that loss consistently decreases as parameters grow from 0.015B to 2.633B, mirroring trends observed in large language models.
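A scaling law of this kind is usually checked by fitting a power law to (parameter count, loss) pairs. The data points below are synthetic stand‑ins (only the 0.015B and 2.633B endpoints come from the article); the fit procedure is the standard log‑log regression:

```python
import numpy as np

# Hypothetical (parameters, loss) pairs spanning 0.015B to 2.633B params.
# The endpoints match the model sizes named in the article; the loss
# values are illustrative, not OneRec's reported measurements.
params = np.array([0.015e9, 0.121e9, 0.936e9, 2.633e9])
loss = np.array([0.82, 0.71, 0.63, 0.59])

# Fit loss ~ a * params^(-alpha): linear regression in log-log space.
slope, intercept = np.polyfit(np.log(params), np.log(loss), 1)
alpha = -slope
print(f"fitted exponent alpha = {alpha:.3f}")
```

A positive fitted exponent confirms the "loss consistently decreases as parameters grow" trend; LLM scaling-law studies use the same functional form.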
RL Preference Alignment
OneRec builds a comprehensive reward system comprising preference rewards (to align with user tastes), format rewards (to ensure valid token generation), and industrial rewards (to satisfy business‑specific goals). Using an improved ECPO algorithm, the model stabilizes training and avoids gradient explosion.
Online experiments show that reinforcement learning improves user engagement without sacrificing exposure volume, and that format rewards mitigate the "squeeze effect" that would otherwise reduce the share of validly formatted generated outputs.
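The multi‑reward setup can be sketched as combining the three channels into one scalar, optimized with a clipped policy‑gradient surrogate. The linear combination, the weights, and the PPO‑style clip are all assumptions for illustration; the article states only that ECPO stabilizes training and avoids gradient explosion, not its exact form:

```python
import numpy as np

def total_reward(preference, format_valid, industrial,
                 w_pref=1.0, w_fmt=1.0, w_ind=0.5):
    """Combine preference, format, and industrial rewards into one scalar.

    Weighted sum and weights are illustrative assumptions; the source
    only says the three reward types jointly guide the model.
    """
    return w_pref * preference + w_fmt * float(format_valid) + w_ind * industrial

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective: a generic stand-in for the
    stabilising clip that ECPO applies (exact mechanism assumed)."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - eps, 1 + eps) * advantage)

r = total_reward(preference=0.8, format_valid=True, industrial=0.3)
print(f"{r:.2f}")  # 1.95

obj = clipped_surrogate(ratio=np.array([1.5]), advantage=np.array([1.0]))
print(obj)  # clip caps the update even though the raw ratio is 1.5
```

The clip bounds how far a single update can move the policy, which is the standard way such algorithms keep gradients from exploding.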
Performance Optimizations
OneRec dramatically reduces the number of compute operators from over 15,000 to roughly 1,200, raising training MFU (model FLOPs utilization) to 23.7% and inference MFU to 28.6%, a 3‑5× improvement over traditional models. Optimizations include computation sharing across beams, flash‑attention, embedding acceleration on GPUs, mixed‑precision BFloat16 training, and dynamic batching for inference.
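Of the optimizations listed, dynamic batching is the simplest to sketch: incoming requests are greedily packed into batches under a token budget so the GPU stays saturated. This is a minimal illustration of the general technique; OneRec's actual scheduler is not described at this level of detail:

```python
def dynamic_batches(requests, max_batch_tokens=64):
    """Greedily pack (request_id, token_length) pairs into batches whose
    total token count stays within a budget.

    Minimal sketch of inference-side dynamic batching; the budget and
    the greedy policy are illustrative assumptions.
    """
    batches, current, used = [], [], 0
    for req_id, n_tokens in requests:
        if current and used + n_tokens > max_batch_tokens:
            batches.append(current)   # budget exceeded: flush batch
            current, used = [], 0
        current.append(req_id)
        used += n_tokens
    if current:
        batches.append(current)
    return batches

reqs = [("a", 30), ("b", 30), ("c", 20), ("d", 50)]
print(dynamic_batches(reqs))  # [['a', 'b'], ['c'], ['d']]
```

Production schedulers additionally consider latency deadlines and padding waste, but the packing-under-a-budget idea is the same.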
These engineering advances enable OneRec to achieve near‑LLM level compute efficiency while delivering substantial business gains: a 90% reduction in operational cost, 21% GMV uplift in local‑service scenarios, and consistent improvements across all interaction metrics.
Future Directions
Improve inference scalability to support larger generation step counts.
Integrate multimodal bridging to unify user behavior, video content, and large‑model representations.
Develop a more sophisticated reward system to further guide model behavior.
Overall, OneRec demonstrates that generative, end‑to‑end architectures can surpass traditional recommendation pipelines in both effectiveness and efficiency.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.