How OneRec Redefines Recommendation with End‑to‑End Generative Modeling and RL Alignment
The OneRec system from Kuaishou replaces the traditional cascade recommendation pipeline with a single encoder‑decoder model and aligns it to user preferences through reward‑based reinforcement learning. The result: a ten‑fold gain in FLOPs efficiency, roughly 90% lower operational costs, and significant user‑engagement improvements across short‑video and local‑service scenarios.
Recently, Kuaishou's recommendation model team introduced OneRec, an end‑to‑end generative recommendation system that adopts an encoder‑decoder framework and incorporates a reward‑driven preference alignment method based on reinforcement learning. The system directly generates video recommendations that match user preferences, achieving a ten‑fold increase in FLOPs efficiency while reducing communication and storage costs by nearly 90%.
Main Contributions
Single‑stage encoder‑decoder generation framework: The encoder compresses the full lifecycle of user behavior sequences for precise interest modeling, while a Mixture‑of‑Experts (MoE) decoder provides massive parameter scalability for short‑video recommendation.
Reward‑based preference alignment: A multi‑dimensional reward system (preference, format, industrial) guides the model via reinforcement learning, enabling fine‑grained capture of user preferences.
First industrial‑grade end‑to‑end generative recommendation deployment: OneRec is deployed on both the main Kuaishou app and Kuaishou Lite. In a 5% traffic A/B test, the pure generative model matched the performance of the complex traditional pipeline; with reward‑model selection it further increased app stay time (+0.54% on the main app, +1.24% on Lite) and 7‑day user lifetime value (+0.05% / +0.08%).
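The MoE decoder mentioned among the contributions can be illustrated with a toy top‑k‑routed feed‑forward layer. Everything here (dimensions, expert count, routing rule) is an illustrative assumption, not OneRec's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MoELayer:
    """Toy Mixture-of-Experts feed-forward layer with top-k routing.

    Each token is sent to its top_k highest-scoring experts; their
    outputs are mixed with renormalised gate weights. Hypothetical
    sizes, not OneRec's real decoder.
    """
    def __init__(self, d_model=16, d_ff=32, n_experts=4, top_k=2):
        self.top_k = top_k
        self.gate = rng.standard_normal((d_model, n_experts)) * 0.02
        self.w1 = rng.standard_normal((n_experts, d_model, d_ff)) * 0.02
        self.w2 = rng.standard_normal((n_experts, d_ff, d_model)) * 0.02

    def __call__(self, x):  # x: (tokens, d_model)
        scores = softmax(x @ self.gate)               # (tokens, n_experts)
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            top = np.argsort(scores[t])[-self.top_k:]  # chosen experts
            w = scores[t, top] / scores[t, top].sum()  # renormalised gates
            for e, g in zip(top, w):
                h = np.maximum(x[t] @ self.w1[e], 0.0)  # ReLU FFN
                out[t] += g * (h @ self.w2[e])
        return out

layer = MoELayer()
tokens = rng.standard_normal((3, 16))
print(layer(tokens).shape)  # (3, 16)
```

Only top_k of the n_experts run per token, which is how MoE decoders add parameters without a proportional increase in per‑token compute.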
The system also explores scaling laws for recommendation models, demonstrating that loss consistently decreases as parameters grow from 0.015B to 2.633B, mirroring trends observed in large language models.
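A scaling law of this kind is usually checked by fitting a power law to (parameter count, loss) pairs. The data points below are synthetic stand‑ins (only the 0.015B and 2.633B endpoints come from the article); the fit procedure is the standard log‑log regression:

```python
import numpy as np

# Hypothetical (parameters, loss) pairs spanning 0.015B to 2.633B params.
# The endpoints match the model sizes named in the article; the loss
# values are illustrative, not OneRec's reported measurements.
params = np.array([0.015e9, 0.121e9, 0.936e9, 2.633e9])
loss = np.array([0.82, 0.71, 0.63, 0.59])

# Fit loss ~ a * params^(-alpha): linear regression in log-log space.
slope, intercept = np.polyfit(np.log(params), np.log(loss), 1)
alpha = -slope
print(f"fitted exponent alpha = {alpha:.3f}")
```

A positive fitted exponent confirms the "loss consistently decreases as parameters grow" trend; LLM scaling-law studies use the same functional form.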
RL Preference Alignment
OneRec builds a comprehensive reward system comprising preference rewards (to align with user tastes), format rewards (to ensure valid token generation), and industrial rewards (to satisfy business‑specific goals). Using an improved ECPO algorithm, the model stabilizes training and avoids gradient explosion.
Online experiments show that reinforcement learning improves user engagement without sacrificing exposure volume, and that format rewards mitigate the "squeeze effect" that would otherwise reduce the share of validly formatted generated outputs.
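The multi‑reward setup can be sketched as combining the three channels into one scalar, optimized with a clipped policy‑gradient surrogate. The linear combination, the weights, and the PPO‑style clip are all assumptions for illustration; the article states only that ECPO stabilizes training and avoids gradient explosion, not its exact form:

```python
import numpy as np

def total_reward(preference, format_valid, industrial,
                 w_pref=1.0, w_fmt=1.0, w_ind=0.5):
    """Combine preference, format, and industrial rewards into one scalar.

    Weighted sum and weights are illustrative assumptions; the source
    only says the three reward types jointly guide the model.
    """
    return w_pref * preference + w_fmt * float(format_valid) + w_ind * industrial

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective: a generic stand-in for the
    stabilising clip that ECPO applies (exact mechanism assumed)."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - eps, 1 + eps) * advantage)

r = total_reward(preference=0.8, format_valid=True, industrial=0.3)
print(f"{r:.2f}")  # 1.95

obj = clipped_surrogate(ratio=np.array([1.5]), advantage=np.array([1.0]))
print(obj)  # clip caps the update even though the raw ratio is 1.5
```

The clip bounds how far a single update can move the policy, which is the standard way such algorithms keep gradients from exploding.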
Performance Optimizations
OneRec dramatically reduces the number of compute operators from over 15,000 to roughly 1,200, raising training MFU (model FLOPs utilization) to 23.7% and inference MFU to 28.6%, a 3‑5× improvement over traditional models. Optimizations include computation sharing across beams, flash‑attention, embedding acceleration on GPUs, mixed‑precision BFloat16 training, and dynamic batching for inference.
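Of the optimizations listed, dynamic batching is the simplest to sketch: incoming requests are greedily packed into batches under a token budget so the GPU stays saturated. This is a minimal illustration of the general technique; OneRec's actual scheduler is not described at this level of detail:

```python
def dynamic_batches(requests, max_batch_tokens=64):
    """Greedily pack (request_id, token_length) pairs into batches whose
    total token count stays within a budget.

    Minimal sketch of inference-side dynamic batching; the budget and
    the greedy policy are illustrative assumptions.
    """
    batches, current, used = [], [], 0
    for req_id, n_tokens in requests:
        if current and used + n_tokens > max_batch_tokens:
            batches.append(current)   # budget exceeded: flush batch
            current, used = [], 0
        current.append(req_id)
        used += n_tokens
    if current:
        batches.append(current)
    return batches

reqs = [("a", 30), ("b", 30), ("c", 20), ("d", 50)]
print(dynamic_batches(reqs))  # [['a', 'b'], ['c'], ['d']]
```

Production schedulers additionally consider latency deadlines and padding waste, but the packing-under-a-budget idea is the same.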
These engineering advances enable OneRec to achieve near‑LLM level compute efficiency while delivering substantial business gains: a 90% reduction in operational cost, 21% GMV uplift in local‑service scenarios, and consistent improvements across all interaction metrics.
Future Directions
Improve inference scalability to support larger generation step counts.
Integrate multimodal bridging to unify user behavior, video content, and large‑model representations.
Develop a more sophisticated reward system to further guide model behavior.
Overall, OneRec demonstrates that generative, end‑to‑end architectures can surpass traditional recommendation pipelines in both effectiveness and efficiency.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.