How OneRec Revolutionizes Short-Video Recommendations with End-to-End Generative AI
OneRec, an end-to-end generative recommendation system from Kuaishou, combines an encoder-decoder architecture, reward-based preference alignment, and reinforcement learning to improve short-video recommendation. It boosts user engagement and reduces operational costs while exhibiting scaling-law behavior comparable to large language models.
Overview
Kuaishou's recommendation model team recently introduced OneRec, an end-to-end generative recommendation system built on an encoder-decoder framework with a reward-driven preference-alignment method. Reinforcement learning further tunes the model so that it directly generates video recommendations matching user preferences. OneRec is deployed on both Kuaishou and Kuaishou Lite, where it handles 25% of online traffic and has increased app dwell time by 0.54% (main app) and 1.24% (Lite).
Key Contributions
Single‑stage encoder‑decoder generation framework: The encoder compresses the full‑lifecycle user behavior sequence for precise interest modeling, while a Mixture‑of‑Experts (MoE) decoder provides massive parameter scalability.
Reward‑based preference alignment: A multi‑dimensional reward system (preference, format, industrial) guides the model via reinforcement learning, enabling fine‑grained capture of user preferences.
First industrial‑grade end‑to‑end generative recommendation deployment: In a week‑long A/B test covering 5% of traffic, the pure generative model achieved performance comparable to the traditional cascade system, and with reward‑model selection it further improved dwell time (+0.54% / +1.24%) and 7‑day user lifetime (LT7) (+0.05% / +0.08%).
System Architecture
OneRec treats recommendation as a sequence generation task. The encoder processes user static features, short‑term and lifelong behavior sequences, and multimodal video signals (title, tags, ASR, visual embeddings). The decoder generates token sequences that map to video IDs. The architecture leverages flash‑attention and shared context computation to reduce redundancy.
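To make the "recommendation as sequence generation" framing concrete, here is a toy sketch (not Kuaishou's code): a decoder emits a fixed-length tuple of semantic-ID tokens, and a lookup table maps each complete tuple back to a concrete video ID. The catalog entries, the scoring stub, and all names are illustrative assumptions.

```python
# Hypothetical catalog: each video is indexed by a 3-level semantic ID.
catalog = {
    (2, 7, 1): "video_A",
    (2, 7, 3): "video_B",
    (5, 0, 9): "video_C",
}

def toy_decoder_step(prefix):
    """Stand-in for the decoder: returns token scores given the prefix.

    A real model would condition on the encoded user sequence; here we
    simply favor tokens that extend some valid catalog entry.
    """
    scores = {t: 0.0 for t in range(10)}
    for sid in catalog:
        if sid[: len(prefix)] == prefix:
            scores[sid[len(prefix)]] += 1.0
    return scores

def generate_video(num_levels=3):
    """Greedy decode: pick the best token at each semantic-ID level."""
    prefix = ()
    for _ in range(num_levels):
        scores = toy_decoder_step(prefix)
        prefix += (max(scores, key=scores.get),)
    return catalog.get(prefix)

print(generate_video())  # greedy decode resolves to a catalog video
```

The key point is that the final "ranking" step disappears: decoding a token sequence and resolving it against the item index is the recommendation.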
Semantic Tokenizer
A collaborative multimodal tokenizer fuses video titles, tags, speech‑to‑text (ASR), and visual features, then applies RQ‑Kmeans to produce three‑level semantic IDs for each video.
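The residual-quantization idea behind RQ-Kmeans can be sketched as follows. This toy uses hand-fixed codebooks instead of learned k-means centroids, and two-dimensional embeddings for readability; each level quantizes the residual left by the previous level, yielding a coarse-to-fine semantic ID.

```python
import numpy as np

# Illustrative codebooks (a real system learns these with k-means per level).
codebooks = [
    np.array([[0.0, 0.0], [10.0, 10.0]]),  # level 1: coarse clusters
    np.array([[0.0, 0.0], [1.0, 1.0]]),    # level 2: refines the residual
    np.array([[0.0, 0.0], [0.1, 0.1]]),    # level 3: finest refinement
]

def rq_encode(vec):
    """Return the multi-level semantic ID for an embedding vector."""
    residual = np.asarray(vec, dtype=float)
    ids = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        ids.append(idx)
        residual = residual - cb[idx]  # next level quantizes what's left
    return tuple(ids)

print(rq_encode([11.1, 11.1]))
```

Because each level only encodes the remaining error, a small codebook at every level still yields a large combined ID space, which is why the decoder can address a huge video catalog with short token sequences.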
Reinforcement Learning Preference Alignment
The system defines three reward types: Preference Reward (aligns with user preference), Format Reward (ensures valid token formats), and Industrial Reward (covers business‑specific goals). An improved ECPO algorithm stabilizes training by clipping gradients for negative‑advantage samples.
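A rough sketch of this setup is below: a weighted combination of the three reward signals, and a PPO-style surrogate in which the importance ratio is additionally capped for negative-advantage samples. The weights and the clipping rule are illustrative assumptions standing in for the stabilization idea described for ECPO, not the published objective.

```python
def total_reward(pref, fmt, ind, w=(1.0, 1.0, 1.0)):
    """Weighted sum of preference, format, and industrial rewards.

    The equal weights are placeholders, not the production configuration.
    """
    return w[0] * pref + w[1] * fmt + w[2] * ind

def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-like surrogate with an extra cap for negative advantages.

    Capping the ratio on negative-advantage samples bounds how large a
    penalizing gradient any single sample can contribute, which is the
    stabilization effect attributed to ECPO here (exact formula may differ).
    """
    if advantage < 0:
        ratio = min(ratio, 1.0 + eps)  # prevent runaway negative-sample gradients
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# With ratio 2.0 and advantage -1.0, plain PPO would contribute -2.0;
# the extra cap bounds the surrogate at -(1 + eps).
print(clipped_objective(2.0, -1.0))
```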
Scaling Laws
Experiments show that increasing model parameters from 0.015B to 2.633B consistently reduces training loss, indicating that recommendation models follow the same scaling behavior observed in large language models.
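This kind of scaling claim is usually checked by fitting a power law, loss ≈ a · N^(-b), as a straight line in log-log space. The sketch below uses synthetic data points (not OneRec's measurements) at the parameter counts mentioned above and recovers the exponent it was generated with.

```python
import numpy as np

# Synthetic loss-vs-parameters data following loss = a * N^(-b) exactly.
params = np.array([0.015e9, 0.121e9, 0.935e9, 2.633e9])
loss = 5.0 * params ** -0.05  # made-up a=5.0, b=0.05 for illustration

# In log space the power law is linear: log(loss) = -b*log(N) + log(a).
b, log_a = np.polyfit(np.log(params), np.log(loss), 1)
print(f"fitted exponent: {b:.3f}")
```

A consistent negative slope across model sizes is what "follows the same scaling behavior as LLMs" means operationally: more parameters buy a predictable reduction in loss.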
Performance Optimizations
OneRec reduces the number of operators from >15,000 to ~1,200, achieving an MFU (Model FLOPs Utilization) of 23.7% (training) and 28.6% (inference), a 3‑5× improvement over traditional models. Optimizations include request‑level batching, flash‑attention, GPU‑only embedding training (the SKAI system), mixed‑precision BFloat16, and kernel fusion.
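MFU itself is a simple ratio: the model FLOPs actually achieved per second divided by the hardware's theoretical peak. The helper below is illustrative; the 312 TFLOPs figure is the A100 BF16 dense peak used here as a placeholder, not the hardware Kuaishou reports against.

```python
def model_flops_utilization(achieved_tflops, peak_tflops):
    """MFU = achieved model TFLOPs / hardware peak TFLOPs."""
    return achieved_tflops / peak_tflops

# On a 312 TFLOPs-peak accelerator, reaching the 23.7% training MFU cited
# above would require roughly 74 achieved TFLOPs.
print(f"{model_flops_utilization(73.9, 312.0):.1%}")
```

Low MFU is typical of traditional recommendation models dominated by thousands of small, memory-bound operators, which is why collapsing the graph to ~1,200 operators matters.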
Inference Optimizations
Computation reuse: encoder computed once per request; decoder cross‑attention keys/values shared across beams; KV cache for history.
Operator‑level fusion for MoE, attention, and beam search using Float16.
Dynamic batching to maximize GPU utilization.
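The computation-reuse item above can be sketched as follows: run the encoder once per request, derive the cross-attention keys/values from its output once, and let every beam hypothesis read the same cached values. The classes and string stand-ins are illustrative; the real system does this with fused GPU kernels and tensors.

```python
class EncoderCache:
    """Per-request cache: one encoder pass, K/V shared across all beams."""

    def __init__(self, user_features):
        # Encoder forward pass happens exactly once per request.
        self.encoder_output = self._encode(user_features)
        self.encode_calls = 1
        # Cross-attention K/V derived once from the encoder output.
        self.cross_kv = ("K:" + self.encoder_output, "V:" + self.encoder_output)

    def _encode(self, feats):
        return "enc(" + ",".join(feats) + ")"

def decode_beams(cache, num_beams=4):
    """Every beam reuses cache.cross_kv instead of recomputing it."""
    return [f"beam{i}<-{cache.cross_kv[0]}" for i in range(num_beams)]

cache = EncoderCache(["short_term", "lifelong", "static"])
beams = decode_beams(cache)
print(cache.encode_calls, len(beams))
```

Since beam search multiplies decoder work by the beam width, hoisting the encoder and cross-attention K/V out of the per-beam loop removes the largest source of redundant computation in generative serving.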
Online Experiment Results
In a week‑long A/B test covering 5% of traffic, OneRec with reward‑model selection increased main‑app dwell time by 0.54% and Lite dwell time by 1.24%, while LT7 grew by 0.05% (main) and 0.08% (Lite). All interaction metrics (likes, follows, comments) showed positive lifts, confirming the system’s ability to avoid the “trade‑off” effect of multi‑objective traditional pipelines. The model now serves 25% of QPS in short‑video recommendation.
In the local‑life services scenario, OneRec boosted GMV by 21.01%, order volume by 17.89%, and new‑user acquisition by 23.02% after full traffic rollout.
Conclusion and Future Directions
OneRec demonstrates that an end‑to‑end generative architecture, combined with deep system optimizations, can surpass traditional cascade recommendation pipelines in both effectiveness and efficiency. Remaining challenges include improving inference scalability, integrating multimodal user behavior with LLM/VLM paradigms, and designing a more comprehensive reward system.
Recruitment
The Kuaishou recommendation model team is hiring for positions such as Recommendation Large‑Model Algorithm Engineer, Recommendation Algorithm Engineer, and related internship roles. Interested candidates can submit resumes to the provided email addresses.