OneReason: Enabling Recommendation Systems to Reason

OneReason introduces a systematic reasoning capability into industrial recommendation models through multi‑stage pre‑training, chain‑of‑thought fine‑tuning, and reinforcement learning, achieving significant gains in click‑through, revenue, and cross‑domain recommendation performance while preserving the underlying language abilities of the base model.

Machine Learning Algorithms & Natural Language Processing

Background

Over the past decade, recommendation systems have focused on scaling statistical co‑occurrence between users and items, moving from collaborative filtering to deep models and then to the generative OneRec series. Scaling compute has delivered large gains, but in the LLM era pure scaling runs into hard limits: cold‑start users, long‑tail items, cross‑domain transfer, and multi‑objective weighting.

Why Reasoning Matters

Large language models have progressed from scaling to reasoning and agentic behavior (e.g., OpenAI o1, DeepSeek R1). The same reasoning step is needed in recommendation to unlock a new growth curve, but it cannot be a simple copy of the LLM paradigm.

Three Fundamental Questions for Recommendation Reasoning

Cause‑effect inference : Recommendation is inherently a "reverse‑causality" problem—inferring why a specific item fits the current context from noisy, sparse user behavior.

Explainable, intervenable cognition : A reasoning‑enabled base model should expose its decision chain, so that business constraints can be written directly into the reasoning layer and iteration cycles become shorter.

Agentic foundation : Future agentic recommender systems require a base model that understands item semantics, reasons, and follows stable instructions.

OneReason Overview

OneReason is the first systematic attempt to inject reasoning into a recommendation foundation model. Its core improvements are:

578 B token three‑stage pre‑training that progressively aligns recommendation and general knowledge semantics.

A CoT format based on induction, abduction, and deduction, taught during the SFT stage.

A "specialize‑then‑unify" reinforcement‑learning pipeline that balances multi‑business capabilities and makes CoT truly beneficial.

Pre‑training Design

The pre‑training data are organized into four hierarchical layers—Token, Item, Relational, and User—totaling 578 B tokens. The stages are:

Warm‑up (110 B) : Freeze the backbone, train new item embeddings.

Full‑parameter (449 B) : Jointly align all four data layers.

Long‑sequence (19 B) : Expand context window to 32 K for long user histories.
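The stage budgets above can be checked with a small schedule sketch. This is only an illustration: the `Stage` class, the parameter-group names, the reading of 32 K as 32 × 1024 tokens, and the assumption that the long‑sequence stage also trains the backbone are hypothetical and not stated in the source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    name: str
    tokens_b: int                 # token budget in billions, taken from the text
    train_backbone: bool          # False means the backbone stays frozen
    context_len: Optional[int]    # None where the text gives no value


# Budgets come from the article; everything else is an assumption.
STAGES: List[Stage] = [
    Stage("warm-up", 110, False, None),
    Stage("full-parameter", 449, True, None),
    Stage("long-sequence", 19, True, 32 * 1024),  # assumes 32 K = 32768 tokens
]


def total_tokens_b(stages: List[Stage]) -> int:
    """Sum the per-stage token budgets (in billions)."""
    return sum(s.tokens_b for s in stages)


def trainable_groups(stage: Stage) -> List[str]:
    """Return hypothetical parameter-group names that receive gradients."""
    groups = ["item_embeddings"]
    if stage.train_backbone:
        groups.append("backbone")
    return groups
```

The three budgets add up to the 578 B total quoted in the overview, and only the warm‑up stage restricts training to the new item embeddings.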

This staged alignment narrows the semantic gap between item tokens and natural language, improving R0 item perception (+160.5 % anchor growth, +35.7 % understanding) and R3 cross‑domain recommendation (+65.1 %).

SFT Design

SFT does not treat recommendation as ordinary QA. It teaches the model to:

Perceive item semantics (R0).

Derive item‑to‑item relations using an LLM judge (R1).

Synthesize high‑quality interest‑expansion data (R2).

Generate a three‑module CoT chain—Persona Abstraction, Interest Expansion, Transition Inference—to compress long histories into a decision‑ready reasoning trace (R3).

Persona Abstraction defines 20 user‑type prototypes (e.g., family‑oriented, live‑shopping enthusiast) and extracts interpretable priors from noisy behavior. Interest Expansion experiments show a "less is more" effect: narrow hypothesis sets (n = 1‑5) yield the best performance because they keep the reasoning signal focused.
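A minimal sketch of how the three‑module trace might be represented and serialized follows. The source does not give the actual CoT template, so the `ReasoningTrace` class, the field names, and the text layout are hypothetical; only the module names and the 1 to 5 hypothesis range come from the text.

```python
from dataclasses import dataclass
from typing import List

MAX_HYPOTHESES = 5  # upper end of the "n = 1-5" range reported in the text


@dataclass
class ReasoningTrace:
    persona: str          # Persona Abstraction: one user-type prototype
    interests: List[str]  # Interest Expansion: a narrow hypothesis set
    transition: str       # Transition Inference: what the user may want next


def render_trace(trace: ReasoningTrace) -> str:
    """Serialize a trace as three labeled lines (layout is an assumption)."""
    if not 1 <= len(trace.interests) <= MAX_HYPOTHESES:
        raise ValueError("expected between 1 and 5 interest hypotheses")
    return "\n".join([
        f"Persona Abstraction: {trace.persona}",
        "Interest Expansion: " + "; ".join(trace.interests),
        f"Transition Inference: {trace.transition}",
    ])
```

For example, `render_trace(ReasoningTrace("family-oriented", ["toddler toys"], "likely to browse picture books next"))` yields a three‑line trace that a model could emit before generating item tokens.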

Reinforcement‑Learning Design

Because recommendation rewards are sparse and multi‑modal, OneReason adapts GRPO with three key changes:

Two‑stage trajectory generation : First generate a reasoning trace, then expand it into multiple candidate items, increasing effective trajectories.

Set‑wise reward : Evaluate a list of candidates jointly for coverage and diversity, encouraging multi‑interest exploration.

Stabilized training : Different clipping ranges for reasoning tokens vs. item tokens and down‑weighting non‑hit samples reduce gradient noise.
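The three changes can be illustrated with a toy, framework‑free sketch. It is not the authors' implementation: the reward formula, the clipping values, the non‑hit weight, and all function names are assumptions chosen for illustration. Only the general ideas come from the text: a joint reward over a candidate set, group‑relative advantages as in GRPO, and separate clipping ranges per token type.

```python
from statistics import mean, pstdev
from typing import Dict, List, Set


def setwise_reward(candidates: List[str], clicked: Set[str],
                   item_category: Dict[str, str],
                   diversity_weight: float = 0.5) -> float:
    """Score a whole candidate list: hit coverage plus category diversity."""
    unique = list(dict.fromkeys(candidates))  # drop duplicates, keep order
    if not unique:
        return 0.0
    coverage = len(set(unique) & clicked) / max(len(clicked), 1)
    categories = {item_category.get(c, c) for c in unique}
    diversity = len(categories) / len(unique)
    return coverage + diversity_weight * diversity


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: normalize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, is_reasoning_token: bool,
                 clip_reasoning: float = 0.2, clip_item: float = 0.1) -> float:
    """PPO-style clipped objective with a token-type-specific clip range."""
    eps = clip_reasoning if is_reasoning_token else clip_item
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def sample_weight(is_hit: bool, non_hit_weight: float = 0.5) -> float:
    """Down-weight trajectories whose candidate set contains no hit."""
    return 1.0 if is_hit else non_hit_weight
```

Which token type should get the wider clip range, and how strongly non‑hit samples are down‑weighted, are not specified in the text; the defaults here are placeholders.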

The pipeline also introduces a "specialize‑then‑unify" strategy: domain‑specific RL fine‑tuning (e.g., video, e‑commerce, ads, live) followed by knowledge fusion via RFT (rejection‑sampling fine‑tuning) or MOPD (multi‑teacher on‑policy distillation). RFT preserves high‑quality expert trajectories, while MOPD inherits broader multi‑domain expertise.
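The RFT half of the fusion step can be sketched as a generic rejection‑sampling filter. This is a simplified illustration under stated assumptions: `reward_fn`, `threshold`, and `keep_per_prompt` are hypothetical parameters, and the source does not describe how expert trajectories are actually scored or selected.

```python
from typing import Callable, Dict, List, Tuple

Trajectory = Tuple[str, str]  # (prompt, sampled output from a domain expert)


def rejection_sample(trajectories: List[Trajectory],
                     reward_fn: Callable[[str, str], float],
                     threshold: float,
                     keep_per_prompt: int = 1) -> List[Trajectory]:
    """Keep the best-scoring outputs per prompt whose reward passes the threshold."""
    by_prompt: Dict[str, List[Tuple[float, str]]] = {}
    for prompt, output in trajectories:
        score = reward_fn(prompt, output)
        if score >= threshold:  # reject low-reward samples
            by_prompt.setdefault(prompt, []).append((score, output))

    dataset: List[Trajectory] = []
    for prompt, scored in by_prompt.items():
        scored.sort(key=lambda pair: pair[0], reverse=True)
        dataset.extend((prompt, out) for _, out in scored[:keep_per_prompt])
    return dataset  # usable as supervised fine-tuning data for the unified model
```

Pooling the filtered outputs of several domain experts into one dataset is one plausible way to realize the "unify" step; MOPD would instead distill from the experts on the student's own samples.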

Benchmark and Evaluation

OneReason‑Bench defines four capability tiers (R0‑R3) and evaluates item perception, item‑to‑item inference, interest evolution, and final recommendation across four domains. Metrics include Pass@4/64, ROI, and traditional recall@K.
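For readers unfamiliar with these metrics, here is a minimal sketch of recall@K and one common reading of Pass@k. The source does not define Pass@k precisely, so treating it as "at least one of the first k sampled items is a hit" is an assumption, as are the function names.

```python
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k ranked list."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def pass_at_k(samples: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if any of the first k sampled items is relevant, else 0.0."""
    return float(any(item in relevant for item in samples[:k]))
```

Averaging `pass_at_k` over users with k = 4 or k = 64 would give benchmark‑level Pass@4 and Pass@64 figures under this reading.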

Key findings:

Thinking mode outperforms non‑thinking across all domains after RL; pure SFT thinking actually hurts performance.

RL‑enhanced models achieve up to 60 % lift over the strongest baseline (LC‑Rec‑PT‑SFT‑8B) in short‑video recall.

Three‑stage pre‑training raises the upper bound: LC‑Rec with OneReason weights improves ad‑domain hit rate by nearly 5×.

CoT data benefits both thinking and non‑thinking modes, but the optimal ratio of CoT to non‑CoT data differs per domain (balanced for video/live, more CoT for e‑commerce, less for ads).

Business Impact

A 10‑day online A/B test on Kuaishou local‑life ads showed:

Combined fast + slow architecture: +10.33 % exposure and +8.23 % ad revenue, translating to multi‑hundred‑million RMB annual incremental profit.

Fast‑thinking (real‑time) component: +6.83 % exposure and +4.64 % revenue.

Slow‑thinking (offline) component: ROI > 5.

Conclusion and Outlook

OneReason demonstrates that recommendation models can reason when item semantics are aligned with natural language and a suitable CoT format is used. The three‑module CoT (Persona → Interest → Transition) is effective, but reinforcement learning is essential to unlock its potential, since SFT‑only thinking hurts performance. The fast‑slow deployment shows industrial viability with strong ROI. Future work will extend the model toward agentic capabilities such as planning and tool use, paving the way for fully agentic recommender systems.

Key Figures

OneReason overview
Three‑stage pre‑training pipeline
Fast‑Slow deployment architecture
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
