From P(y|x) to P(y): Reinforcement Learning in Pre‑train Space Unlocks Endogenous Reasoning

The paper introduces PreRL, which drops the input condition and directly optimizes the marginal distribution over reasoning trajectories, P(y), in large language models. Combined with standard RL in Dual Space RL (DSRL), it achieves consistent gains on math and out‑of‑distribution benchmarks, faster training, and richer reasoning behaviors.

Machine Learning Algorithms & Natural Language Processing

Core Idea: Directly Optimize Reasoning Trajectory

Existing RL methods for large‑model reasoning improve the conditional distribution P(y|x): every update is conditioned on a specific problem x. The authors ask whether conditioning on the input is necessary at all, given that massive pre‑training has already internalized reasoning knowledge.

PreRL – Optimizing the Marginal Distribution

PreRL removes the input condition during updates and optimizes the marginal distribution P(y) directly, applying reward signals to the generated reasoning trajectory itself rather than to a particular problem. This approach aims to reshape the internal reasoning structures learned during pre‑training.
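The contrast between the two objectives can be written out as a rough sketch. The notation below (policy π_θ, reward R, dataset D) is assumed for illustration and is not taken from the paper. It also assumes that trajectories are still sampled with the prompt and scored against the problem, and that only the log-likelihood term in the update drops x.

```latex
% Standard RL: the update is conditioned on the problem x
g_{\mathrm{RL}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \big[ R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \big]

% PreRL (assumed form): same samples and rewards, but the likelihood term drops x
g_{\mathrm{PreRL}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \big[ R(x, y)\, \nabla_\theta \log \pi_\theta(y) \big],
\qquad
\log \pi_\theta(y) = \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid y_{<t})
```

Under this reading, the only difference is the context each token is scored in: y_t given (x, y_<t) for standard RL, and y_t given y_<t alone for PreRL.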

Dual Space RL (DSRL)

DSRL combines a PreRL warm‑up phase with a standard RL fine‑tuning phase using a Policy Reincarnation strategy. The warm‑up phase (NSR‑PreRL) updates only on negative‑reward samples, pruning erroneous reasoning paths and stimulating the model’s innate reasoning ability. The subsequent RL phase (GRPO) refines the policy under normal conditioning.
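The two‑phase schedule can be sketched as a simple step‑based switch. This is a minimal illustration, not the authors' implementation: `Phase`, `dsrl_phase`, and the default of 20 warm‑up steps are hypothetical, and Policy Reincarnation is read here as "the GRPO phase starts from the weights left by the warm‑up", which the summary does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    condition_on_prompt: bool  # False: score y without x (pre-train space)
    negative_only: bool        # True: update only on negative-reward samples


def dsrl_phase(step: int, warmup_steps: int = 20) -> Phase:
    """Return the training phase for a given step (hypothetical helper).

    warmup_steps=20 is an assumed default, not a value prescribed by the paper.
    """
    if step < 0 or warmup_steps < 0:
        raise ValueError("step and warmup_steps must be non-negative")
    if step < warmup_steps:
        # Warm-up: NSR-PreRL, no input condition, negative samples only
        return Phase("NSR-PreRL", condition_on_prompt=False, negative_only=True)
    # Afterwards: standard conditioned RL on the same parameters
    return Phase("GRPO", condition_on_prompt=True, negative_only=False)
```

A training loop would call `dsrl_phase(step)` once per step and pick the loss accordingly, keeping a single set of model parameters across both phases.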

Gradient Alignment Verification

The authors verify that gradients from the marginal‑distribution objective align closely with those from standard RL on high‑probability, deterministic tokens, diverging only on early or highly uncertain tokens. This theoretical result supports the marginal objective as an effective proxy for standard RL.
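The summary does not say how the alignment is measured. A common way to quantify agreement between two gradient vectors is cosine similarity, sketched below on plain Python lists. The function name and the flattened‑vector representation are assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(g_a: Sequence[float], g_b: Sequence[float]) -> float:
    """Cosine of the angle between two flattened gradient vectors.

    Values near 1 mean the two objectives push parameters in nearly the same
    direction; values near 0 mean the directions are close to orthogonal.
    """
    if len(g_a) != len(g_b):
        raise ValueError("gradients must have the same length")
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Computed per token position, such a score would make it possible to compare confident tokens against early or uncertain ones, which is the split the paragraph above describes.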

Positive vs. Negative Sample Reinforcement

Positive‑sample reinforcement (PSR) fails to learn from online trajectories in the pre‑train space, leading to performance degradation.

Negative‑sample reinforcement (NSR) dramatically improves reasoning: after only 20 steps, transition thoughts increase 14.89× and reflection thoughts 6.54×, and the model outperforms the GRPO baseline while using one third as many training steps.
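The difference between PSR and NSR comes down to which samples contribute to the loss. The sketch below uses a plain REINFORCE‑style loss over sequence log‑probabilities. The function name, the ±1 reward convention, and averaging over the kept samples are assumptions for illustration, not details from the paper.

```python
from typing import Sequence


def sample_reinforcement_loss(
    seq_logprobs: Sequence[float],
    rewards: Sequence[float],
    mode: str = "nsr",
) -> float:
    """REINFORCE-style loss restricted to one sign of reward.

    seq_logprobs[i] is the summed token log-probability of trajectory i
    (scored without the prompt in the PreRL setting).
    rewards[i] is assumed to be +1 for a correct answer and -1 otherwise.
    """
    if mode not in ("nsr", "psr"):
        raise ValueError("mode must be 'nsr' or 'psr'")
    if len(seq_logprobs) != len(rewards):
        raise ValueError("inputs must have the same length")
    keep_negative = mode == "nsr"
    kept = [
        (lp, r)
        for lp, r in zip(seq_logprobs, rewards)
        if (r < 0) == keep_negative and r != 0
    ]
    if not kept:
        return 0.0  # nothing to learn from in this batch
    # Minimizing -r * logp lowers logp when r < 0 and raises it when r > 0
    return sum(-r * lp for lp, r in kept) / len(kept)
```

With `mode="nsr"`, gradient descent on this loss only lowers the probability of incorrect trajectories and never directly raises the probability of any sampled one.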

Experimental Setup

Experiments use Qwen3‑4B and Qwen3‑8B as base models, training on the MATH dataset and evaluating on six benchmarks (MATH500, AMC23, AIME24, AIME25, Minerva, OlympiadBench). The authors also assess Pass@K performance and out‑of‑distribution (OOD) generalization on GPQA‑Diamond, MMLU‑Pro, BBH, and HumanEval.

Results

DSRL consistently outperforms strong baselines (GRPO, PPO, Reinforce++, RLOO, Dr.GRPO, DAPO) on Avg@32 across all benchmarks. For example, Qwen3‑4B gains 4.69 points on AIME24 and 2.50 points on AIME25.

Qwen3‑8B with DSRL achieves the highest average score (58.47) among all methods.

Pass@K curves show DSRL dominates GRPO across the entire sampling budget, indicating better solution diversity.

On OOD benchmarks, DSRL yields notable improvements (e.g., +3.79 on GPQA‑Diamond, +5.37 on MMLU‑Pro for Qwen3‑4B; +2.44 on HumanEval for Qwen3‑8B).
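For readers unfamiliar with the metrics reported above, here is a minimal sketch of how Avg@n and Pass@K are commonly computed from n sampled answers per problem. The summary does not state which estimator the authors use. The Pass@K function below is the widely used unbiased estimator 1 − C(n−c, k)/C(n, k), and both function names are illustrative.

```python
from math import comb
from typing import Sequence


def avg_at_n(correct_flags: Sequence[bool]) -> float:
    """Mean accuracy over the n samples drawn for one problem (Avg@n)."""
    if not correct_flags:
        raise ValueError("need at least one sample")
    return sum(1 for flag in correct_flags if flag) / len(correct_flags)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Benchmark‑level scores are then the mean of these per‑problem values. Avg@n rewards consistency on every sample, while Pass@K at large k rewards covering a problem at least once, which is why the two can move differently.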

Behavioral Analysis

DSRL accelerates the emergence of four reasoning behaviors (Subgoal Setting, Enumeration, Verification, Backtracking), starting from the NSR‑PreRL warm‑up. The number of fully solved problems rises sharply while the number of fully unsolved problems drops, confirming systematic elimination of failure patterns.
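The solved/unsolved counts can be tracked with a small bucketing helper. The definitions used here are assumptions, since the summary does not define them: a problem is "fully solved" when all sampled answers are correct and "fully unsolved" when none are. The function name is hypothetical.

```python
from typing import Dict, Sequence


def solve_buckets(correct_counts: Sequence[int], n_samples: int) -> Dict[str, int]:
    """Count problems by how many of their n_samples answers were correct.

    correct_counts[i] is the number of correct samples for problem i.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if any(c < 0 or c > n_samples for c in correct_counts):
        raise ValueError("each count must lie in [0, n_samples]")
    fully_solved = sum(1 for c in correct_counts if c == n_samples)
    fully_unsolved = sum(1 for c in correct_counts if c == 0)
    partial = len(correct_counts) - fully_solved - fully_unsolved
    return {
        "fully_solved": fully_solved,
        "partially_solved": partial,
        "fully_unsolved": fully_unsolved,
    }
```

Logging these three counts at each checkpoint gives the kind of curve the paragraph above describes: the first bucket growing while the last one shrinks.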

Ablation of Warm‑up Steps

Warm‑up steps between 10 and 25 achieve the best trade‑off; fewer steps under‑activate the mechanism, while excessive steps cause over‑exploration. NSR‑PreRL warm‑up (57.54) outperforms NSR‑RL warm‑up (54.38) by 3.16 points, directly validating the benefit of removing the input condition.

Conclusion

The study demonstrates that reinforcement learning for large‑model reasoning does not have to be conditioned on the problem context. Directly optimizing the reasoning trajectory (PreRL) aligns well with standard RL gradients, and when combined with a standard RL fine‑tuning phase (DSRL), it yields superior accuracy, training efficiency, richer reasoning behaviors, and stronger OOD generalization.

