From P(y|x) to P(y): Reinforcement Learning in Pre‑train Space Unlocks Endogenous Reasoning
The paper introduces PreRL, which removes the input condition to directly optimize the reasoning trajectory (P(y)) of large language models, and combines it with standard RL in Dual Space RL (DSRL), achieving consistent gains on math and out‑of‑distribution benchmarks, faster training, and richer reasoning behaviors.
Core Idea: Directly Optimize Reasoning Trajectory
Existing RL methods for large‑model reasoning improve the conditional distribution P(y|x): every update scores a response y under a specific problem context x. The authors ask whether conditioning on the input is necessary at all, given that massive pre‑training has already internalized reasoning knowledge.
PreRL – Optimizing the Marginal Distribution
PreRL removes the input condition during updates and optimizes the marginal distribution P(y) directly, applying reward signals to the generated reasoning trajectory itself rather than to a particular problem. This approach aims to reshape the internal reasoning structures learned during pre‑training.
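To make the contrast concrete, the two update directions can be written side by side. This is a sketch in REINFORCE-style notation, not the paper's exact formulation: the advantage symbol A(x, y), the policy symbol \pi_\theta, and the assumption that trajectories are still sampled with the prompt while only the update drops it are all inferred from the description above.

```latex
% Standard RL: the log-likelihood being reinforced is conditioned on the problem x.
\nabla_\theta J_{\mathrm{RL}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \left[ A(x, y) \sum_{t=1}^{|y|} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \right]

% PreRL (assumed form): same samples and rewards, but the log-likelihood omits x.
\nabla_\theta J_{\mathrm{PreRL}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \left[ A(x, y) \sum_{t=1}^{|y|} \nabla_\theta \log \pi_\theta(y_t \mid y_{<t}) \right]
```

The only difference between the two lines is the conditioning set inside the logarithm, which is what "optimizing P(y) instead of P(y|x)" amounts to at the token level under these assumptions.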
Dual Space RL (DSRL)
DSRL combines a PreRL warm‑up phase with a standard RL fine‑tuning phase using a Policy Reincarnation strategy. The warm‑up phase (NSR‑PreRL) updates only on negative‑reward samples, pruning erroneous reasoning paths and stimulating the model’s innate reasoning ability. The subsequent RL phase (GRPO) refines the policy under normal conditioning.
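The two-phase schedule can be sketched as a small dispatch function. This is a minimal illustration, not the authors' code: the names `UpdateSpec` and `dsrl_update_spec` are hypothetical, the default of 20 warm-up steps is just one value inside the range the ablation reports as working well, and "Policy Reincarnation" is assumed here to mean that the RL phase simply continues from the warm-up weights, which a single step counter models implicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateSpec:
    phase: str            # "nsr_prerl" during warm-up, "grpo" afterwards
    context: str          # text the response is scored under during the update
    negatives_only: bool  # whether only negative-reward samples are used


def dsrl_update_spec(step: int, prompt: str, warmup_steps: int = 20) -> UpdateSpec:
    """Describe how a training step should build its update (hypothetical helper)."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        # Warm-up: drop the prompt and learn only from incorrect trajectories.
        return UpdateSpec(phase="nsr_prerl", context="", negatives_only=True)
    # Main phase: standard conditional RL with the prompt restored.
    return UpdateSpec(phase="grpo", context=prompt, negatives_only=False)
```

A training loop would call `dsrl_update_spec(step, prompt)` once per step and score each sampled response under `context` before computing the loss.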
Gradient Alignment Verification
The authors verify that gradients from the marginal‑distribution objective align closely with those from standard RL on high‑probability, deterministic tokens, diverging only on early or highly uncertain tokens. This theoretical result supports the marginal objective as an effective proxy for standard RL.
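One common way to quantify this kind of alignment is the cosine similarity between the two gradient vectors at a token position. The summary does not say which metric the authors use, so treat the following as an assumption. The sketch also uses the gradient of the token log-probability with respect to the logits as a toy stand-in for the full parameter gradient, and the logit values are made up.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_logprob_wrt_logits(logits, token):
    """Gradient of log softmax(logits)[token] w.r.t. the logits: onehot(token) - p."""
    probs = softmax(logits)
    return [(1.0 if i == token else 0.0) - p for i, p in enumerate(probs)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # undefined direction; report no alignment
    return dot / (norm_u * norm_v)


# Hypothetical logits for one token position, with and without the prompt.
cond_logits = [5.0, 0.0, 0.0]  # model sees the problem x
marg_logits = [4.0, 0.0, 0.0]  # model sees only the trajectory prefix
alignment = cosine(
    grad_logprob_wrt_logits(cond_logits, token=0),
    grad_logprob_wrt_logits(marg_logits, token=0),
)
```

Averaging such a score separately over confident and uncertain token positions would give the kind of comparison described above.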
Positive vs. Negative Sample Reinforcement
Positive‑sample reinforcement (PSR) fails to learn from online trajectories in the pre‑train space, leading to performance degradation.
Negative‑sample reinforcement (NSR) dramatically improves reasoning: after only 20 steps, transition thoughts increase 14.89× and reflection thoughts 6.54×, and the model outperforms the GRPO baseline while using one third as many training steps.
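The difference between the two variants comes down to which samples enter the loss and with which sign. The sketch below is an illustration under stated assumptions rather than the paper's implementation: rewards are taken to be binary (1 for correct, 0 for incorrect), each trajectory is represented by its summed token log-probability, and the function name `reinforcement_loss` is hypothetical. In the pre-train space those log-probabilities would be computed without the prompt.

```python
def reinforcement_loss(seq_logprobs, rewards, mode):
    """Surrogate loss to minimize over a batch of sampled trajectories.

    seq_logprobs[i]: summed token log-probability of trajectory i
    rewards[i]:      1 for a correct answer, 0 for an incorrect one (assumed binary)
    mode:            "psr" (positive samples only) or "nsr" (negative samples only)
    """
    if mode == "psr":
        # Minimizing -log p raises the likelihood of correct trajectories.
        terms = [-lp for lp, r in zip(seq_logprobs, rewards) if r > 0]
    elif mode == "nsr":
        # Minimizing +log p lowers the likelihood of incorrect trajectories.
        terms = [lp for lp, r in zip(seq_logprobs, rewards) if r <= 0]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # A batch with no eligible samples contributes no gradient.
    return sum(terms) / len(terms) if terms else 0.0
```

For example, with log-probabilities `[-2.0, -4.0, -6.0]` and rewards `[1, 0, 0]`, PSR uses only the first trajectory and NSR only the last two.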
Experimental Setup
Experiments use Qwen3‑4B and Qwen3‑8B as base models, training on the MATH dataset and evaluating on six benchmarks (MATH500, AMC23, AIME24, AIME25, Minerva, OlympiadBench). The authors also assess Pass@K performance and out‑of‑distribution (OOD) generalization on GPQA‑Diamond, MMLU‑Pro, BBH, and HumanEval.
Results
DSRL consistently outperforms strong baselines (GRPO, PPO, Reinforce++, RLOO, Dr.GRPO, DAPO) on Avg@32 across all benchmarks. For example, Qwen3‑4B gains 4.69 points on AIME24 and 2.50 points on AIME25.
Qwen3‑8B with DSRL achieves the highest average score (58.47) among all methods.
Pass@K curves show DSRL dominates GRPO across the entire sampling budget, indicating better solution diversity.
On OOD benchmarks, DSRL yields notable improvements (e.g., +3.79 on GPQA‑Diamond, +5.37 on MMLU‑Pro for Qwen3‑4B; +2.44 on HumanEval for Qwen3‑8B).
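For readers unfamiliar with the two metrics used above, here is a minimal sketch of how they are typically computed. The summary does not state which Pass@K estimator the authors use; the version below is the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c the number that are correct. The function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the chance that at least one of k samples is correct.

    n: samples drawn for the problem, c: how many were correct, k: budget (k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def avg_at_k(correct_flags) -> float:
    """Mean accuracy over the k samples of one problem (Avg@32 uses k = 32)."""
    return sum(correct_flags) / len(correct_flags)
```

Benchmark-level scores are then the mean of these per-problem values. Avg@K rewards being right on most samples, while Pass@K at large K rewards having at least one correct path, which is why it is often read as a proxy for solution diversity.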
Behavioral Analysis
DSRL accelerates the emergence of four reasoning behaviors—Subgoal Setting, Enumeration, Verification, Backtracking—starting from the NSR‑PreRL warm‑up. The number of fully solved problems rises sharply while fully unsolved problems drop, confirming systematic elimination of failure patterns.
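The solved/unsolved counts can be tallied from per-sample correctness flags. The summary does not define the terms, so this sketch assumes that a problem is "fully solved" when every sampled response is correct and "fully unsolved" when none is; the function name and the dictionary keys are hypothetical.

```python
def solve_buckets(per_problem_flags):
    """Count problems by how many of their sampled responses were correct.

    per_problem_flags maps a problem id to a list of booleans, one per sample.
    """
    counts = {"fully_solved": 0, "partially_solved": 0, "fully_unsolved": 0}
    for flags in per_problem_flags.values():
        if not flags:
            continue  # no samples for this problem, nothing to classify
        if all(flags):
            counts["fully_solved"] += 1
        elif not any(flags):
            counts["fully_unsolved"] += 1
        else:
            counts["partially_solved"] += 1
    return counts
```

Tracking these three counts across training checkpoints gives the kind of curve described above: the first bucket growing while the last one shrinks.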
Ablation of Warm‑up Steps
Warm‑up steps between 10 and 25 achieve the best trade‑off; fewer steps under‑activate the mechanism, while excessive steps cause over‑exploration. NSR‑PreRL warm‑up (57.54) outperforms NSR‑RL warm‑up (54.38) by 3.16 points, directly validating the benefit of removing the input condition.
Conclusion
The study demonstrates that reinforcement learning for large‑model reasoning does not have to be conditioned on the problem context. Directly optimizing the reasoning trajectory (PreRL) aligns well with standard RL gradients, and when combined with a standard RL fine‑tuning phase (DSRL), it yields superior accuracy, training efficiency, richer reasoning behaviors, and stronger OOD generalization.
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
