From P(y|x) to P(y): Reinforcement Learning in Pre‑train Space Unlocks Endogenous Reasoning

The paper introduces PreRL, which drops the input condition and directly optimizes the distribution of reasoning trajectories, P(y), rather than the conditional P(y|x), in large language models. Combined with standard RL as Dual Space RL (DSRL), it reports consistent gains on math and out-of-distribution benchmarks, faster training, and richer reasoning behaviors.

DSRL · Math Benchmarks · PreRL
