Tagged articles

Math Benchmarks

2 articles · Page 1 of 1
Machine Heart
Jun 17, 2026 · Artificial Intelligence

TNT Prevents Reward Hacking in Hybrid Reasoning Models by Dynamic Token Limits

The paper introduces Thinking-Based Non-Thinking (TNT), a method that dynamically caps the token length of non‑thinking responses using the answer length observed in thinking mode. It reduces the probability of reward hacking to below 10%, cuts token usage by over 46%, and improves accuracy on five math benchmarks.

Dynamic Token Limit · Hybrid Reasoning · LLM
10 min read
Machine Learning Algorithms & Natural Language Processing
May 19, 2026 · Artificial Intelligence

From P(y|x) to P(y): Reinforcement Learning in Pre‑train Space Unlocks Endogenous Reasoning

The paper introduces PreRL, which removes the input condition and directly optimizes the reasoning trajectory distribution P(y) of large language models. Combined with standard RL as Dual Space RL (DSRL), it achieves consistent gains on math and out‑of‑distribution benchmarks, faster training, and richer reasoning behaviors.

DSRL · Math Benchmarks · PreRL
11 min read