Machine Heart
Jun 17, 2026 · Artificial Intelligence
TNT Prevents Reward Hacking in Hybrid Reasoning Models Using Dynamic Token Limits
The paper introduces Thinking-Based Non-Thinking (TNT), a method that dynamically caps the token length of non‑thinking responses using the answer length observed in thinking mode. It reduces reward‑hacking probability to below 10%, cuts token usage by over 46%, and improves accuracy on five math benchmarks.
Dynamic Token Limit · Hybrid Reasoning · LLM
