TNT Prevents Reward Hacking in Hybrid Reasoning Models by Dynamic Token Limits
The paper introduces Thinking-Based Non-Thinking (TNT), a method that dynamically caps non‑thinking token length using answer length from the thinking mode, reducing reward‑hacking probability below 10% while cutting token usage by over 46% and improving accuracy on five math benchmarks.
