Tagged articles
1 articles
Page 1 of 1
Machine Heart
Machine Heart
Jun 17, 2026 · Artificial Intelligence

TNT Prevents Reward Hacking in Hybrid Reasoning Models by Dynamic Token Limits

The paper introduces Thinking-Based Non-Thinking (TNT), a method that dynamically caps non‑thinking token length using answer length from the thinking mode, reducing reward‑hacking probability below 10% while cutting token usage by over 46% and improving accuracy on five math benchmarks.

Dynamic Token LimitHybrid ReasoningLLM
0 likes · 10 min read
TNT Prevents Reward Hacking in Hybrid Reasoning Models by Dynamic Token Limits