TNT: Dynamic Token Limits Slash Reward Hacking in Mixed Inference Models Below 10%

The paper introduces Thinking‑Based Non‑Thinking (TNT), a reinforcement‑learning approach that sets a per‑question dynamic token ceiling for non‑thinking mode using the answer length from thinking mode, cutting reward‑hacking incidence to under 10% while boosting accuracy and cutting token usage by nearly half across several math benchmarks.

ACL 2026NLPTNT

0 likes · 10 min read

TNT: Dynamic Token Limits Slash Reward Hacking in Mixed Inference Models Below 10%

Machine Heart

Jun 17, 2026 · Artificial Intelligence

TNT Prevents Reward Hacking in Hybrid Reasoning Models by Dynamic Token Limits

The paper introduces Thinking-Based Non-Thinking (TNT), a method that dynamically caps non‑thinking token length using answer length from the thinking mode, reducing reward‑hacking probability below 10% while cutting token usage by over 46% and improving accuracy on five math benchmarks.

Dynamic Token LimitHybrid ReasoningLLM

0 likes · 10 min read

TNT Prevents Reward Hacking in Hybrid Reasoning Models by Dynamic Token Limits