Tagged articles
2 articles
Machine Learning Algorithms & Natural Language Processing
Jun 18, 2026 · Artificial Intelligence

TNT: Dynamic Token Limits Cut Reward Hacking in Hybrid Reasoning Models to Below 10%

The paper introduces Thinking-Based Non-Thinking (TNT), a reinforcement-learning approach that sets a per-question dynamic token ceiling for non-thinking mode, derived from the length of the answer produced in thinking mode. It brings the incidence of reward hacking below 10% while improving accuracy and reducing token usage by nearly half across several math benchmarks.

ACL 2026 · NLP · TNT
10 min read
Machine Heart
Jun 17, 2026 · Artificial Intelligence

TNT Prevents Reward Hacking in Hybrid Reasoning Models with Dynamic Token Limits

The paper introduces Thinking-Based Non-Thinking (TNT), a method that dynamically caps the token length of non-thinking responses using the answer length from thinking mode. It reduces the probability of reward hacking to below 10% while cutting token usage by over 46% and improving accuracy on five math benchmarks.

Dynamic Token Limit · Hybrid Reasoning · LLM
10 min read