Machine Learning Algorithms & Natural Language Processing
Jun 18, 2026 · Artificial Intelligence
TNT: Dynamic Token Limits Slash Reward Hacking in Mixed Inference Models Below 10%
The paper introduces Thinking‑Based Non‑Thinking (TNT), a reinforcement‑learning approach that sets a per‑question dynamic token ceiling for non‑thinking mode using the answer length from thinking mode, cutting reward‑hacking incidence to under 10% while boosting accuracy and cutting token usage by nearly half across several math benchmarks.
ACL 2026NLPTNT
0 likes · 10 min read
