Why LLMs Overthink: ICLR 2026 Study Reveals the Key Bottleneck in Inference Efficiency
This ICLR 2026 paper identifies reasoning miscalibration, overthinking easy steps while underthinking critical ones, as the root cause of runaway LLM inference costs. It proposes the Budget Allocation Model (BAM) and a training-free Plan-and-Budget framework that distributes compute according to step difficulty, achieving up to 70% higher accuracy, 39% lower token usage, and a 193.8% improvement on the new E³ efficiency metric.
Problem
Large language models (LLMs) often exhibit two pathological inference behaviors: overthinking, where they generate long, divergent reasoning chains on simple steps, and underthinking, where they rush through critical steps and produce incorrect answers. The common mitigation of imposing a uniform token limit reduces compute but harms accuracy on genuinely hard sub-problems because it ignores the varying difficulty of reasoning stages.
Reasoning Miscalibration
The authors systematically analyzed mainstream LLMs—including DeepSeek‑R1, QwQ, and OpenAI o4‑mini—across mathematical reasoning, instruction following, and agentic planning tasks. They observed a systematic mismatch between the amount of computation allocated and the true difficulty of each reasoning phase. Early reasoning steps show high epistemic uncertainty (e.g., understanding the problem, choosing a solution path) and benefit from more compute, whereas later steps become increasingly certain; additional tokens yield rapidly diminishing returns and can even introduce new errors. This imbalance—"reasoning miscalibration"—produces both overthinking on low‑impact steps and underthinking on high‑impact steps.
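The paper's measurement pipeline is not reproduced in this digest, but the core observation is easy to illustrate: a standard proxy for a model's epistemic uncertainty at a given step is the entropy of its next-token distribution, averaged over the step's tokens. The sketch below uses that proxy; the function name and input layout are illustrative, not taken from the paper.

```python
import math

def mean_step_entropy(step_logprobs: list[list[dict[str, float]]]) -> list[float]:
    """For each reasoning step, average the Shannon entropy of the model's
    next-token distributions. Entropy is a common proxy for epistemic
    uncertainty; it is not necessarily the paper's exact measure."""
    entropies = []
    for step in step_logprobs:            # one list of distributions per step
        per_token = [
            -sum(p * math.log(p) for p in dist.values() if p > 0)
            for dist in step              # dist maps candidate token -> probability
        ]
        entropies.append(sum(per_token) / max(len(per_token), 1))
    return entropies

# Toy trace: distributions sharpen as the chain of thought converges.
steps = [
    [{"a": 0.40, "b": 0.35, "c": 0.25}],  # early step: near-uniform, high entropy
    [{"a": 0.70, "b": 0.20, "c": 0.10}],
    [{"a": 0.98, "b": 0.01, "c": 0.01}],  # late step: peaked, low entropy
]
print(mean_step_entropy(steps))  # monotonically decreasing on this toy trace
```

A decreasing entropy profile like this is exactly the pattern the authors describe: uncertainty concentrates in the early steps, so that is where extra tokens pay off.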
Budget Allocation Model (BAM)
BAM treats inference as a sequence of sub-questions. Let b_{ij} denote the number of tokens allocated to sub-question i at step j. More tokens reduce epistemic uncertainty, but the reduction follows a law of diminishing marginal returns: the first few tokens are high-value, while later tokens become increasingly low-value.
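The paper's exact functional form for uncertainty reduction is not restated here; as a hedged illustration, an exponential-decay model makes the diminishing-returns claim concrete. The form u(b) = u0·exp(-k·b) and the decay rate k are assumptions for this sketch, not BAM's actual parameterization.

```python
import math

def remaining_uncertainty(u0: float, b: int, k: float = 0.05) -> float:
    """Uncertainty left after spending b tokens, under an assumed
    exponential-decay model u(b) = u0 * exp(-k * b). BAM relies on the
    concave (diminishing-returns) shape, not on this exact form."""
    return u0 * math.exp(-k * b)

def marginal_value(u0: float, b: int, k: float = 0.05) -> float:
    """Uncertainty removed by one additional token at budget b; it shrinks
    as b grows, which is the law of diminishing marginal returns."""
    return remaining_uncertainty(u0, b, k) - remaining_uncertainty(u0, b + 1, k)

for b in (0, 10, 50, 100):
    print(b, round(marginal_value(u0=1.0, b=b), 4))
# Marginal value falls from ~0.0488 at b=0 to ~0.0003 at b=100:
# the first tokens buy the most uncertainty reduction.
```

Under any concave reduction curve of this kind, the allocation that maximizes total uncertainty reduction equalizes marginal value across sub-questions, which is why high-uncertainty steps deserve larger budgets.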
Plan‑and‑Budget Framework
Plan: before inference, the original query is decomposed into a structured list of sub-questions, clarifying the role of each step and avoiding blind exploration.
Budget: tokens are allocated to sub-questions using a decay-based strategy that gives more budget to early, high-uncertainty steps and gradually less to later, low-uncertainty steps. This follows the allocation principle derived from BAM: spend more compute where uncertainty is high and can be effectively reduced (a minimal sketch of such a schedule follows below).
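The sketch below assumes geometric weights; the paper's actual schedule and decay rate may differ.

```python
def decay_budgets(total_tokens: int, n_steps: int, decay: float = 0.7) -> list[int]:
    """Split a total token budget across sub-questions with geometrically
    decaying weights: early, high-uncertainty steps get more; later,
    low-uncertainty steps get less. The decay rate is illustrative."""
    weights = [decay ** i for i in range(n_steps)]
    scale = total_tokens / sum(weights)
    budgets = [int(w * scale) for w in weights]
    budgets[0] += total_tokens - sum(budgets)  # hand rounding slack to step 1
    return budgets

print(decay_budgets(total_tokens=1000, n_steps=5))
# -> [363, 252, 176, 123, 86]: front-loaded, as the BAM analysis prescribes
```

Because the schedule is computed before generation, the framework stays training-free: it only changes how an existing model's decoding budget is portioned out.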
Experimental Evaluation
The method was evaluated on the TravelPlanner benchmark (simple, medium, hard) and on three task families (mathematical reasoning, instruction following, agentic planning) across multiple model scales. Baselines included a global-budget method, which caps the whole response with a single token limit, and a uniform-token method, which gives every sub-question the same per-step limit.
Higher pass rates than both baselines across all difficulty levels.
Average token usage was lower despite the higher accuracy, confirming that the method “spends less to get more”.
The new Efficiency-aware Effectiveness Score (E³) improved by up to 193.8% (a hedged sketch of such a metric follows this list).
Accuracy gains reached +70% while token consumption dropped by 39%.
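E³'s exact definition should be checked against the paper; as a hedged illustration, one natural way to build an efficiency-aware effectiveness score is to reward accuracy superlinearly while penalizing token cost. The form below, E³ = accuracy² / tokens, and the sample numbers are assumptions for this sketch, not the paper's reported figures.

```python
def e3_score(accuracy: float, avg_tokens: float) -> float:
    """Efficiency-aware effectiveness, assuming the form
    E^3 = accuracy^2 / tokens: squaring accuracy rewards being right
    more than being brief. Verify the exact definition in the paper."""
    return accuracy ** 2 / avg_tokens

baseline = e3_score(accuracy=0.50, avg_tokens=4000)  # hypothetical numbers
budgeted = e3_score(accuracy=0.65, avg_tokens=2440)  # higher acc, 39% fewer tokens
print(f"relative E^3 gain: {budgeted / baseline - 1:.1%}")  # ~177.0%
```

The point of such a metric is that neither raw accuracy nor raw token count alone captures the trade-off the paper targets; large E³ gains require improving one axis without sacrificing the other.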
These results demonstrate that inference efficiency is determined not by the total amount of compute but by its intelligent allocation.
Resources
Paper: Plan and Budget: Effective and Efficient Test‑Time Scaling on Reasoning Large Language Models (arXiv:2505.16122)
Code: https://github.com/junhongmit/P-and-B