The Hidden Token Bill of AI Coding Agents: Why More Tokens Don’t Guarantee Better Results
An analysis of eight frontier coding agents shows that token consumption in agentic coding tasks is highly variable and often orders of magnitude higher than in simple code reasoning. Spending more tokens does not reliably improve accuracy, efficiency differs significantly across models, and costs are hard to predict in advance.
Agentic Coding Cost Overview
Recent coding agents such as Claude Code, Codex, and Cursor have rapidly pushed accuracy on the SWE‑bench Verified benchmark above 78%, but they consume large numbers of tokens to get there. Users frequently complain about verbose outputs and rapid credit depletion.
Key Problems Identified
Opacity: token‑spending patterns differ across models and are not clearly disclosed.
No guarantee: tasks must be paid for regardless of success.
Unpredictability: human estimates of difficulty often do not match actual token usage.
Experimental Setup
Researchers from the University of Michigan and Stanford used the open‑source OpenHands framework to trace the token usage of eight frontier models on 500 SWE‑bench Verified problems. The models are OpenAI GPT‑5 and GPT‑5.2, Anthropic Claude Sonnet 3.7, 4, and 4.5, Google Gemini‑3‑Pro Preview, Moonshot AI Kimi‑K2, and Alibaba Qwen3‑Coder‑480B.
Cost Findings
Agentic coding tasks have an average input‑to‑output token ratio of 154:1, so input tokens dominate the bill, and both token and monetary costs run far higher than for code reasoning or code‑Q&A tasks. The most expensive tasks consume about 7 million more tokens than the cheapest, and the standard deviation of token usage grows with cost. Even for the same task, the costliest run can be roughly twice as expensive as the cheapest run.
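To make the effect of a 154:1 ratio concrete, here is a minimal sketch of how the bill splits between input and output tokens. The per‑million‑token prices are hypothetical placeholders, not figures from the study, and the sketch ignores prompt‑caching discounts, which would lower the input share in practice.

```python
# Hypothetical per-million-token prices in USD (assumptions, not from the study).
INPUT_PRICE_PER_M = 1.25
OUTPUT_PRICE_PER_M = 10.00


def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one agent run under the assumed prices."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


def input_share_of_cost(ratio: float) -> float:
    """Fraction of the bill due to input tokens for a given input:output ratio."""
    input_cost = ratio * INPUT_PRICE_PER_M
    return input_cost / (input_cost + OUTPUT_PRICE_PER_M)


if __name__ == "__main__":
    # A made-up run with 1.54M input tokens and 10k output tokens (154:1).
    print(f"cost: ${run_cost(1_540_000, 10_000):.3f}")
    print(f"input share at 154:1: {input_share_of_cost(154):.1%}")
```

Under these assumed prices, output tokens cost eight times as much per token, yet input tokens still account for about 95% of the bill.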
Token vs. Accuracy
Higher token consumption does not guarantee higher accuracy. Grouping tasks by average token usage shows that tasks with more tokens often have lower accuracy. Across four cost tiers for the same task, the highest accuracy appears at moderate cost, while both low and very high costs yield lower success rates.
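The tier comparison can be sketched as follows. This is an illustrative reconstruction, not the authors' analysis code: the function name, the equal‑size tier split, and the sample records are all assumptions.

```python
from statistics import mean


def accuracy_by_cost_tier(records, n_tiers=4):
    """Split (tokens, solved) records into equal-size cost tiers.

    Returns the success rate of each tier, cheapest tier first.
    """
    ordered = sorted(records, key=lambda r: r[0])
    rates = []
    for i in range(n_tiers):
        lo = i * len(ordered) // n_tiers
        hi = (i + 1) * len(ordered) // n_tiers
        chunk = ordered[lo:hi]
        if chunk:
            rates.append(mean(1.0 if solved else 0.0 for _, solved in chunk))
        else:
            rates.append(float("nan"))  # fewer records than tiers
    return rates


if __name__ == "__main__":
    # Made-up runs: (total tokens, whether the patch passed the tests).
    runs = [(1, False), (2, True), (3, True), (4, True),
            (5, True), (6, False), (7, False), (8, False)]
    print(accuracy_by_cost_tier(runs))
```

With the made‑up data above, success peaks in the second tier and falls off at both ends, which is the shape the study describes.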
Model‑Level Efficiency
Significant efficiency gaps exist between models. GPT‑5 and GPT‑5.2 achieve good accuracy with relatively low token cost, whereas Kimi‑K2 spends about 1.5 million more tokens than GPT‑5 for the same 500 tasks without a corresponding accuracy gain. These differences are systematic and stem from each model’s behavior rather than task difficulty.
Human vs. Agent Cost Prediction
Human expert difficulty ratings (<15 min, 15 min‑1 hr, >1 hr) correlate weakly with actual token consumption (Kendall τ = 0.32). Approximately 6.7 % of tasks labeled “easy” are more expensive than the average “hard” task, and 11.1 % of “hard” tasks are cheaper than the average “easy” task.
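For readers unfamiliar with Kendall's τ, the sketch below computes it by comparing every pair of tasks. It uses the tie‑corrected τ‑b variant, which suits a three‑level difficulty label. Whether the study used τ‑b is an assumption, and the sample data is invented.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation, with correction for ties."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes nothing
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


if __name__ == "__main__":
    # Invented example: difficulty bucket (0 = <15 min, 1 = 15 min-1 hr,
    # 2 = >1 hr) against tokens consumed, in thousands.
    difficulty = [0, 0, 1, 1, 2]
    tokens_k = [10, 30, 20, 40, 50]
    print(round(kendall_tau_b(difficulty, tokens_k), 3))
```

A value of 1 means the two rankings agree on every pair and 0 means no association, so 0.32 indicates that difficulty labels order tasks by cost only loosely.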
Agent Self‑Prediction Attempts
Agents were prompted to estimate their own token cost before solving a problem. Correlation between predicted and actual token usage peaks at 0.39 (Claude Sonnet 4.5, output tokens) and generally ranges from 0.2 to 0.3 for other models. The prediction step itself typically costs less than half of the actual execution, though for earlier Claude models it sometimes costs more than the execution.
All models tend to underestimate token consumption, especially for input tokens.
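One simple way to quantify this bias is to look at the ratio of predicted to actual tokens per task. The sketch below is an assumption about how such a check might be done, not the study's metric, and the numbers are invented.

```python
from statistics import median


def underestimation_stats(predicted, actual):
    """Return (share of tasks underestimated, median predicted/actual ratio)."""
    ratios = [p / a for p, a in zip(predicted, actual) if a > 0]
    if not ratios:
        raise ValueError("need at least one task with nonzero actual usage")
    share_under = sum(r < 1 for r in ratios) / len(ratios)
    return share_under, median(ratios)


if __name__ == "__main__":
    # Invented predictions and actual input-token counts for four tasks.
    predicted = [100, 200, 300, 500]
    actual = [400, 400, 300, 250]
    share, mid_ratio = underestimation_stats(predicted, actual)
    print(f"underestimated: {share:.0%}, median ratio: {mid_ratio:.2f}")
```

A median ratio well below 1, computed separately for input and output tokens, would reflect the pattern reported here.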
Conclusions
The study reveals that token consumption in AI coding agents is dominated by input tokens, exhibits high randomness across tasks and runs, and varies widely between models. More tokens do not ensure higher correctness, and both human difficulty estimates and agent self‑predictions provide only coarse signals for pre‑task cost estimation. Future work should focus on more efficient agent designs and better cost‑prediction and management techniques.
