The Hidden Token Bill of AI Coding Agents: Why More Tokens Don’t Guarantee Better Results

An analysis of eight frontier models running as coding agents shows that token consumption in agentic coding tasks is highly variable and often orders of magnitude higher than in simple code reasoning. Spending more tokens does not reliably improve accuracy, efficiency differs significantly across models, and costs are hard to predict in advance.

Machine Heart

Agentic Coding Cost Overview

Recent coding agents such as Claude Code, Codex, and Cursor have rapidly pushed accuracy on the SWE‑bench Verified benchmark above 78%, but they consume a large number of tokens along the way. Users frequently complain about verbose outputs and credits that run out quickly.

Key Problems Identified

Opacity: token‑spending patterns differ across models and are not clearly disclosed.

No guarantee: tasks must be paid for regardless of success.

Unpredictability: human estimates of difficulty often do not match actual token usage.

Experimental Setup

Researchers from the University of Michigan and Stanford used the open‑source OpenHands framework to trace the token usage of eight frontier models on the 500 problems of SWE‑bench Verified. The models are OpenAI GPT‑5 and GPT‑5.2; Anthropic Claude Sonnet‑3.7, Sonnet‑4, and Sonnet‑4.5; Google Gemini‑3‑Pro Preview; Moonshot AI Kimi‑K2; and Alibaba Qwen3‑Coder‑480B.

Cost Findings

Agentic coding tasks have an average input‑to‑output token ratio of 154:1, so input tokens dominate the bill, and both token and monetary costs run far higher than for code reasoning or code‑Q&A tasks, often by orders of magnitude. The most expensive tasks consume about 7 million more tokens than the cheapest, and the standard deviation of token usage grows with cost. Even for the same task, the costliest run can be roughly twice as expensive as the cheapest run.
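To make the effect of the 154:1 ratio concrete, here is a minimal cost sketch. The per‑million‑token prices and the 10,000 output tokens are hypothetical placeholders, not figures from the study, and the function ignores provider‑specific discounts such as prompt caching.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_m, output_price_per_m):
    """Return (input_cost, output_cost, total_cost) in USD for one agent run.

    Prices are quoted per million tokens. Caching discounts are ignored,
    which is a simplifying assumption of this sketch.
    """
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost, output_cost, input_cost + output_cost


# Hypothetical run: 10,000 output tokens at the reported 154:1 ratio.
output_tokens = 10_000
input_tokens = 154 * output_tokens

# Hypothetical prices: $1 per million input tokens, $10 per million output tokens.
in_cost, out_cost, total = estimate_cost_usd(input_tokens, output_tokens, 1.0, 10.0)
input_share = in_cost / total  # fraction of the bill caused by input tokens

print(f"total ${total:.2f}, input share {input_share:.0%}")
```

Even with output tokens priced ten times higher than input tokens in this made‑up example, input tokens still account for most of the cost, which is why the ratio matters more than the length of the final answer.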

Token vs. Accuracy

Higher token consumption does not guarantee higher accuracy. Grouping tasks by average token usage shows that tasks with more tokens often have lower accuracy. Across four cost tiers for the same task, the highest accuracy appears at moderate cost, while both low and very high costs yield lower success rates.
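The tier comparison can be reproduced on any log of runs. The sketch below assumes each run is recorded as a (tokens, resolved) pair, sorts the runs by token usage, and splits them into equal‑sized tiers; the function name and the eight sample runs are invented for illustration and are not data from the study.

```python
def accuracy_by_cost_tier(runs, n_tiers=4):
    """Split runs into n_tiers groups by token usage and score each group.

    runs: list of (tokens, resolved) tuples, where resolved is a bool.
    Returns a list of (mean_tokens, accuracy) tuples, cheapest tier first.
    """
    ordered = sorted(runs, key=lambda run: run[0])
    n = len(ordered)
    tiers = []
    for i in range(n_tiers):
        chunk = ordered[i * n // n_tiers:(i + 1) * n // n_tiers]
        if not chunk:
            continue  # fewer runs than tiers
        mean_tokens = sum(tokens for tokens, _ in chunk) / len(chunk)
        accuracy = sum(1 for _, resolved in chunk if resolved) / len(chunk)
        tiers.append((mean_tokens, accuracy))
    return tiers


# Illustrative data only: eight made-up runs.
runs = [
    (200_000, False), (300_000, True),
    (500_000, True), (600_000, True),
    (900_000, True), (1_200_000, False),
    (2_500_000, False), (4_000_000, False),
]

tiers = accuracy_by_cost_tier(runs)
for mean_tokens, accuracy in tiers:
    print(f"{mean_tokens:>12,.0f} tokens  accuracy {accuracy:.0%}")
```

In this toy data the second tier scores best and the most expensive tier scores worst, mirroring the shape the article describes; real logs may look different.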

Model‑Level Efficiency

Significant efficiency gaps exist between models. GPT‑5 and GPT‑5.2 achieve good accuracy with relatively low token cost, whereas Kimi‑K2 spends about 1.5 million more tokens than GPT‑5 for the same 500 tasks without a corresponding accuracy gain. These differences are systematic and stem from each model’s behavior rather than task difficulty.
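One simple way to compare models on efficiency is tokens spent per resolved task. This metric is an assumption of this sketch rather than the study's stated measure, and the model names and totals below are placeholders, not measured results.

```python
def tokens_per_resolved_task(total_tokens, resolved):
    """Total tokens spent divided by the number of tasks actually resolved."""
    if resolved == 0:
        return float("inf")  # nothing solved: unbounded cost per success
    return total_tokens / resolved


# Hypothetical totals over the same benchmark for two unnamed models.
models = {
    "model_a": {"total_tokens": 400_000_000, "resolved": 350},
    "model_b": {"total_tokens": 900_000_000, "resolved": 330},
}

# Most efficient model first.
ranking = sorted(models, key=lambda name: tokens_per_resolved_task(**models[name]))

for name in ranking:
    cost = tokens_per_resolved_task(**models[name])
    print(f"{name}: {cost:,.0f} tokens per resolved task")
```

Because failed tasks are paid for as well, dividing by resolved tasks rather than attempted tasks folds the cost of failures into the metric.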

Human vs. Agent Cost Prediction

Human expert difficulty ratings (<15 min, 15 min‑1 hr, >1 hr) correlate only weakly with actual token consumption (Kendall τ = 0.32). Approximately 6.7% of tasks labeled “easy” are more expensive than the average “hard” task, and 11.1% of “hard” tasks are cheaper than the average “easy” task.
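Kendall's τ measures how often two rankings agree on the order of a pair of items. The sketch below implements the tie‑aware τ‑b variant in plain Python, which suits difficulty ratings because many tasks share the same bucket. Whether the study used τ‑b or another variant is not stated here, so that choice is an assumption, and the six sample tasks are invented.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b rank correlation between two equal-length sequences."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: excluded from every count
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1  # both rankings order the pair the same way
        else:
            discordant += 1  # the rankings disagree on this pair
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Illustrative data only: difficulty buckets encoded as 0 (<15 min),
# 1 (15 min-1 hr), 2 (>1 hr), paired with made-up token counts.
difficulty = [0, 0, 1, 1, 2, 2]
tokens = [300_000, 1_500_000, 400_000, 900_000, 800_000, 2_000_000]

tau = kendall_tau_b(difficulty, tokens)
print(f"tau-b = {tau:.2f}")
```

A value near 1 would mean the human ratings order tasks almost exactly as token usage does; values around 0.3, as reported in the study, mean the ratings carry only a coarse signal.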

Agent Self‑Prediction Attempts

Agents were prompted to estimate their own token cost before solving a problem. Correlation between predicted and actual token usage peaks at 0.39 (Claude Sonnet‑4.5 output tokens) and generally ranges from 0.2 to 0.3 for other models. Prediction costs are typically less than half of the actual execution cost, though early Claude models sometimes exceed the execution cost.

All models tend to underestimate token consumption, especially for input tokens.
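A quick way to check for this kind of bias in one's own logs is to compare each prediction with the usage that followed. The sketch below assumes predictions and actuals are stored as parallel lists; the function name and all numbers are hypothetical.

```python
from statistics import median


def underestimation_summary(predicted, actual):
    """Summarize how predicted token counts compare with actual usage.

    A ratio below 1 means the prediction was lower than what was spent.
    """
    ratios = [p / a for p, a in zip(predicted, actual) if a > 0]
    share_under = sum(1 for r in ratios if r < 1) / len(ratios)
    return {"median_ratio": median(ratios), "share_underestimated": share_under}


# Illustrative data only: four made-up tasks.
predicted = [200_000, 500_000, 1_000_000, 300_000]
actual = [800_000, 1_000_000, 900_000, 1_200_000]

summary = underestimation_summary(predicted, actual)
print(summary)
```

The median ratio is used instead of the mean so that a single wildly wrong prediction does not dominate the summary.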

Conclusions

The study reveals that token consumption in AI coding agents is dominated by input tokens, exhibits high randomness across tasks and runs, and varies widely between models. More tokens do not ensure higher correctness, and both human difficulty estimates and agent self‑predictions provide only coarse signals for pre‑task cost estimation. Future work should focus on more efficient agent designs and better cost‑prediction and management techniques.
