Beware the Cost Reversal in LLMs: Are Cheaper Models More Expensive?
A recent study of eight popular large language models across nine benchmark tasks shows that lower‑priced APIs often lead to higher actual expenses: inference token usage varies dramatically between models, making real cost highly unpredictable and exposing a hidden "Boots" phenomenon.
When choosing large language models (LLMs), practitioners usually compare API prices, assuming cheaper models are more economical. However, a joint study by researchers from Stanford, UC Berkeley, Carnegie Mellon, and Microsoft (arXiv:2603.23971) demonstrates a price‑reversal phenomenon: models with lower listed prices can generate substantially higher real‑world costs.
The authors evaluated eight widely used inference models—GPT‑5.2, GPT‑5 Mini, Gemini 3.1 Pro, Gemini 3 Flash, Claude Opus 4.6, Claude Haiku 4.5, Kimi K2.5, and MiniMax M2.5—on nine mainstream datasets such as AIME, Humanity’s Last Exam, and MMLU‑Pro. All models employ a per‑token pay‑as‑you‑go pricing scheme, charging separately for input and output tokens. By weighting each query’s input and output token counts by the model’s respective per‑token prices, the study computed the average cost per query for each model‑task pair.
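The per-query cost computation described above can be sketched as follows. The prices and token counts here are illustrative assumptions, not the paper's measured figures; the sketch assumes reasoning tokens are billed at the output rate, as is typical for pay-as-you-go reasoning APIs.

```python
def query_cost(input_tokens, output_tokens, reasoning_tokens,
               input_price_per_m, output_price_per_m):
    """Cost in USD for one query. Reasoning (inference) tokens are
    assumed to be billed at the output-token rate."""
    return (input_tokens * input_price_per_m
            + (output_tokens + reasoning_tokens) * output_price_per_m) / 1e6

# Example: 2k input, 500 output, 20k reasoning tokens at $1.25 / $10 per 1M
cost = query_cost(2_000, 500, 20_000, 1.25, 10.0)
print(f"${cost:.4f}")  # → $0.2075
```

Note how the 20k reasoning tokens dominate the total: the visible input and output contribute under a cent of the roughly 21-cent query.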
Figure 1 (not shown) reveals that the ranking by listed price diverges sharply from the ranking by actual cost. For example, Gemini 3 Flash’s API price is only 22 % of GPT‑5.2’s, yet on MMLUPro its actual cost is six times higher. Across 28 model pairs and nine tasks (252 pairwise comparisons), 21.8 % of comparisons exhibit a price reversal, meaning that relying solely on API pricing would mislead roughly one out of five cost decisions.
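A price reversal in the sense used above can be counted mechanically: for every model pair on every task, check whether the cheaper-listed model had the higher actual cost. The toy prices and costs below are made up for illustration; the paper's 21.8 % figure comes from its full 252-comparison dataset.

```python
from itertools import combinations

# Hypothetical listed prices and per-task actual costs for three models
listed_price = {"A": 1.0, "B": 3.0, "C": 10.0}
actual_cost = {"task1": {"A": 5.0, "B": 2.0, "C": 8.0},
               "task2": {"A": 1.0, "B": 4.0, "C": 6.0}}

reversals = total = 0
for task, costs in actual_cost.items():
    for m1, m2 in combinations(listed_price, 2):
        total += 1
        # Reversal: price difference and cost difference have opposite signs
        if (listed_price[m1] - listed_price[m2]) * (costs[m1] - costs[m2]) < 0:
            reversals += 1
print(reversals, total)  # → 1 6
```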
The authors explain this reversal with the “Boots Theory”: a cheap product may cost more over its lifetime. In LLMs, the hidden factor is the number of inference tokens. Input prompts and final outputs typically account for less than 10 % of total cost; the majority stems from the internal reasoning (inference) tokens. Different models consume vastly different numbers of inference tokens for the same task—for instance, Gemini 3 Flash uses almost ten times more inference tokens than GPT‑5.2 on the same AIME question, leading to a 2.5× higher actual cost.
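The sub‑10 % claim is easy to sanity-check with back-of-envelope numbers. The token counts and prices below are assumptions chosen to resemble a long reasoning trace, not the paper's data:

```python
# USD per 1M tokens (assumed prices) and per-query token counts
input_price, output_price = 1.25, 10.0
input_tok, final_tok, reasoning_tok = 1_000, 800, 30_000

# "Visible" spend: the prompt plus the final answer the user sees
visible = (input_tok * input_price + final_tok * output_price) / 1e6
# Hidden spend: internal reasoning tokens, billed at the output rate
reasoning = reasoning_tok * output_price / 1e6

share = visible / (visible + reasoning)
print(f"visible share of cost: {share:.1%}")  # → visible share of cost: 3.0%
```

With a 30k-token reasoning trace, the prompt and final answer account for only about 3 % of the bill, which is why per-token list price alone is such a weak predictor of actual cost.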
Removing inference‑token cost from the calculation restores a strong correlation between listed price and actual expense and reduces the number of rank reversals by about 70 % (Figure 5). This confirms that inference tokens are the primary driver of the cost reversal.
Further experiments show that inference‑token counts are highly variable even for a fixed model and task. Re‑running the same AIME task five times with GPT‑5.2 produced token counts ranging from 20 k to 50 k, a 2.5× spread, indicating intrinsic randomness and making precise cost prediction extremely difficult.
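The reported spread is simply the ratio of the largest to the smallest per-run token count. The intermediate run values below are invented to bracket the paper's 20k–50k range:

```python
# Reasoning-token counts across five re-runs of the same query (illustrative)
runs = [20_000, 27_500, 33_000, 41_000, 50_000]

spread = max(runs) / min(runs)
print(spread)  # → 2.5
```

A 2.5× spread on identical inputs means even a per-model, per-task cost estimate carries large error bars, not just the cross-model comparisons.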
In conclusion, the study uncovers a "Boots" phenomenon in the AI model domain: lower‑priced LLMs can be more expensive in practice, and actual costs are hard to predict due to volatile inference‑token usage. The authors released the underlying dataset (https://github.com/lchen001/pricing-reversal) and an interactive website (https://price-reversal.streamlit.app/) to support further research.