The Hidden Cost of Cheaper LLMs: Why Extra Reasoning Tokens Make Them More Expensive

A recent study by researchers from Stanford, UC Berkeley, Carnegie Mellon, and Microsoft reveals a price-reversal phenomenon: lower-priced large language models can incur higher actual costs because they consume far more reasoning tokens, making real cost hard to predict from list price alone.


Audit Framework

The study evaluated eight widely used inference models (GPT-5.2, GPT-5 Mini, Gemini 3.1 Pro, Gemini 3 Flash, Claude Opus 4.6, Claude Haiku 4.5, Kimi K2.5, and MiniMax M2.5) on nine benchmark datasets, including AIME, Humanity's Last Exam, and MMLU-Pro. All models use pay-per-token pricing with separate rates for input and output tokens: the cost of a query is input-token price × input-token count plus output-token price × output-token count. The paper reports results averaged across queries.
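The pricing scheme above can be sketched in a few lines. This is a minimal illustration of pay-per-token billing, not the paper's code; the prices and token counts below are hypothetical (most APIs quote prices per million tokens).

```python
def query_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one query: input and output tokens billed at separate rates.

    Prices are quoted in dollars per million tokens. Reasoning tokens are
    billed at the output rate, which is why they dominate total cost.
    """
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1e6

# Illustrative: 1,200 input tokens and 8,000 output (incl. reasoning) tokens
# at assumed rates of $1.25 and $10 per million tokens.
cost = query_cost(1_200, 8_000, 1.25, 10.0)
print(f"${cost:.4f}")  # prints $0.0815
```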

Price‑Reversal Phenomenon

Figure 1 (left) plots listed API prices against actual costs on real tasks; the right side ranks models by price and by cost separately. The analysis finds a systematic inversion: cheaper-listed models often incur higher actual costs. For example, GPT-5.2's API price is 4.5× that of Gemini 3 Flash, yet its real cost is only 81% of Gemini 3 Flash's. Conversely, Claude Opus 4.6 costs twice as much as Gemini 3.1 Pro by list price but is 35% cheaper in practice. Across all 28 model pairs and nine tasks (252 pairwise cost comparisons), 55 cases (21.8%) exhibit a price reversal, meaning cost judgments based solely on list price are wrong roughly one time in five.
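The pairwise reversal count works like this: for every model pair on every task, a reversal occurs when the model with the lower list price turns out to have the higher measured cost. A hedged sketch, with illustrative (not the study's) numbers:

```python
from itertools import combinations

def count_reversals(list_price, actual_cost_by_task):
    """Count price-reversal cases across all model pairs and tasks.

    list_price:          {model: listed API price}
    actual_cost_by_task: {task: {model: measured cost}}
    Returns (reversals, total_comparisons).
    """
    reversals = total = 0
    for task, costs in actual_cost_by_task.items():
        for a, b in combinations(list_price, 2):
            total += 1
            # Opposite signs: the cheaper-listed model is costlier in practice.
            if (list_price[a] - list_price[b]) * (costs[a] - costs[b]) < 0:
                reversals += 1
    return reversals, total

prices = {"A": 1.0, "B": 4.5}                    # hypothetical list prices
costs = {"task1": {"A": 0.9, "B": 0.7}}          # cheaper-listed A costs more
print(count_reversals(prices, costs))            # prints (1, 1)
```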

Reasoning Tokens as Hidden Driver

Figure 3 breaks model costs into input, reasoning, and output tokens. Reasoning tokens dominate, accounting for nearly 90% of total expense, while input and output together contribute less than 10%. The disparity is stark: Gemini 3 Flash generates almost ten times more reasoning tokens than GPT-5.2 for the same task.

Concrete Example

Figure 4 shows an AIME 2025 problem solved by both GPT-5.2 and Gemini 3 Flash. Both produce the same answer, but GPT-5.2 uses about 562 reasoning tokens, whereas Gemini 3 Flash consumes over 11,000, leading to a 2.5× higher actual cost for the latter.
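The arithmetic behind this example can be sketched directly. The token counts come from the article; the per-million-token output prices below are hypothetical, chosen only to illustrate how a roughly 20× token gap can outweigh a lower unit price:

```python
gpt_tokens, flash_tokens = 562, 11_000   # reasoning tokens (from the article)
gpt_price, flash_price = 10.0, 1.25      # $/M output tokens (assumed, 8:1 ratio)

gpt_cost = gpt_tokens * gpt_price / 1e6
flash_cost = flash_tokens * flash_price / 1e6

# The cheaper-listed model ends up more expensive in practice.
print(f"{flash_cost / gpt_cost:.2f}x")   # prints 2.45x
```

Under these assumed prices the ratio lands near the 2.5× figure reported in the paper; the exact multiple depends on the real per-token rates.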

Effect of Removing Reasoning‑Token Cost

When the reasoning‑token component is excluded, the correlation between listed price and actual cost improves dramatically across all nine tasks, and the number of pairwise rank reversals drops by roughly 70% (Figure 5). This confirms that reasoning tokens are the hidden driver of the price‑reversal effect.

Unpredictability of Reasoning Token Count

Repeated runs on the same AIME tasks reveal large variability in reasoning-token counts: for a single task, GPT-5.2's usage ranged from 20k to 50k tokens, a 2.5× spread (Figure 6). Because reasoning-token quantity is this stochastic, actual cost is intrinsically hard to forecast.
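Since output price is fixed, run-to-run variation in reasoning tokens translates one-for-one into cost variance. A minimal sketch, assuming a hypothetical $10/M output rate and token counts mirroring the reported 20k-50k spread:

```python
token_counts = [20_000, 28_000, 35_000, 41_000, 50_000]  # illustrative runs
price_per_m = 10.0                                        # $/M tokens (assumed)

costs = [t * price_per_m / 1e6 for t in token_counts]
spread = max(costs) / min(costs)
print(f"cost spread: {spread:.1f}x")  # prints cost spread: 2.5x
```

At a fixed rate, the cost spread equals the token-count spread, so per-query cost inherits all of the reasoning-token stochasticity.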

Open Resources

Data and analysis code are open‑sourced at https://github.com/lchen001/pricing-reversal and an interactive website is provided at https://price-reversal.streamlit.app/.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

LLM · model benchmarking · AI cost · price reversal · cost unpredictability · reasoning tokens
Written by

Machine Heart

Professional AI media and industry service platform
