Cursor’s Own Large‑Model Benchmark Shakes Up SWE‑bench Rankings
Although SWE‑bench scores for top coding models now differ by only a tenth of a point, Cursor’s newly released CursorBench reveals dramatic ranking changes, highlights three fundamental flaws in public benchmarks, and introduces token‑efficiency as a crucial evaluation dimension.
Why SWE‑bench Scores May Be Misleading
Cursor argues that scores on SWE‑bench have converged around 80 (e.g., 80.9 vs. 80.8), making it increasingly hard to tell which coding model is genuinely better.
Three Core Problems with Public Benchmarks
Task type is too narrow: SWE‑bench focuses on bug‑fixing, while real development involves a wide variety of requests.
Scoring method is flawed: only a single “standard answer” is accepted, penalising correct alternative solutions.
Data contamination: OpenAI halted SWE‑bench Verified after discovering that up to 60% of unsolved problems were contaminated, suggesting models may be memorising patches rather than solving them.
How CursorBench Works
CursorBench extracts tasks from real Cursor usage logs, traces code back to the originating AI request with a tool called Cursor Blame, and reconstructs the developer’s actual intent.
Tasks come from internal codebases and controlled sources, minimising training‑data leakage.
The benchmark is updated every few months to reflect evolving developer habits.
Task descriptions are deliberately brief and vague, mirroring how developers actually converse with AI.
Scale is larger: CursorBench‑3 doubles the number of lines of code and files compared with the initial version.
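Cursor has not published the task format itself; purely as an illustration of the ingredients described above (a brief real‑world prompt, the repo context it came from, and the edit the developer ultimately accepted), a task record might look roughly like this hypothetical sketch:

```python
# Hypothetical sketch only: CursorBench's real task schema is not public.
# The fields below merely illustrate the ingredients described above: a brief,
# conversational prompt, the codebase state it was issued against, and the
# change the developer ultimately accepted (recovered via request tracing).

from dataclasses import dataclass, field

@dataclass
class BenchTask:
    prompt: str                       # deliberately short, vague developer request
    repo_snapshot: str                # identifier for the codebase state at request time
    files_in_scope: list[str] = field(default_factory=list)
    accepted_diff: str = ""           # the edit the developer kept

example = BenchTask(
    prompt="make the retry logic back off exponentially",   # invented example
    repo_snapshot="internal-repo@<commit>",                  # placeholder, not a real ref
    files_in_scope=["src/http/client.py"],
    accepted_diff="<unified diff of the accepted change>",
)
print(example.prompt)
```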
Ranking Reshuffle: SWE‑bench vs. CursorBench
When the same models are evaluated on CursorBench, the gaps widen dramatically (scores shown as SWE‑bench → CursorBench):
Opus 4.5: 80.9 → 48.4.
Opus 4.6: 80.8 → 58.2 (rank 1 on CursorBench).
Gemini 3.1: 80.6 → 50.7.
GPT‑5.2: 80.0 → 56.5.
Sonnet 4.5: 77.2 → 37.9 (near the bottom).
Composer 1.5 (Cursor’s own model): 74.8 → 44.2.
These results show that high SWE‑bench scores do not necessarily translate to strong performance in real‑world coding scenarios.
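One way to quantify the reshuffle (our own back‑of‑the‑envelope calculation, not a figure from Cursor's report) is the Spearman rank correlation between the two score columns for the six models listed above:

```python
# Illustrative calculation, not from Cursor's report: Spearman rank correlation
# between SWE-bench and CursorBench scores for the six models quoted above.
# A value of 1.0 would mean both benchmarks rank the models identically.

scores = {
    # model:        (SWE-bench, CursorBench)
    "Opus 4.5":     (80.9, 48.4),
    "Opus 4.6":     (80.8, 58.2),
    "Gemini 3.1":   (80.6, 50.7),
    "GPT-5.2":      (80.0, 56.5),
    "Sonnet 4.5":   (77.2, 37.9),
    "Composer 1.5": (74.8, 44.2),
}

def ranks(values):
    """Rank from highest (1) to lowest; the data above has no ties."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

swe_ranks = ranks([s for s, _ in scores.values()])
cur_ranks = ranks([c for _, c in scores.values()])

n = len(scores)
d_sq = sum((a - b) ** 2 for a, b in zip(swe_ranks, cur_ranks))
rho = 1 - 6 * d_sq / (n * (n**2 - 1))
print(f"Spearman rho = {rho:.2f}")  # ~0.54, i.e. a substantial reordering
```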
Token‑Efficiency Frontier
CursorBench also reports median token usage per task. A scatter‑plot of score against median tokens reveals a frontier of models that achieve high scores while using relatively few tokens:
GPT‑5.4 (high) – ~63% score using ~16 k tokens (top of the frontier).
GPT‑5.3 Codex (xhigh) – ~60% score using ~22 k tokens.
Opus 4.6 (high) – ~58% score using ~20 k tokens.
The same model under different configurations (high, medium, low) can vary from >50% to ~40% score while token consumption drops from >20 k to ~4 k, giving developers a concrete trade‑off between cost and quality.
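To make that trade‑off concrete, here is a minimal sketch (not Cursor's tooling) of how a team might pick the cheapest configuration that still clears a target score; the per‑setting numbers are hypothetical, chosen only to match the rough ranges quoted above:

```python
# Minimal sketch of the cost/quality trade-off described above: given one
# model's score and median token usage at each setting, pick the cheapest
# setting that still clears a target score. The numbers are hypothetical,
# chosen only to match the rough ranges quoted in the article
# (>50% at >20k tokens down to ~40% at ~4k tokens).

CONFIGS = [
    # (setting, score %, median tokens per task)
    ("high",   52, 21_000),
    ("medium", 46,  9_000),
    ("low",    40,  4_000),
]

def cheapest_config(min_score: float):
    """Return the lowest-token setting whose score meets min_score, or None."""
    eligible = [c for c in CONFIGS if c[1] >= min_score]
    return min(eligible, key=lambda c: c[2]) if eligible else None

for target in (50, 45, 38):
    choice = cheapest_config(target)
    if choice is None:
        print(f"target >= {target}%: no setting qualifies")
    else:
        setting, score, tokens = choice
        print(f"target >= {target}%: use '{setting}' ({score}%, ~{tokens:,} tokens)")
```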
Key Takeaways
For developers: CursorBench provides a more realistic reference; GPT‑5.4 currently leads, followed by Opus 4.6 and GPT‑5.2.
For model vendors: solely chasing SWE‑bench scores is insufficient; real‑tool performance matters.
For the industry: the gap between public benchmarks and actual usage is prompting a decentralisation of evaluation, with more tool vendors likely to publish their own benchmarks.