Cursor’s Own Large‑Model Benchmark Shakes Up SWE‑bench Rankings
Although SWE‑bench scores for top coding models now differ by only a tenth of a point, Cursor’s newly released CursorBench reveals dramatic ranking changes, highlights three fundamental flaws in public benchmarks, and introduces token‑efficiency as a crucial evaluation dimension.
Why SWE‑bench Scores May Be Misleading
Cursor argues that scores on SWE‑bench have converged around 80 (e.g., 80.9 vs. 80.8), making it increasingly hard to tell which coding model is genuinely better.
Three Core Problems with Public Benchmarks
Task type is too narrow: SWE‑bench focuses on bug‑fixing, while real development involves a wide variety of requests.
Scoring method is flawed: only a single “standard answer” is accepted, penalising correct alternative solutions.
Data contamination: OpenAI halted SWE‑bench Verified after discovering that up to 60% of unsolved problems were contaminated, suggesting models may be memorising patches rather than solving them.
How CursorBench Works
CursorBench extracts tasks from real Cursor usage logs, traces code back to the originating AI request with a tool called Cursor Blame, and reconstructs the developer’s actual intent.
Tasks come from internal codebases and controlled sources, minimising training‑data leakage.
The benchmark is updated every few months to reflect evolving developer habits.
Task descriptions are deliberately brief and vague, mirroring how developers actually converse with AI.
Scale is larger: CursorBench‑3 doubles the number of lines of code and files compared with the initial version.
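Cursor has not published the task format itself; purely as an illustration of the ingredients described above (a brief real‑world prompt, the repo context it came from, and the edit the developer ultimately accepted), a task record might look roughly like this hypothetical sketch:

```python
# Hypothetical sketch only: CursorBench's real task schema is not public.
# The fields below merely illustrate the ingredients described above: a brief,
# conversational prompt, the codebase state it was issued against, and the
# change the developer ultimately accepted (recovered via request tracing).

from dataclasses import dataclass, field

@dataclass
class BenchTask:
    prompt: str                       # deliberately short, vague developer request
    repo_snapshot: str                # identifier for the codebase state at request time
    files_in_scope: list[str] = field(default_factory=list)
    accepted_diff: str = ""           # the edit the developer kept

example = BenchTask(
    prompt="make the retry logic back off exponentially",   # invented example
    repo_snapshot="internal-repo@<commit>",                  # placeholder, not a real ref
    files_in_scope=["src/http/client.py"],
    accepted_diff="<unified diff of the accepted change>",
)
print(example.prompt)
```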
Ranking Reshuffle: SWE‑bench vs. CursorBench
When the same models are evaluated on CursorBench, the gaps widen dramatically (scores shown as SWE‑bench → CursorBench):
Opus 4.5: 80.9 → 48.4.
Opus 4.6: 80.8 → 58.2 (rank 1 on CursorBench).
Gemini 3.1: 80.6 → 50.7.
GPT‑5.2: 80.0 → 56.5.
Sonnet 4.5: 77.2 → 37.9 (near the bottom).
Composer 1.5 (Cursor’s own model): 74.8 → 44.2.
These results show that high SWE‑bench scores do not necessarily translate to strong performance in real‑world coding scenarios.
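One way to quantify the reshuffle (our own back‑of‑the‑envelope calculation, not a figure from Cursor's report) is the Spearman rank correlation between the two score columns for the six models listed above:

```python
# Illustrative calculation, not from Cursor's report: Spearman rank correlation
# between SWE-bench and CursorBench scores for the six models quoted above.
# A value of 1.0 would mean both benchmarks rank the models identically.

scores = {
    # model:        (SWE-bench, CursorBench)
    "Opus 4.5":     (80.9, 48.4),
    "Opus 4.6":     (80.8, 58.2),
    "Gemini 3.1":   (80.6, 50.7),
    "GPT-5.2":      (80.0, 56.5),
    "Sonnet 4.5":   (77.2, 37.9),
    "Composer 1.5": (74.8, 44.2),
}

def ranks(values):
    """Rank from highest (1) to lowest; the data above has no ties."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

swe_ranks = ranks([s for s, _ in scores.values()])
cur_ranks = ranks([c for _, c in scores.values()])

n = len(scores)
d_sq = sum((a - b) ** 2 for a, b in zip(swe_ranks, cur_ranks))
rho = 1 - 6 * d_sq / (n * (n**2 - 1))
print(f"Spearman rho = {rho:.2f}")  # ~0.54, i.e. a substantial reordering
```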
Token‑Efficiency Frontier
CursorBench also reports median token usage per task. A scatter‑plot of score against median tokens reveals a frontier of models that achieve high scores while using relatively few tokens:
GPT‑5.4 (high) – ~63% score using ~16 k tokens (top of the frontier).
GPT‑5.3 Codex (xhigh) – ~60% score using ~22 k tokens.
Opus 4.6 (high) – ~58% score using ~20 k tokens.
The same model under different configurations (high, medium, low) can vary from >50% to ~40% score while token consumption drops from >20 k to ~4 k, giving developers a concrete trade‑off between cost and quality.
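To make that trade‑off concrete, here is a minimal sketch (not Cursor's tooling) of how a team might pick the cheapest configuration that still clears a target score; the per‑setting numbers are hypothetical, chosen only to match the rough ranges quoted above:

```python
# Minimal sketch of the cost/quality trade-off described above: given one
# model's score and median token usage at each setting, pick the cheapest
# setting that still clears a target score. The numbers are hypothetical,
# chosen only to match the rough ranges quoted in the article
# (>50% at >20k tokens down to ~40% at ~4k tokens).

CONFIGS = [
    # (setting, score %, median tokens per task)
    ("high",   52, 21_000),
    ("medium", 46,  9_000),
    ("low",    40,  4_000),
]

def cheapest_config(min_score: float):
    """Return the lowest-token setting whose score meets min_score, or None."""
    eligible = [c for c in CONFIGS if c[1] >= min_score]
    return min(eligible, key=lambda c: c[2]) if eligible else None

for target in (50, 45, 38):
    choice = cheapest_config(target)
    if choice is None:
        print(f"target >= {target}%: no setting qualifies")
    else:
        setting, score, tokens = choice
        print(f"target >= {target}%: use '{setting}' ({score}%, ~{tokens:,} tokens)")
```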
Key Takeaways
For developers: CursorBench provides a more realistic reference; GPT‑5.4 currently leads, followed by Opus 4.6 and GPT‑5.2.
For model vendors: solely chasing SWE‑bench scores is insufficient; real‑tool performance matters.
For the industry: the gap between public benchmarks and actual usage is prompting a decentralisation of evaluation, with more tool vendors likely to publish their own benchmarks.