Top 10 Open‑Source LLM Benchmarks: Scores, Rankings, and What They Test
This article walks through ten mainstream open‑source large‑model benchmarks—SWE‑bench Verified and Pro, MMLU‑Pro, GPQA Diamond, HLE, AIME, HMMT, olmOCR‑bench, Terminal‑Bench 2.0, and EvasionBench—explaining their data, evaluation metrics, current leading models, and the capability dimensions they reveal.
Open‑source large language models are evaluated on a variety of benchmarks that measure distinct capabilities. This summary details ten benchmarks, their producers, test focus, data format, primary metrics, representative examples, and the current leading open‑source models.
SWE‑bench Verified – Real‑code bug‑fix test
Producer : OpenAI (Preparedness team) × Princeton (authors of the original SWE‑bench)
What it tests : AI agents fixing real GitHub issues in open‑source Python projects.
Data format : 500 manually screened tasks from 12 popular repositories (e.g., Django, sympy, scikit‑learn).
Scoring : Each task provides two unit tests – FAIL_TO_PASS (must pass after the fix) and PASS_TO_PASS (must remain passing). Both must succeed for the task to be counted as solved.
Metric : Resolved rate – the percentage of tasks on which all tests pass.
The benchmark is called “Verified” because the original SWE‑bench contained ambiguous or unreliable tests. OpenAI engineers filtered 2,294 questions down to 500 high‑quality, human‑solvable items, creating a community‑accepted clean version.
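The FAIL_TO_PASS / PASS_TO_PASS rule reduces to a simple conjunction; a minimal sketch (function and test names are illustrative, not the official harness):

```python
def is_resolved(fail_to_pass: dict, pass_to_pass: dict) -> bool:
    """A task counts as resolved only if every FAIL_TO_PASS test now
    passes AND every PASS_TO_PASS test still passes (no regressions).
    Each dict maps a test name to its pass/fail result after the patch."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())

# A patch that fixes the bug but breaks an existing test is not counted.
assert is_resolved({"test_bugfix": True}, {"test_existing": True})
assert not is_resolved({"test_bugfix": True}, {"test_existing": False})
```

The PASS_TO_PASS half is what makes the benchmark a regression test as well as a bug‑fix test.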
Current top model: DeepSeek‑V4‑Pro (DeepSeek‑V4‑Flash ranks third).
SWE‑bench Pro – Industrial‑scale long‑horizon code tasks
Producer : Scale AI
What it tests : Agent performance on larger, messier, longer‑chain engineering tasks.
Data format : 1,865 manually verified tasks covering 41 repositories; average patch modifies >100 lines across multiple files.
Core innovation : Anti‑contamination design using GPL‑strong copyleft repositories plus commercial closed‑source repositories to reduce training‑data overlap.
Dataset split :
Public Set – 731 questions from 11 open‑source repos (publicly evaluable).
Held‑Out Set – 858 questions from 12 private repos (held out to prevent over‑fitting).
Commercial Set – 276 questions from 18 commercial repos (rankings shown, data not released).
Primary metric : Resolve Rate – whether the agent’s patch builds and passes all tests inside an isolated Docker environment.
Result : Top models achieve ~25% Pass@1, providing strong discrimination.
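Pass@1 resolve rate is a plain fraction over single‑attempt outcomes; a minimal sketch (hypothetical function name):

```python
def resolve_rate(results: list[bool]) -> float:
    """Pass@1 resolve rate: fraction of tasks whose patch built and
    passed all tests inside the Docker environment on the one attempt."""
    return sum(results) / len(results)

# 5 of 20 tasks resolved -> 25%, roughly where top models sit today.
assert resolve_rate([True] * 5 + [False] * 15) == 0.25
```

Because each task is all‑or‑nothing, a ~25% ceiling leaves plenty of headroom to separate models.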
Current top model: Kimi‑K2.6 .
MMLU‑Pro – Harder version of MMLU, 14‑subject mixed reasoning
Producer : University of Waterloo's TIGER‑Lab (NeurIPS 2024 paper).
What it tests : Cross‑disciplinary knowledge plus multi‑step reasoning.
Data format : Over 12,000 questions covering mathematics, physics, chemistry, biology, computer science, economics, law, psychology, philosophy, etc. (14 subjects).
Key changes : Options expanded from 4 to 10 (random‑guess probability reduced from 25% to 10%); noisy questions removed; more multi‑step reasoning items added.
Effect : Models scoring 88‑90% on the original MMLU drop by 16‑33 percentage points on MMLU‑Pro, restoring discriminative power.
Reasoning impact : Adding chain‑of‑thought (CoT) prompting can boost scores by up to 20 percentage points, indicating the benchmark measures reasoning rather than memorisation.
Current top model: Qwen3.5‑397B‑A17B (Qwen3.6 is not open‑source).
GPQA Diamond – Doctor‑level scientific reasoning
Producer : NYU + Cohere + Anthropic joint research team.
What it tests : Hard‑core reasoning in biology, physics, and chemistry at the PhD level.
Data format : 198 hardest questions selected from the original GPQA 448‑question set; all questions authored and reviewed by PhDs.
Core feature : “Google‑Proof” – even expert web searches cannot find the answers; models must rely on true understanding.
Human reference scores :
In‑discipline PhD experts: ~81% accuracy.
Out‑of‑discipline high‑level non‑experts (using web search): ~22% (≈ random).
Example : An NMR‑spectrum chemical‑shift question with four carefully crafted options that cannot be solved by shortcut searches.
Current top model: Kimi‑K2.6 .
HLE (Humanity’s Last Exam) – The final academic benchmark
Producer : Center for AI Safety × Scale AI, published in Nature (Jan 2026).
What it tests : Closed‑book exam covering the frontier of human knowledge across 100+ subjects (math, engineering, humanities, medicine, computer science, etc.).
Data format : 2,500 public questions (plus a private set to prevent over‑fitting); ~24% multiple‑choice, the rest short‑answer exact‑match; ~14% multimodal with images.
Human performance : Domain experts achieve ~90%.
Model performance : State‑of‑the‑art open‑source models reach 40‑50%.
Evaluation : Each answer is automatically verifiable (exact match or single‑choice) and also measures calibration – whether the model knows when it is wrong.
Current top model: Kimi‑K2.6 .
AIME 2026 – High‑school Olympiad math reasoning
Producer : American Invitational Mathematics Examination (MAA).
What it tests : Multi‑step symbolic reasoning in algebra, geometry, number theory, and combinatorics.
Data format : 30 questions (15 AIME I + 15 AIME II, Feb 2026); each answer is an integer 0–999; no partial credit.
Evaluation : Pass@1 exact match, closed‑book, no tools or search assistance.
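Because every AIME answer is an integer in 0–999 with no partial credit, grading reduces to parsing the final number and exact matching; a hypothetical parser (not the official harness):

```python
import re

def grade_aime(model_output: str, answer_key: int) -> bool:
    """Take the last integer in the model's output as its final answer;
    valid AIME answers are integers 0-999, no partial credit."""
    matches = re.findall(r"\d+", model_output)
    if not matches:
        return False
    value = int(matches[-1])
    return 0 <= value <= 999 and value == answer_key

assert grade_aime("... so the answer is 204.", 204)
assert not grade_aime("The answer is 204.", 205)
```

Real harnesses are stricter about answer extraction (boxed answers, final‑line markers), but the pass/fail logic is this simple.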
Why AIME is used :
Fresh, uncontaminated – questions are released only in February, so any model whose training data was frozen earlier faces a true blind test.
Cannot be memorised – all 30 questions are new each year.
Forces chain‑of‑thought – each problem requires 5‑10 reasoning steps.
Difficulty is sufficient – harder than GSM8K or MATH.
Human baseline : The median contestant solves 4‑6 of the 15 problems (≈30‑40%).
Model baseline : State‑of‑the‑art LLMs achieve >95%.
Current top model: Step‑3.5‑Flash .
HMMT Feb 2026 – Harvard‑MIT math competition
Producer : Harvard‑MIT Math Tournament; evaluation platform primarily MathArena (ETH Zurich SRI Lab).
What it tests : Similar to AIME but overall harder – positioned between AIME and International Olympiad level.
Data format : 2026 February contest questions covering algebra, geometry, number theory, combinatorics; some open‑ended answers.
Core value : Anti‑contamination – questions released immediately after the contest, ensuring models have not seen them during training.
Observed performance : Models that score >95% on AIME typically drop to 80‑90% on HMMT.
Current top model: Kimi‑K2.6 .
olmOCR‑bench – Unit‑test style OCR evaluation
Producer : Allen Institute for AI (AI2).
What it tests : OCR and document understanding on real complex documents (formulas, tables, reading order, scanned pages, multi‑column layouts, etc.).
Data format : 1,403 real or synthetic PDFs with >7,000 binary pass/fail unit tests.
Innovation : Replaces coarse page‑level edit distance with machine‑verifiable factual assertions.
Example assertions :
"This piece of text must appear in the correct order."
"The variable x must be in the numerator of this mathematical formula."
"The value of cell A1 in Table A must appear above cell B1."
"Headers/footers should not appear in the main body."
Scenarios covered : arXiv formulas, nested tables, multi‑column layouts, old scans, dense small text, header/footer removal.
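The "must appear in the correct order" style of assertion is a binary pass/fail check on the OCR output; an illustrative sketch (olmOCR‑bench's real tests are richer, and this function name is hypothetical):

```python
def appears_in_order(ocr_text: str, snippets: list[str]) -> bool:
    """Binary unit test: every required snippet must occur in the OCR
    output, and in the given reading order (e.g. the value of cell A1
    before the value of cell B1)."""
    pos = 0
    for snippet in snippets:
        idx = ocr_text.find(snippet, pos)
        if idx == -1:
            return False  # snippet missing, or out of order
        pos = idx + len(snippet)
    return True

page = "Revenue 2024 ... 1,403 PDFs ... see Appendix"
assert appears_in_order(page, ["Revenue", "1,403", "Appendix"])
assert not appears_in_order(page, ["Appendix", "Revenue"])
```

Machine‑verifiable assertions like this are what let the benchmark replace fuzzy edit‑distance scores with thousands of crisp pass/fail verdicts.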
Terminal‑Bench 2.0 – Agent in a real Linux terminal
Producer : Stanford × Laude Institute, with contributions from Anthropic and other frontier labs.
What it tests : AI agents completing end‑to‑end engineering tasks inside a real Linux terminal.
Data format : 80+ handcrafted tasks (version 2.0); each runs in an isolated Docker container with automated pass/fail evaluation.
Coverage : Software engineering (build/debug/deploy), system administration (server config/network), security (vulnerability assessment/encryption), scientific computing (protein assembly/data pipelines), machine learning (model training/inference deployment).
Task design principles :
Solvable – human reference solution exists.
Realistic – mirrors genuine work scenarios.
Well‑specified – success criteria are clear and automatically checkable.
Example tasks :
Compile a specific Linux kernel version and apply a patch.
Configure a self‑signed TLS certificate for an internal service.
Debug a Python async concurrency bug.
Run a full ML training run under GPU memory and precision constraints.
Evaluation framework : Harbor – manages agent lifecycles, command interaction, and logging.
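Each task's verdict is decided by running an automated verification command inside the container and reading its exit code; a minimal sketch of that binary check (the function name is illustrative, not Harbor's actual API):

```python
import subprocess
import sys

def run_check(cmd: list[str], timeout: int = 60) -> bool:
    """Terminal-Bench-style binary verdict: run the task's verification
    command; exit code 0 within the time limit means the task passed."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

# Illustrative stand-ins for checks like "kernel built" or "TLS cert valid":
assert run_check([sys.executable, "-c", "import sys; sys.exit(0)"])
assert not run_check([sys.executable, "-c", "import sys; sys.exit(1)"])
```

In the real benchmark the check scripts inspect build artifacts, service responses, or file contents, but the pass/fail decision is this exit‑code contract.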
Current top model: GLM‑5.1 (outperforming Claude Opus).
EvasionBench – Detecting evasive answers
Producer : Open‑source team (IIIIQIIII), paper on arXiv 2601.09142.
What it tests : Whether a model evades sensitive or controversial questions by using evasive language, giving indirect answers, or refusing.
Data source : 2.27 M Q&A pairs from S&P Capital IQ earnings‑call transcripts; filtered to 84 k training examples and 1 k gold‑standard test items annotated by experts.
Evasion levels :
Direct – fully and clearly answers the core question.
Intermediate – provides adjacent information, sidesteps, or answers indirectly.
Fully Evasive – ignores, refuses, or goes off‑topic.
Annotation method : Multi‑Model Consensus (MMC); inter‑annotator agreement Cohen’s κ = 0.835.
Companion classifier : Eva‑4B (4 B‑parameter model fine‑tuned from Qwen3‑4B) achieves Macro‑F1 84.9% on the gold set, outperforming Claude 4.5, GPT‑5.2, and Gemini 3 Flash.
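Macro‑F1 averages per‑class F1 without weighting by class frequency, so the rarer "Fully Evasive" class counts as much as the common "Direct" class; a self‑contained sketch of the metric:

```python
def macro_f1(gold: list[str], pred: list[str], labels: list[str]) -> float:
    """Unweighted mean of per-class F1 scores over the given labels."""
    f1s = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

labels = ["direct", "intermediate", "fully_evasive"]
gold = ["direct", "direct", "intermediate", "fully_evasive"]
pred = ["direct", "intermediate", "intermediate", "fully_evasive"]
assert abs(macro_f1(gold, pred, labels) - (2/3 + 2/3 + 1) / 3) < 1e-9
```

Using macro rather than micro averaging keeps a classifier from scoring well simply by always predicting the majority "Direct" label.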
❝ LLM evaluation is shifting from "is the answer correct" to "is the answer truthful" and "does it evade" – an intriguing direction. ❞
Capability dimensions across the ten benchmarks
Code engineering : SWE‑bench Verified, SWE‑bench Pro.
Comprehensive knowledge + reasoning : MMLU‑Pro, GPQA Diamond, HLE.
Mathematical reasoning : AIME 2026, HMMT Feb 2026.
Multimodal / document understanding : olmOCR‑bench.
Agent real‑world : Terminal‑Bench 2.0.
Honesty / alignment : EvasionBench.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
