How Open LLM Leaderboard v2 Redefines LLM Evaluation with New Benchmarks and Fair Scoring
Open LLM Leaderboard v2 introduces a revamped, reproducible evaluation framework for large language models. It replaces saturated benchmarks with six carefully curated, uncontaminated datasets, normalizes scores so every benchmark carries equal weight, pins an updated evaluation harness, adds community voting and a maintainer-recommended model list, and provides richer visualizations to guide the AI community.
Background and Motivation
The original Open LLM Leaderboard was created to provide a reproducible, comparable platform for evaluating large language models (LLMs), since published scores often shipped without code, relied on prompt tricks, or were inflated by marketing claims. Over its first year the leaderboard attracted more than two million unique visitors, with roughly 300,000 community members using it each month.
Why a More Challenging Leaderboard?
After extensive use, three major problems emerged:
Benchmarks became too easy; models reached human‑level performance on HellaSwag, MMLU, and ARC, indicating saturation.
Newer models showed signs of data contamination, inflating scores on datasets that resemble training data (e.g., GSM8K, TruthfulQA).
Some benchmarks contained errors: MMLU was found to include flawed questions (documented by MMLU‑Redux and addressed by MMLU‑Pro), and GSM8K's scoring relied on a specific end‑of‑generation token that unfairly penalized some models.
New Benchmark Selection
We selected six high‑quality, uncontaminated datasets, each measuring a distinct capability:
MMLU‑Pro: an expert‑reviewed, 10‑choice version of the massive multitask language understanding benchmark, reducing noise while increasing difficulty.
GPQA: a graduate‑level knowledge benchmark authored by domain experts, with gated access on the Hub to keep it out of training data (see the loading sketch after this list).
MuSR: a multi‑step soft‑reasoning dataset of roughly 1,000‑word problems (murder mysteries, object placement, team allocation) that demand long‑context reasoning.
MATH: the hardest (Level 5) tier of a high‑school‑competition math benchmark, with problems formatted in LaTeX and Asymptote and strict answer‑format requirements.
IFEval : an instruction‑following benchmark that tests strict adherence to keyword and format requirements.
BBH : a curated subset of 23 difficult tasks from BigBench, covering arithmetic, algorithmic reasoning, language understanding, and world knowledge.
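As a quick illustration, here is a minimal sketch of pulling two of these datasets for local inspection with the datasets library. The Hub IDs (TIGER-Lab/MMLU-Pro, Idavidrein/gpqa) reflect the public repositories as we understand them; GPQA is gated, so you must request access and authenticate first.

```python
# Minimal sketch: loading two v2 benchmark datasets for inspection.
# Hub IDs are assumptions based on the public repositories; GPQA is gated,
# so request access on the Hub and run `huggingface-cli login` beforehand.
from datasets import load_dataset

# MMLU-Pro: 10-choice questions with an "options" list and an "answer" field.
mmlu_pro = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
print(mmlu_pro[0]["question"])
print(mmlu_pro[0]["options"])

# GPQA (gated): graduate-level questions written by domain experts.
gpqa = load_dataset("Idavidrein/gpqa", "gpqa_main", split="train")
print(gpqa[0]["Question"])
```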
Scoring Methodology
Final model scores are now computed by normalizing each benchmark's raw score to a 0–100 scale, where 0 is the random‑guessing baseline and 100 is a perfect score. The normalized scores are then averaged, giving each benchmark equal weight. Without this step, a multiple‑choice task with a 25% random baseline would contribute 25 "free" points to the average and drown out benchmarks whose scores start near zero; normalization makes the overall ranking more reflective of true capability.
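A minimal sketch of this normalization, assuming scores at or below the random baseline clamp to zero (the leaderboard's exact edge‑case handling may differ):

```python
def normalize_score(raw: float, random_baseline: float) -> float:
    """Rescale a raw benchmark score (in %) so the random-guessing baseline
    maps to 0 and a perfect score maps to 100. Scores at or below the
    baseline clamp to 0 (an assumption about edge-case handling)."""
    if raw <= random_baseline:
        return 0.0
    return (raw - random_baseline) / (100.0 - random_baseline) * 100.0

# Example: MMLU-Pro has 10 answer choices, so random guessing scores ~10%.
# A raw accuracy of 46% normalizes to (46 - 10) / 90 * 100 = 40.
print(normalize_score(46.0, random_baseline=10.0))  # 40.0

# The final leaderboard score is the plain mean of the normalized scores,
# so each benchmark carries equal weight. Raw/baseline pairs are illustrative.
pairs = [(46.0, 10.0), (30.0, 25.0), (55.0, 0.0)]
final = sum(normalize_score(r, b) for r, b in pairs) / len(pairs)
print(round(final, 2))
```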
Updated Evaluation Suite
We continue to use EleutherAI’s lm-eval harness but have frozen a specific version to guarantee reproducibility. The new harness adds support for delta‑weight (LoRA) models, integrates a logging system compatible with the leaderboard, and enables evaluation with chat templates. All task implementations have been manually audited and a test suite is being added to detect future regressions.
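For reference, here is a hedged sketch of what a leaderboard‑style run looks like through the harness's Python API. The task names (leaderboard_ifeval, leaderboard_bbh) and the apply_chat_template flag follow recent lm-eval releases rather than the leaderboard's exact invocation, and the model ID is only an example; pin the same harness version the leaderboard uses for faithful reproduction.

```python
# Sketch: a leaderboard-style evaluation with a pinned lm-eval version.
# Task names and flags follow recent lm-eval releases (assumptions, not the
# leaderboard's exact invocation); the model ID is illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2-7B-Instruct,dtype=bfloat16",
    tasks=["leaderboard_ifeval", "leaderboard_bbh"],
    apply_chat_template=True,  # evaluate with the model's chat template
    batch_size="auto",
)
print(results["results"])
```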
Community Features
A voting system now lets authenticated Hugging Face users up‑vote submitted models; the most‑voted models receive priority in the evaluation queue. We also introduced a “Maintainer‑Recommended” list, curated by the community and the Hugging Face team, to highlight high‑quality, widely useful LLMs.
Interface Improvements
The frontend has been accelerated with a Gradio component that loads data client‑side for near‑instant filtering and searching. A new visualizer (https://open-llm-leaderboard-generationvisualizer.hf.space/) lets users explore evaluation results interactively.
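If you prefer working with the underlying data directly, something like the following should work; the dataset ID open-llm-leaderboard/contents is an assumption about where the v2 leaderboard publishes its aggregated table, so check the Hub if it has moved.

```python
# Sketch: pulling the leaderboard's aggregated results table programmatically.
# The dataset ID below is an assumption about the v2 data location.
from datasets import load_dataset

contents = load_dataset("open-llm-leaderboard/contents", split="train")
df = contents.to_pandas()
print(df.columns.tolist())  # inspect the available score columns
print(df.head())
```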
Top Model Rankings (v2)
1 Qwen/Qwen2-72B‑Instruct – average score 43.02
2 meta-llama/Meta‑Llama‑3‑70B‑Instruct – average score 36.67
3 microsoft/Phi‑3‑medium‑4k‑instruct
4 01‑ai/Yi‑1.5‑34B‑Chat
5 CohereForAI/c4ai‑command‑r‑plus
6 abacusai/Smaug‑72B‑v0.1
Insights on Evaluation Correlations
Correlation analysis shows that MMLU‑Pro and BBH are highly related and both align well with human preference scores from the LMSys chatbot arena. IFEval strongly reflects instruction‑following ability, favoring chat‑tuned models. MATH‑Level5 correlates with GSM8K, but some models that were penalized on GSM8K now perform well on the new math benchmark.
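This kind of analysis is straightforward to reproduce on any table of per‑model scores; here is a sketch with placeholder numbers, not actual leaderboard results.

```python
# Sketch: benchmark-correlation analysis on per-model normalized scores.
# All numbers and model names below are placeholders, not leaderboard data.
import pandas as pd

scores = pd.DataFrame(
    {
        "MMLU-Pro":  [40.1, 35.2, 28.7, 22.4],
        "BBH":       [48.6, 42.3, 33.1, 25.9],
        "IFEval":    [79.5, 77.1, 54.2, 33.0],
        "MATH-Lvl5": [23.7, 18.9, 9.6, 4.1],
    },
    index=["model_a", "model_b", "model_c", "model_d"],
)

# Spearman rank correlation is robust to the benchmarks' differing scales.
print(scores.corr(method="spearman"))
```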
Future Directions
Looking across the roughly 7,400 models evaluated so far, we observe a shift toward smaller models that match the performance of their larger predecessors, a trend that benefits deployment efficiency. Because the v2 benchmarks are far more demanding, baseline scores start much lower, leaving ample headroom for future improvements.
We will continue to archive past results (https://hf.co/open-llm-leaderboard-old) and expand the leaderboard to keep pace with rapid LLM advances.