Why Leading AI Models Flunk the New ‘Humanity’s Last Exam’ Benchmark

The newly released Humanity’s Last Exam (HLE) benchmark poses 2,500 rigorously vetted questions, some of them multimodal, across more than 100 disciplines. Leading AI models score below 50% accuracy on it and show severe calibration errors, underscoring the need for more demanding AI evaluation.

SuanNi

Background

Humanity’s Last Exam (HLE) is a multimodal benchmark published in Nature to probe the limits of modern AI systems. Existing evaluation tools such as MMLU, MATH, and GPQA have become saturated, with top‑tier models routinely achieving >90% accuracy, making it difficult to differentiate capabilities.

Motivation

Researchers observed that older test sets no longer challenge state‑of‑the‑art models and that many of their questions can be answered through simple web searches or basic reasoning, so a more rigorous assessment was needed. Experts from over 50 countries collaborated to build a hard, interdisciplinary, multimodal benchmark.

Construction of HLE

The project offered a $500,000 prize pool to attract top scholars. The first 50 selected questions earned $5,000 each, and the next 500 earned $500 each. Nearly 1,000 contributors from more than 500 institutions submitted original, closed‑ended questions in academic phrasing, with LaTeX formulas where needed, along with detailed solution rationales and their real names and affiliations.
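
As a quick sanity check, the two prize tiers quoted above should add up to the stated pool. The sketch below only re-does that arithmetic; the names `PRIZE_TIERS` and `total_payout` are illustrative and do not come from the HLE project.

```python
# Prize tiers as stated in the article: (number of questions, payout per question in USD).
PRIZE_TIERS = [(50, 5_000), (500, 500)]


def total_payout(tiers):
    """Sum count * payout over all prize tiers."""
    return sum(count * payout for count, payout in tiers)


print(total_payout(PRIZE_TIERS))  # 500000
```

Each tier accounts for $250,000, so the two together match the $500,000 pool.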

The final public test set contains 2,500 questions covering more than 100 sub‑disciplines. Subject distribution: 41% mathematics, 11% biology/medicine, 10% computer science/AI, 9% physics, 9% humanities/social sciences, 7% chemistry, 4% engineering, and 9% other interdisciplinary topics. About 14% of the items are multimodal, requiring simultaneous understanding of text and images.
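
To make the percentages concrete, the following sketch converts them into approximate question counts for the 2,500-item public set. The published shares are rounded, so the counts are estimates rather than official figures, and the dictionary keys are just labels chosen for this example.

```python
TOTAL_QUESTIONS = 2_500

# Subject shares in percent, as listed in the article (rounded values).
SUBJECT_SHARES = {
    "mathematics": 41,
    "biology/medicine": 11,
    "computer science/AI": 10,
    "physics": 9,
    "humanities/social sciences": 9,
    "chemistry": 7,
    "engineering": 4,
    "other": 9,
}

# The shares should cover the whole test set.
assert sum(SUBJECT_SHARES.values()) == 100

# Approximate number of questions per subject.
approx_counts = {
    subject: TOTAL_QUESTIONS * share // 100
    for subject, share in SUBJECT_SHARES.items()
}

# Roughly 14% of the items are multimodal.
approx_multimodal = TOTAL_QUESTIONS * 14 // 100

print(approx_counts["mathematics"])  # 1025
print(approx_multimodal)             # 350
```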

Filtering and Human Review

Before inclusion, each question was run through several leading multimodal models. If a model answered it correctly with ease, the question was discarded. For multiple‑choice items, which had five or more options, a question survived only if the models failed to outperform random guessing.

Over 70,000 automated attempts filtered out easy questions, leaving about 13,000 candidates for human review.
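
The automated filter described above can be sketched roughly as follows. This is a simplified illustration, not the project's actual pipeline: it treats a "model" as any callable that maps a prompt to an answer string, compares answers by normalized exact match, and does not model the separate rule for multiple-choice items. The names `Model` and `survives_filter` and the stub models are hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical stand-in: a model is any callable from prompt text to answer text.
Model = Callable[[str], str]


def survives_filter(prompt: str, reference: str, models: Sequence[Model]) -> bool:
    """Keep a question only if no model reproduces the reference answer."""
    normalized = reference.strip().lower()
    return all(model(prompt).strip().lower() != normalized for model in models)


# Stub models for demonstration only.
def weak_model(prompt: str) -> str:
    return "I don't know"


def strong_model(prompt: str) -> str:
    return "42"


# Made-up questions: (prompt, reference answer).
questions = [
    ("What is 6 * 7?", "42"),
    ("Hypothetical hard question", "an answer no stub produces"),
]

kept = [
    prompt
    for prompt, reference in questions
    if survives_filter(prompt, reference, [weak_model, strong_model])
]
print(kept)  # ['Hypothetical hard question']
```

A real filter would call model APIs and need a more forgiving answer comparison, but the accept-only-if-every-model-fails logic is the same.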

Human review consisted of two rounds. In the first round, 1–3 expert judges per question checked clarity, eliminated loopholes, and refined wording. In the second round, senior experts (e.g., law PhDs, medical doctors) scored each question using a standardized rubric. Roughly 6,000 questions earned a spot in the final candidate pool, from which the 2,500 public items were selected.

Evaluation Protocol

Top multimodal models—including GPT‑4o, Claude 3.5 Sonnet, and others—were evaluated using a standardized system prompt that forced the model to generate a step‑by‑step reasoning trace before producing the final answer. An auxiliary judge (o3‑mini) verified answer formats and handled numeric tolerance.
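
In the benchmark the judging is done by a language model. As a rough rule-based stand-in for the numeric tolerance idea, the sketch below compares two answers numerically when both parse as numbers and falls back to normalized text otherwise. The function name `answers_match` and the 1% default tolerance are assumptions made for illustration, not details from the source.

```python
import math


def answers_match(predicted: str, reference: str, rel_tol: float = 1e-2) -> bool:
    """Compare numerically if both answers parse as floats, else as normalized text."""
    try:
        return math.isclose(float(predicted), float(reference), rel_tol=rel_tol)
    except ValueError:
        # At least one answer is not a plain number: fall back to text comparison.
        return predicted.strip().lower() == reference.strip().lower()


print(answers_match("3.1416", "3.14159"))  # True
print(answers_match("Paris", " paris "))   # True
print(answers_match("2.0", "3.0"))         # False
```

A rule like this cannot handle equivalent symbolic forms or units, which is one reason to hand the comparison to a model-based judge.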

Model Performance on HLE

All models performed dramatically worse than on previous benchmarks, with overall accuracies below 50%. GPT‑4o exhibited a calibration error of 89%, meaning its stated confidence far exceeded its actual accuracy: it was extremely over‑confident even when wrong. Even models specifically fine‑tuned for reasoning showed calibration errors above 70%.
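
Calibration error measures the gap between how confident a model says it is and how often it is actually right. The sketch below implements one common variant, a binned root-mean-square calibration error; the article does not specify the exact formula used, so treat the binning scheme and the function name `rms_calibration_error` as assumptions. The inputs in the example are made-up toy values, not benchmark data.

```python
import math


def rms_calibration_error(confidences, correct, n_bins=10):
    """Binned RMS gap between stated confidence (0..1) and observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    squared_error = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin by the share of answers it contains.
        squared_error += (len(members) / total) * (mean_conf - accuracy) ** 2
    return math.sqrt(squared_error)


# Toy example: the model claims 90% confidence on ten answers but gets only one right.
toy_confidences = [0.9] * 10
toy_correct = [True] + [False] * 9
print(round(rms_calibration_error(toy_confidences, toy_correct), 2))  # 0.8
```

A well-calibrated model that says 90% and is right nine times out of ten would score close to zero on the same measure.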

GPT‑5, released after the public test set, achieved the highest accuracy of 25.3%, still far from human expert levels. The authors note the possibility of data leakage during training, which could inflate scores.

Insights from the Results

Models scored higher on multiple‑choice items than on exact‑answer fill‑in‑the‑blank questions, which suggests that guessing can inflate multiple‑choice scores.
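
To see how much guessing alone can contribute, the sketch below computes the expected accuracy from picking uniformly at random on multiple-choice items while scoring nothing on exact-answer items. The share of multiple-choice questions is a parameter here because the article does not state it; `guessing_floor` is a hypothetical helper name.

```python
def guessing_floor(mc_share: float, num_options: int) -> float:
    """Expected overall accuracy from uniform random guessing on multiple-choice items only."""
    if num_options < 1 or not 0.0 <= mc_share <= 1.0:
        raise ValueError("num_options must be >= 1 and mc_share must be in [0, 1]")
    return mc_share / num_options


print(guessing_floor(1.0, 5))  # 0.2
print(guessing_floor(0.5, 5))  # 0.1
```

With five options, a pure guesser gets 20% on multiple-choice items, whereas an exact-answer question offers essentially no such floor.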

Analysis of token usage showed an inverted‑U relationship: accuracy rose as the number of generated tokens increased, up to about 16,384 tokens, and then declined sharply. Beyond that point, extra computation hurt rather than helped, which highlights the need for more efficient reasoning algorithms.

Conclusion

The Humanity’s Last Exam benchmark provides a hard, interdisciplinary, and multimodal test that reveals the true performance ceiling of current AI systems. Despite impressive scores on legacy tests, leading models still lack the robust, cross‑domain reasoning required for genuine scientific inquiry.

These findings call for a shift toward more efficient, trustworthy reasoning methods rather than simply scaling compute.

Reference materials:

https://www.nature.com/articles/s41586-025-09962-4

https://lastexam.ai/
