Mar 2, 2026 · Artificial Intelligence

Why Leading AI Models Flunk the New ‘Humanity’s Last Exam’ Benchmark

The newly released Humanity’s Last Exam (HLE) benchmark comprises 2,500 rigorously crafted multimodal questions spanning more than 100 disciplines. Leading AI models score below 50% accuracy on it and show large calibration errors, underscoring the need for more rigorous AI evaluation.

Humanity's Last Exam · Multimodal Evaluation · artificial-intelligence

13 min read
