SuanNi
Mar 2, 2026 · Artificial Intelligence
Why Leading AI Models Flunk the New ‘Humanity’s Last Exam’ Benchmark
The newly released Humanity’s Last Exam (HLE) benchmark poses 2,500 rigorously crafted multimodal questions across more than 100 disciplines. Leading AI models score below 50% accuracy on it and are poorly calibrated, often answering incorrectly with high confidence, which underscores the need for deeper AI evaluation.
Humanity's Last Exam · Multimodal Evaluation · Artificial Intelligence
13 min read
