Tagged articles
1 article
SuanNi
Mar 2, 2026 · Artificial Intelligence

Why Leading AI Models Flunk the New ‘Humanity’s Last Exam’ Benchmark

The newly released Humanity’s Last Exam (HLE) benchmark, with 2,500 rigorously crafted multimodal questions across more than 100 disciplines, exposes severe shortcomings in leading AI models: their accuracy stays below 50% and they show large calibration errors, underscoring the need for deeper AI evaluation.

Humanity's Last Exam · Multimodal Evaluation · artificial intelligence
13 min read