Why the Human Turing Test Is No Longer Enough: Agents’ Last Exam Benchmark
The article introduces Agents’ Last Exam (ALE), a benchmark created by Berkeley and more than 250 experts to evaluate generalist computer-use agents on real-world, multi-step workflows across 55 sub-fields. The results show that even the strongest models achieve only single-digit pass rates.
