Why the Human Turing Test Is No Longer Enough: Agents’ Last Exam Benchmark
The article introduces Agents’ Last Exam (ALE), a benchmark built by a Berkeley‑led team with more than 250 experts to evaluate generalist computer‑use agents on real‑world, multi‑step workflows across 55 sub‑fields. On its hardest tier, even the strongest agent configuration passes fewer than 10% of tasks.
Motivation
Recent AI progress has pushed the strongest Claude model to nearly 65% on Humanity’s Last Exam (HLE), and traditional question‑answer benchmarks are approaching saturation. To assess capabilities that matter for economic productivity, a Berkeley‑led team of more than 250 industry experts created a harder benchmark called Agents’ Last Exam (ALE).
The authors call the gap between benchmark success and real‑world output the “utility problem”. Existing benchmarks are narrow, built from data that is easy to collect, and focused on short synthetic tasks, while core economic domains such as finance, law, and engineering lack comparable evaluations. ImageNet’s impact on computer‑vision research is cited as an example of how a widely adopted, verifiable metric can accelerate progress.
Benchmark Design
ALE comprises 960 expert‑written workflows (1,490 task instances) covering 55 sub‑fields within 13 industry clusters. Tasks are sourced from real projects completed by practitioners and vetted through multiple quality‑control rounds. The industry taxonomy follows the O*NET/SOC 2018 classification, and non‑digital occupations are excluded.
Task Selection Criteria
Representativeness: workflows must use the software actually employed by professionals (e.g., SolidWorks or Rhino for architects, not AutoCAD).
Complexity: tasks require end‑to‑end deliverables and multiple UI interactions; single‑step operations are omitted. Example: moving a cheetah from one video to another in DaVinci Resolve demands tracking, masking, compositing, and color grading.
Verifiability: outputs must be deterministically checkable. Example: an RPG level built with RPG Maker XP can be compared against a reference file because map geometry, character stats, and event states are fully scripted.
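The verifiability criterion can be illustrated with a small deterministic comparison. This is only a sketch under stated assumptions: it supposes the level state has already been exported to a nested dictionary (a hypothetical format, not RPG Maker XP's native one), and the names diff_state and passes are invented for illustration rather than taken from the ALE codebase.

```python
from typing import Any


def diff_state(reference: dict[str, Any], produced: dict[str, Any], path: str = "") -> list[str]:
    """Return human-readable mismatches between two nested state dictionaries.

    Keys are visited in sorted order, so the same inputs always give the
    same report (no randomness, no model judgment).
    """
    mismatches: list[str] = []
    for key in sorted(set(reference) | set(produced)):
        here = f"{path}.{key}" if path else key
        if key not in produced:
            mismatches.append(f"{here}: missing in output")
        elif key not in reference:
            mismatches.append(f"{here}: unexpected in output")
        elif isinstance(reference[key], dict) and isinstance(produced[key], dict):
            # Recurse into nested sections such as "map" or "hero".
            mismatches.extend(diff_state(reference[key], produced[key], here))
        elif reference[key] != produced[key]:
            mismatches.append(f"{here}: expected {reference[key]!r}, got {produced[key]!r}")
    return mismatches


def passes(reference: dict[str, Any], produced: dict[str, Any]) -> bool:
    """A task passes only when there are no mismatches at all."""
    return not diff_state(reference, produced)


# Hypothetical exported states, for illustration only.
reference_level = {"map": {"width": 20, "height": 15}, "hero": {"hp": 100}}
produced_level = {"map": {"width": 20, "height": 14}, "hero": {"hp": 100}}
print(diff_state(reference_level, produced_level))
```

Because every field is compared exactly, two runs over the same files always produce the same verdict, which is the property the criterion asks for.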
Difficulty Tiers
Near‑Term: 59 tasks; current agents achieve up to 42% pass rate.
Full‑Spectrum: 55 tasks; guarantees at least one instance per sub‑field for comprehensive evaluation.
Last‑Exam: 36 hardest tasks; most agents obtain 0% full pass rate.
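As a quick consistency check on the tier sizes, the arithmetic can be written out. The sketch below uses only the three counts listed above; the names TIER_SIZES, total_tasks, and tier_shares are illustrative.

```python
# Tier sizes exactly as listed in the article.
TIER_SIZES = {"Near-Term": 59, "Full-Spectrum": 55, "Last-Exam": 36}


def total_tasks(tiers: dict[str, int]) -> int:
    """Sum the task counts over all tiers."""
    return sum(tiers.values())


def tier_shares(tiers: dict[str, int]) -> dict[str, float]:
    """Percentage of tasks in each tier, rounded to one decimal place."""
    total = total_tasks(tiers)
    return {name: round(100 * count / total, 1) for name, count in tiers.items()}


print(total_tasks(TIER_SIZES))
print(tier_shares(TIER_SIZES))
```

The three tiers add up to 150 tasks, which matches the size of the public subset cited in the limitations section.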
Agent Capability Model
ALE evaluates Generalist Computer‑Use Agents (GCUA) that must operate both via graphical user interfaces (GUI) and command‑line interfaces (CLI). The authors decompose agent functionality into five layers:
Brain: LLM reasoning and planning.
Eyes: GUI perception through screenshots.
Body: Flow control and orchestration.
Hands: Structured tool invocation (e.g., API calls, file operations).
Feet: Runtime execution of commands and scripts.
Traditional CLI agents have Brain, Body, Hands, and Feet but lack Eyes; pure GUI agents have Brain and Eyes but miss Body, Hands, and Feet, preventing them from completing ALE’s full‑spectrum tasks.
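The five-layer decomposition can be modeled as simple set membership. The following is a minimal sketch, not the authors' implementation; the Layer enum and the helper names are assumptions made for illustration.

```python
from enum import Enum


class Layer(Enum):
    BRAIN = "LLM reasoning and planning"
    EYES = "GUI perception through screenshots"
    BODY = "Flow control and orchestration"
    HANDS = "Structured tool invocation"
    FEET = "Runtime execution of commands and scripts"


ALL_LAYERS = frozenset(Layer)

# Layer sets as described in the article.
CLI_AGENT = frozenset({Layer.BRAIN, Layer.BODY, Layer.HANDS, Layer.FEET})
GUI_AGENT = frozenset({Layer.BRAIN, Layer.EYES})


def missing_layers(agent: frozenset) -> frozenset:
    """Layers the agent lacks relative to the full five-layer model."""
    return ALL_LAYERS - agent


def is_generalist(agent: frozenset) -> bool:
    """An agent counts as a generalist only if it covers all five layers."""
    return agent >= ALL_LAYERS


print(sorted(layer.name for layer in missing_layers(CLI_AGENT)))
print(sorted(layer.name for layer in missing_layers(GUI_AGENT)))
print(is_generalist(CLI_AGENT | GUI_AGENT))
```

In this model, neither agent type is a generalist on its own, while the union of the two layer sets covers everything a GCUA needs.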
Evaluation Infrastructure
Each task runs on a remote virtual machine with a standardized directory layout:
input/ # read‑only assets
software/ # pre‑installed applications
output/ # sole writable location for the agent
reference/ # ground‑truth files used only for scoring
Agents observe the environment, select actions, and execute until termination. Scoring is deterministic; 93.2% of tasks can be auto‑graded without human input. For the remaining tasks, narrow, evidence‑anchored yes/no checks are performed by an LLM rather than open‑ended judgment.
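One way to picture deterministic scoring over this layout is a byte-level comparison of output/ against reference/. This is a sketch under a strong assumption: real ALE checkers are presumably task-specific, and exact file equality is used here only because it is the simplest deterministic rule. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_outputs(task_root: Path) -> dict[str, bool]:
    """Map each reference file (by relative path) to whether output/ matches it.

    Assumes task_root contains the output/ and reference/ directories
    from the layout described above.
    """
    reference_dir = task_root / "reference"
    output_dir = task_root / "output"
    results: dict[str, bool] = {}
    for ref_file in sorted(reference_dir.rglob("*")):
        if not ref_file.is_file():
            continue
        rel = ref_file.relative_to(reference_dir)
        candidate = output_dir / rel
        # A missing file counts as a failed check rather than an error.
        results[rel.as_posix()] = candidate.is_file() and sha256_of(candidate) == sha256_of(ref_file)
    return results


def task_passed(task_root: Path) -> bool:
    """Pass only if there is at least one reference file and all of them match."""
    results = check_outputs(task_root)
    return bool(results) and all(results.values())
```

Calling task_passed(Path("/path/to/task")) on a task directory would return True only when every reference file has an identical counterpart under output/; the path shown is a placeholder.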
Results
On the Last‑Exam tier, the strongest configuration (Codex + GPT‑5.5) achieves an overall pass rate of 8.6%, while the average across mainstream agents is 2.6%. Claude Code + Opus 4.7 scores 0% full pass and a mean of 2.1%.
Failure Analysis (Claude Code + Opus 4.7)
31% of failures stem from misunderstanding the task description.
47% stem from selecting an incorrect method despite correct understanding.
22% stem from execution errors when the method is correct.
Method selection is the largest single category, and only 22% of failures are pure execution errors, so domain knowledge rather than raw execution ability is the primary bottleneck. Agents also default to CLI tools even when a task specifies a GUI application: 34% of tasks require GUI interaction, yet agents rarely use the GUI.
Model vs. Framework Impact
Controlled experiments swapping only the LLM while keeping the agent framework fixed produced a performance swing of up to 18 percentage points. Swapping only the framework while fixing the model altered performance by about 5–6 points, indicating that model choice is roughly three times more influential than framework selection.
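The "roughly three times" figure follows from dividing the two swings. A small worked calculation, using only the numbers quoted above (the function name influence_ratio is illustrative):

```python
def influence_ratio(model_swing: float, framework_swing: float) -> float:
    """How many times larger the model swing is than the framework swing."""
    return model_swing / framework_swing


MODEL_SWING = 18.0                 # percentage points, upper bound from the article
FRAMEWORK_SWING_RANGE = (5.0, 6.0) # percentage points, "about 5-6"

ratio_low = influence_ratio(MODEL_SWING, FRAMEWORK_SWING_RANGE[1])
ratio_high = influence_ratio(MODEL_SWING, FRAMEWORK_SWING_RANGE[0])
print(f"model swing is {ratio_low:.1f}x to {ratio_high:.1f}x the framework swing")
```

Since the 18-point figure is an upper bound ("up to"), the ratio of 3.0 to 3.6 is best read as a ceiling on how much more the model matters, not a fixed constant.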
Domain‑wise Performance
GPT‑5.5 and Opus 4.7 exhibit similar domain profiles: highest scores (~60%) in computation/math and agriculture/environment; lowest scores (<30%) in visual media and education. This reflects uneven training coverage, with code‑related domains receiving far more data than specialized professional workflows.
Limitations and Future Work
The benchmark is biased toward software‑centric, digitized professions; blue‑collar and physical‑operation jobs are excluded. Tasks run on Linux or Windows VMs, and coverage across the 55 sub‑fields is uneven (e.g., energy & nuclear engineering has 4 tasks, law has 15). The public subset contains only about 13% of the full pool (150 of 1,167 tasks). The correlation between public and full‑pool scores is 0.89, suggesting reasonable but imperfect representativeness.
Planned extensions include adding new workflows and industries, rotating private tasks into the public set, and maintaining ALE as a living benchmark that continuously bridges the gap between benchmark success and real‑world economic impact.
References
https://agents-last-exam.org/
https://arxiv.org/pdf/2606.05405v1
https://github.com/rdi-berkeley/agents-last-exam
