Claude Fable 5 Scores Zero on ALE’s Hardest Tier – Insights from the Final Agent Exam

Claude Fable 5 tops the Agent Arena leaderboard but scores zero on the hardest tier of ALE, at a higher cost per task than its rivals, underscoring the challenges still facing generalist computer‑use agents.

Machine Heart

Jun 12, 2026

Anthropic’s newly released Claude Fable 5 caused a stir in the AI community. The Agent Arena benchmark ranked Claude Fable 5 (High) first, ahead of OpenAI’s GPT‑5.5 (xHigh), on overall score, confirmation success rate, and steerability.

The Agent Arena benchmark evaluates agents on millions of real‑world, long‑duration tasks. These tasks draw on web search, file system access, and terminal commands, and they span writing code, creating slides, conducting web research, building applications, and analyzing documents.

In contrast, the Agents’ Last Exam (ALE) benchmark—developed by Dawn Song’s team at UC Berkeley—measures whether agents can reliably perform economically valuable work across a wide range of real‑world domains. ALE covers 55 non‑physical professions, more than 1,500 tasks, and draws contributions from over 300 experts representing 100+ institutions.

When ALE evaluated Claude Fable 5, GPT‑5.5, Composer 2.5, and other leading agents, the results were mixed:

Overall pass rate: GPT‑5.5 leads with 24.0 % versus Fable 5’s 22.0 %.

Cost per task: Fable 5 averages $15.70, GPT‑5.5 $3.80, and Composer 2.5 $1.33, so Fable 5 costs roughly 4× as much as GPT‑5.5 and 12× as much as Composer 2.5 for comparable performance.

Hardest “Last‑Exam” tier: all frontier agents, including Fable 5, achieved a 0 % pass rate.
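The cost and pass-rate figures above can be combined into a rough cost-efficiency comparison. The sketch below uses only the numbers quoted in this article. The "cost per passed task" metric (average cost divided by pass rate) is an illustrative assumption, not a figure ALE reports, and Composer 2.5's pass rate is left unset because the article does not give it.

```python
# Figures quoted in the article. Composer 2.5's overall pass rate is not
# stated there, so it is left as None rather than guessed.
AGENTS = {
    "Claude Fable 5": {"cost_per_task": 15.70, "pass_rate": 0.220},
    "GPT-5.5": {"cost_per_task": 3.80, "pass_rate": 0.240},
    "Composer 2.5": {"cost_per_task": 1.33, "pass_rate": None},
}


def cost_ratio(agent, baseline):
    """How many times more the agent costs per task than the baseline."""
    return AGENTS[agent]["cost_per_task"] / AGENTS[baseline]["cost_per_task"]


def cost_per_pass(agent):
    """Average spend per passed task (hypothetical metric: cost / pass rate)."""
    rate = AGENTS[agent]["pass_rate"]
    if rate is None:
        return None  # cannot be computed without a pass rate
    return AGENTS[agent]["cost_per_task"] / rate


if __name__ == "__main__":
    for baseline in ("GPT-5.5", "Composer 2.5"):
        ratio = cost_ratio("Claude Fable 5", baseline)
        print(f"Fable 5 vs {baseline}: {ratio:.1f}x cost per task")
    for name in AGENTS:
        spend = cost_per_pass(name)
        label = "n/a" if spend is None else f"${spend:.2f}"
        print(f"{name}: {label} per passed task")
```

Under this assumed metric, the gap between Fable 5 and GPT‑5.5 is slightly wider than the raw cost ratio suggests, because GPT‑5.5 also has the higher pass rate.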

ALE‑CLI, a sub‑benchmark restricted to command‑line environments, covers 40 of ALE’s 55 sub‑domains, far more than Terminal‑Bench (6) or SWE‑bench‑Pro (5). Its tasks are also longer (hours to weeks) and harder: the best agent achieves only a 25.2 % pass rate on ALE‑CLI, compared with 82.0 % on Terminal‑Bench and 59.1 % on SWE‑bench‑Pro.

The authors note that no single agent dominates across all scenarios; each model has strengths and weaknesses. While average scores cluster closely, the critical insight is where agents succeed or fail and how failure patterns vary by domain.

A common failure mode is an agent declaring a task complete without verifying its work. The resulting deliverables often lack required files, miscount items, omit key fields, or violate explicit constraints.
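One way to picture the missing step is a self-check that an agent could run before reporting completion. The sketch below is purely illustrative and is not part of ALE or any agent described here: the function name, the JSON manifest layout (a list of objects), and the constraint format are all assumptions. It covers the four failure types listed above.

```python
import json
from pathlib import Path


def find_problems(workdir, required_files, manifest_name,
                  required_fields, expected_count, constraints):
    """Return a list of problems; an empty list means every check passed.

    Assumes the manifest is a JSON list of objects (hypothetical layout).
    """
    root = Path(workdir)
    problems = []

    # 1. Missing files.
    for name in required_files:
        if not (root / name).is_file():
            problems.append(f"missing file: {name}")

    manifest = root / manifest_name
    if not manifest.is_file():
        problems.append(f"missing file: {manifest_name}")
        return problems  # the remaining checks need the manifest

    records = json.loads(manifest.read_text(encoding="utf-8"))

    # 2. Miscounted items.
    if len(records) != expected_count:
        problems.append(f"expected {expected_count} records, found {len(records)}")

    # 3. Omitted key fields.
    for i, record in enumerate(records):
        missing = [field for field in required_fields if field not in record]
        if missing:
            problems.append(f"record {i} lacks fields: {', '.join(missing)}")

    # 4. Violated explicit constraints (named predicates over all records).
    for label, holds in constraints.items():
        if not holds(records):
            problems.append(f"constraint violated: {label}")

    return problems
```

An agent following this pattern would report completion only when the returned list is empty, and would otherwise treat each entry as remaining work.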

ALE’s design emphasizes three criteria for tasks: representativeness (real professional workflows using domain‑specific software), complexity (end‑to‑end deliverables requiring substantial expert effort), and verifiability (outputs that can be deterministically checked against clear scoring rubrics).
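The verifiability criterion can be illustrated from the grader's side. The sketch below shows what a deterministic rubric might look like in general: each criterion is a pure check on the output with fixed points, so the same output always earns the same score. The `Criterion` class, the point values, and the sample output are hypothetical and do not reflect ALE's actual rubric format.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    description: str
    points: int
    # Pure function of the output: no randomness and no subjective judgment.
    check: Callable[[dict], bool]


def score(output: dict, rubric: list[Criterion]) -> tuple[int, int]:
    """Return (points earned, points available) for one deliverable."""
    earned = sum(c.points for c in rubric if c.check(output))
    total = sum(c.points for c in rubric)
    return earned, total


# Hypothetical rubric for an invoice-summary deliverable.
RUBRIC = [
    Criterion("reports exactly 12 rows", 2, lambda o: o.get("rows") == 12),
    Criterion("uses USD", 1, lambda o: o.get("currency") == "USD"),
    Criterion("total matches reference within one cent", 3,
              lambda o: abs(o.get("total", 0.0) - 1043.50) < 0.01),
]

if __name__ == "__main__":
    submitted = {"rows": 12, "currency": "USD", "total": 1043.50}
    earned, total = score(submitted, RUBRIC)
    print(f"{earned}/{total} points")
```

Because every check is a plain comparison, two graders running the same rubric on the same output cannot disagree, which is the property the criterion asks for.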

Task collection follows a multi‑stage pipeline: experts submit proposals based on real projects, which then go through initial screening, engineer trial runs, and final peer review. Only tasks that pass every stage enter the benchmark, and only a small public subset (150 of 1,490 instances) is released, to limit the risk of benchmark contamination.

Technically, ALE decomposes each benchmark instance into three decoupled components that interact via well‑defined interfaces, enabling flexible evaluation of generalist computer‑use agents (GCUA) such as Claude Code or Codex.

The team hopes ALE will serve as a new reference point and “north star” for the industry, guiding the development of agents capable of reliably delivering economic value across diverse fields.
