Why Single-Agent AI Fails: Anthropic’s Multi-Agent Harness for Long-Running Tasks

The article explains that single‑agent AI collapses on long‑running tasks due to compound error probabilities, outlines four structural failure modes, and presents Anthropic’s three‑agent GAN‑style harness—Planner, Generator, Evaluator—detailing sprint contracts, primitives, token economics, and three real‑world case studies that demonstrate dramatically higher reliability and productivity.


Single‑Agent AI Structural Failure

Single-agent AI behaves like a long chain of dominoes: each step may be "reliable" in isolation, but success probabilities multiply across the chain. At 95% per-step accuracy, a 20-step task succeeds only about 36% of the time (0.95²⁰ ≈ 0.36). The problem is architectural, not a matter of model quality, and it grows more severe as the number of steps increases.
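
A few lines of Python make the compounding explicit (a minimal sketch using the article's illustrative numbers):

def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a sequential chain succeeds."""
    return per_step ** steps

# Per-step accuracy of 95% collapses quickly as the chain grows.
for steps in (5, 10, 20, 50):
    print(f"{steps:>2} steps: {chain_success(0.95, steps):.2%}")
# 5 steps: 77.38%, 10 steps: 59.87%, 20 steps: 35.85%, 50 steps: 7.69%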

Four Structural Failure Modes

1. Context Anxiety – When the context window fills, the agent truncates work, skips verification, and emits premature success signals.

2. Sycophantic Self‑Evaluation – The model cannot reliably critique its own output, leading to confident but low‑quality results.

3. Architectural Drift – Over many micro‑decisions the agent loses sight of the original goal, producing unintended features and contradictory decisions.

4. Documentation Rot – As execution proceeds, inline documentation diverges from actual code, leaving stale comments and mismatched summaries.

Evidence from SWE‑bench Pro

SWE-bench Pro measures long-range software-engineering tasks across 1,865 problems in 41 repositories. Models that exceed 70% on the short-range SWE-bench Verified started around 23% on SWE-bench Pro; newer models (Claude Opus 4.7, GPT-5.4) have lifted that to 64.3%, still far below their short-range scores.

Important background: Even as absolute scores increase, the structural gap between short‑ and long‑range performance remains; a model that scores 70%+ on short tasks still scores dramatically lower on long tasks.

Three‑Agent GAN‑Style Harness

Anthropic’s solution separates roles into a GAN‑like adversarial loop:

Planner – Defines high‑level scope and sprint contracts; runs on Opus for maximal reasoning.

Generator – Executes each sprint using Sonnet for cost‑efficiency; never evaluates its own output.

Evaluator – Acts as a strict judge, verifying claims with real-time tools (e.g., Playwright MCP); runs on Opus for deep analysis.

Adversarial Loop Diagram

[Planner]→Sprint Contract
↓
[Generator]→Implementation
↓
[Evaluator]→Structured Evaluation (pass/fail + scores)
↓
If FAIL → Generator revises based on feedback
↓
[Evaluator]→Re‑evaluation
… (5‑15 rounds) …
If PASS → Sprint complete, advance to next
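
The loop above can be written down as a compact orchestration skeleton. This is an illustrative sketch, not Anthropic's implementation; the plan, generate, and evaluate callables stand in for real model calls, and the round cap mirrors the 5-15 rounds shown in the diagram:

from dataclasses import dataclass

@dataclass
class Evaluation:
    passed: bool
    feedback: str           # structured critique fed back to the Generator
    scores: dict[str, int]  # 1-10 design-dimension scores

def run_sprint(plan, generate, evaluate, max_rounds: int = 15):
    """One adversarial sprint: Planner -> Generator <-> Evaluator."""
    contract = plan()                         # Planner emits the sprint contract
    feedback = None
    for round_no in range(1, max_rounds + 1):
        artifact = generate(contract, feedback)            # Generator never self-evaluates
        result: Evaluation = evaluate(contract, artifact)  # Evaluator judges with tools
        if result.passed:
            return artifact, round_no          # sprint complete, advance to next
        feedback = result.feedback             # FAIL: revise against the critique
    raise RuntimeError("sprint did not converge within the round budget")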

Sprint Contracts

A Sprint Contract is a JSON document created by the Planner that defines "completion" before any code is written. It contains four parts:

Feature Scope – observable behavior, not implementation details.

Verification Methods – concrete steps (e.g., Playwright test of OAuth flow).

Pass/Fail Thresholds – numeric or boolean criteria.

Edge‑Case Traps – specific failure modes to test.

Example:

{
  "sprint": 3,
  "feature_scope": "OAuth2 authentication with JWT session management",
  "verification_methods": [
    "Playwright MCP: complete GitHub OAuth flow, verify JWT returned",
    "Playwright MCP: access /api/protected endpoint with valid JWT",
    "curl: verify 401 response on /api/protected without JWT"
  ],
  "pass_fail_thresholds": {
    "all_playwright_scenarios_pass": true,
    "http_500_responses": 0,
    "jwt_issued_on_successful_login": true,
    "page_load_seconds": 2
  },
  "edge_case_traps": [
    "Expired JWT: verify 401, not 500",
    "Malformed JWT: verify 400, not unhandled exception",
    "Revoked token: verify 401 with appropriate error message"
  ]
}
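
To illustrate how an Evaluator could mechanically check the pass/fail block of such a contract, here is a small sketch; the comparison rules (booleans must match exactly, numeric thresholds are upper bounds) and the file path are assumptions, not a documented spec:

import json

def thresholds_met(contract: dict, measured: dict) -> bool:
    """Check measured results against a sprint contract's thresholds."""
    for key, expected in contract["pass_fail_thresholds"].items():
        actual = measured.get(key)
        if isinstance(expected, bool):
            if actual is not expected:             # booleans must match exactly
                return False
        elif actual is None or actual > expected:  # numbers are upper bounds
            return False
    return True

with open("sprint_3.json") as f:  # the contract above, saved locally (hypothetical path)
    contract = json.load(f)

measured = {
    "all_playwright_scenarios_pass": True,
    "http_500_responses": 0,
    "jwt_issued_on_successful_login": True,
    "page_load_seconds": 1.4,
}
print(thresholds_met(contract, measured))  # True: every criterion holds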

Design Dimensions and Scoring

Beyond binary pass/fail, four design dimensions are scored 1‑10:

Design Quality – visual hierarchy, spacing, color harmony.

Originality – creative deviation from templates.

Craft – pixel‑perfect execution and smooth interaction.

Functionality – whether all features work as specified.
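
A minimal sketch of how those four 1-10 scores might travel alongside the binary verdict (the dataclass, field names, and the bar of 7 are illustrative assumptions):

from dataclasses import dataclass

@dataclass
class DesignScores:
    design_quality: int  # visual hierarchy, spacing, color harmony
    originality: int     # creative deviation from templates
    craft: int           # pixel-perfect execution, smooth interaction
    functionality: int   # features work as specified

    def acceptable(self, bar: int = 7) -> bool:
        """True if every dimension clears the (assumed) quality bar."""
        return all(
            s >= bar
            for s in (self.design_quality, self.originality,
                      self.craft, self.functionality)
        )

DesignScores(8, 6, 9, 10).acceptable()  # False: originality falls below the bar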

Core Primitives Enabling Hours‑Long Runs

Filesystem‑as‑Memory – Persistent TODO.md and CHANGELOG.md files store state across context resets.

Context Reset vs. Compaction – Full context clearing at each sprint avoids cumulative bias; the harness supplies only the sprint contract and state files.

Managed Agents – Separate Session (persistent logs), Harness (orchestration), and Sandbox (ephemeral execution) components ensure crash‑resilience.
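
The first two primitives compose naturally: the harness checkpoints state to disk, wipes the context, and rehydrates a fresh prompt from the sprint contract plus the state files. The file names follow the article; the function names and prompt layout are assumptions:

from pathlib import Path

STATE_FILES = ("TODO.md", "CHANGELOG.md")  # persistent memory across resets

def checkpoint(workdir: Path, todo: str, changelog_entry: str) -> None:
    """Persist state to disk so it survives a full context reset."""
    (workdir / "TODO.md").write_text(todo)
    with (workdir / "CHANGELOG.md").open("a") as log:
        log.write(changelog_entry + "\n")

def rehydrate(workdir: Path, sprint_contract: str) -> str:
    """Build a fresh prompt after a reset: contract plus state files only."""
    state = "\n\n".join(
        f"## {name}\n{(workdir / name).read_text()}"
        for name in STATE_FILES
        if (workdir / name).exists()
    )
    return f"{sprint_contract}\n\n{state}"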

Proof‑of‑Concept Case Studies

1. RetroForge (2D game editor) – The single agent finished in 20 minutes and delivered 0 of 27 acceptance criteria; the three-agent harness took 6 hours, met all 27 criteria, and cost roughly $200 in tokens, a bill that bought working software where the single agent produced near-zero usable output.

2. Digital Audio Workstation (DAW) – Built in ~4 hours using Claude Opus 4.6; a sub‑agent music assistant was recursively created inside the DAW, demonstrating hierarchical harness composition.

3. Cosmology Boltzmann Solver – A JAX implementation achieved numerical agreement within 0.1% of a reference C version; the Ralph Loop self-repair mechanism and a Test Oracle compared outputs to guarantee scientific accuracy.
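
The Test Oracle in the third study reduces to a tolerance comparison. A minimal sketch, assuming both solvers emit NumPy-compatible arrays and using the article's 0.1% bound (the function and argument names are illustrative):

import numpy as np

def oracle_agrees(candidate: np.ndarray, reference: np.ndarray,
                  rel_tol: float = 1e-3) -> bool:
    """True when the candidate solver matches the reference within 0.1%."""
    return bool(np.allclose(candidate, reference, rtol=rel_tol, atol=0.0))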

Token Economics

Adversarial runs consume roughly 15× the tokens of a comparable single-agent run, but the cost is justified by the catastrophic errors it prevents. Model selection is tiered: the Planner and Evaluator run on Opus (deep reasoning), the Generator on Sonnet (cost-effective), and summarization on Haiku (lightweight).
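
The tiered selection reduces to a small routing table. The roles and model tiers echo the article; the table and function themselves are an illustrative sketch:

MODEL_TIERS = {
    "planner":    "opus",    # deep reasoning: scoping and sprint contracts
    "evaluator":  "opus",    # deep reasoning: strict, tool-backed judging
    "generator":  "sonnet",  # cost-efficient bulk implementation
    "summarizer": "haiku",   # cheap compression of logs and context
}

def pick_model(role: str) -> str:
    """Route an agent role to its cost/depth tier."""
    return MODEL_TIERS[role]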

Agentic Ops

Analogous to DevOps, Agentic Ops introduces four operational primitives:

Credential Isolation – minimal permissions per agent.

Durable Sessions – persistent dialogue history for restartability.

Automated Judges – monitor token spend, hallucinations, policy violations.

MCP‑style Tool Protocol – standardizes how agents discover and invoke tools.
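
The last primitive is the discover-then-invoke pattern. The sketch below is a minimal schematic of that shape, not the actual MCP SDK; all names and types are assumptions:

from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Tool:
    name: str
    description: str        # what an agent reads when discovering the tool
    schema: dict[str, str]  # parameter name -> type hint
    handler: Callable[..., Any]

@dataclass
class ToolRegistry:
    tools: dict[str, Tool] = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def discover(self) -> list[dict]:
        """What an agent sees: names, descriptions, schemas, never code."""
        return [
            {"name": t.name, "description": t.description, "schema": t.schema}
            for t in self.tools.values()
        ]

    def invoke(self, name: str, **kwargs: Any) -> Any:
        """Uniform invocation regardless of which agent is calling."""
        return self.tools[name].handler(**kwargs)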

Conclusion

Mathematical compounding (0.95²⁰≈0.36) shows that hoping for success is not a strategy. The future lies in adversarial, contract‑driven multi‑agent systems that make failure structurally impossible. Building such harnesses turns a few hundred dollars of compute into production‑grade software across games, audio tools, and scientific solvers.

