How Anthropic’s Harness Keeps Long‑Running AI Agents on Track
The article analyzes Anthropic’s Harness design for long‑running application development, detailing how it mitigates context anxiety and self‑evaluation bias through sprint contracts, rubric scoring, and a planner‑generator‑evaluator architecture, and evaluates its effectiveness across two harness versions (V1 and V2).
Problem Statement
Anthropic’s "Harness design for long‑running application development" addresses two concrete distortions that appear when an agent runs for many hours: context anxiety – the model tends to finish early when it perceives the context window is near its limit. self‑evaluation – the model over‑rates its own output, especially on subjective tasks such as UI design.
Architecture Overview
The system separates responsibilities into three agents:
Planner – expands a high‑level requirement into a concrete specification (deliverables only, no implementation details).
Generator – implements the specification.
Evaluator – independently reviews the generated artefacts, scores them, and produces actionable bug reports.
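A minimal sketch of how the three roles could be wired together. The `Spec`/`Review` types, the acceptance threshold, and the round budget are illustrative assumptions, not Anthropic’s implementation; the model-backed roles are left as protocols.

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class Spec:
    deliverables: list[str]              # what to build; the how is left open

@dataclass
class Review:
    score: float                         # aggregate rubric score
    bug_reports: list[str] = field(default_factory=list)

class Planner(Protocol):
    def plan(self, requirement: str) -> Spec: ...

class Generator(Protocol):
    def implement(self, spec: Spec, feedback: list[str]) -> dict: ...

class Evaluator(Protocol):
    def review(self, spec: Spec, artefacts: dict) -> Review: ...

def run_sprint(requirement: str, planner: Planner, generator: Generator,
               evaluator: Evaluator, accept_at: float = 8.0,
               max_rounds: int = 5) -> dict:
    spec = planner.plan(requirement)     # deliverables only, no implementation
    feedback: list[str] = []
    for _ in range(max_rounds):
        artefacts = generator.implement(spec, feedback)
        review = evaluator.review(spec, artefacts)   # independent, skeptical
        if review.score >= accept_at:    # external gate, not self-judgement
            return artefacts
        feedback = review.bug_reports    # targeted fixes for the next round
    raise RuntimeError("sprint did not converge within the round budget")
```

Keeping the acceptance decision outside the generator is what blunts self‑evaluation bias: the generator never gets to declare its own work done.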
Key mechanisms that mitigate the distortions:
Sprint contract – a file‑based acceptance agreement that defines completion criteria and verification steps for each sprint (a file sketch follows this list).
Rubric – a quantitative scoring sheet with four dimensions: design quality, originality, craft, and functionality. The first two dimensions receive higher weight to discourage template‑like outputs.
Context management – two strategies are distinguished: compaction, which compresses history within the same session, and context reset, which spins up a fresh agent and hands off state via structured artefacts when compression alone is insufficient (see the policy sketch below).
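To make the sprint contract and rubric concrete, here is a hypothetical contract file. Every field name, weight, and threshold is an illustrative assumption:

```python
import json

# A hypothetical sprint contract, materialised as a file rather than chat
# history; the field names and numbers are invented for illustration.
contract = {
    "sprint": "v1-tile-renderer",
    "completion_criteria": [
        "all map tiles render with correct fills",
        "every documented API route returns a valid response",
    ],
    "verification_steps": [
        "run the integration test suite",
        "load the demo scene and visually inspect the output",
    ],
    "rubric": {
        # Design quality and originality weigh more, per the harness rubric,
        # to discourage template-like outputs.
        "weights":    {"design_quality": 0.3, "originality": 0.3,
                       "craft": 0.2, "functionality": 0.2},
        "thresholds": {"design_quality": 7, "originality": 7,
                       "craft": 6, "functionality": 8},
    },
}

with open("sprint_contract.json", "w") as f:
    json.dump(contract, f, indent=2)
```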
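And a sketch of the two context‑management strategies folded into a single policy. The `Session` interface, the 80% trigger point, and the helper names are all assumptions:

```python
from typing import Protocol

class Session(Protocol):
    def token_count(self) -> int: ...
    def compact(self) -> "Session": ...
    def export_artefacts(self) -> dict: ...

def new_session(seed_artefacts: dict) -> "Session":
    ...  # spin up a fresh agent primed with the handed-off artefacts

def manage_context(session: "Session", budget_tokens: int) -> "Session":
    """Compact in place while possible; reset when compression is not enough."""
    headroom = 0.8 * budget_tokens              # illustrative trigger point
    if session.token_count() < headroom:
        return session                          # plenty of room, do nothing
    compacted = session.compact()               # strategy 1: in-session compaction
    if compacted.token_count() < headroom:
        return compacted
    # Strategy 2: compression was insufficient; hand off structured artefacts
    # (specs, sprint contracts, bug reports) to a brand-new agent.
    return new_session(compacted.export_artefacts())
```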
Design Decisions
Planner only defines *what* to build, leaving *how* to the generator, preventing cascading specification errors.
Generator and evaluator are separate models; the evaluator remains a skeptical reviewer, which is easier to calibrate than making a single model self‑criticise.
Sprint contracts are materialised as files rather than informal chat, enforcing hard thresholds for each rubric dimension.
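Materialising the contract as a file means acceptance can be a mechanical check rather than a negotiation. A minimal gate over the hypothetical contract file sketched above:

```python
import json

def accept(scores: dict[str, float],
           contract_path: str = "sprint_contract.json") -> bool:
    """Hard gate: every rubric dimension must clear the threshold recorded
    in the contract file; nothing is renegotiated in chat."""
    with open(contract_path) as f:
        rubric = json.load(f)["rubric"]
    return all(scores[dim] >= minimum
               for dim, minimum in rubric["thresholds"].items())

# A build that aces functionality but misses originality is still rejected:
# accept({"design_quality": 8, "originality": 5,
#         "craft": 7, "functionality": 9})  -> False
```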
Empirical Results
RetroForge (V1, Opus 4.5) – Full harness ran ~6 h costing ~$200 versus a single‑agent baseline of ~20 min costing $9. The evaluator produced concrete bug reports (e.g., missing tile fills, incorrect API routing), enabling targeted fixes.
DAW (V2, Opus 4.6) – Sprint structure was removed; the generator ran continuously for ~4 h (cost $124). Even with a stronger model, the evaluator still caught “last‑mile” issues such as non‑draggable timeline segments and incomplete audio controls.
When to Retire Scaffold Components
The team continually asks whether a component exists only because the current model cannot yet handle the task on its own. As model capabilities improve, scaffolding (e.g., sprint contracts) is trimmed, mirroring the mature engineering practice of keeping only load‑bearing layers.
Layered Runtime Model
Constant constraint layer – CLAUDE.md / rules enforce long‑term identity and boundaries.
Method loading layer – Skills inject domain knowledge and methods on demand.
Deterministic control layer – hooks / permission pipelines handle tasks unsuitable for model judgement.
Long‑task runtime layer – the Harness orchestrates hand‑off, correction, and acceptance.
Action risk control layer – Auto Mode / safety classifier defines permissible actions.
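One way to picture the layering is as a pipeline in which each layer blocks, enriches, or constrains a task before the Harness takes over. Everything below (layer functions, rule names, autonomy labels) is an illustrative sketch, not Claude Code’s actual configuration surface:

```python
from typing import Callable

# Each layer is a small transform or gate over a task description.
Layer = Callable[[dict], dict]

def constant_constraints(task: dict) -> dict:
    # Layer 1: CLAUDE.md-style rules enforce identity and hard boundaries.
    if task["action"] in {"delete_repo", "exfiltrate_data"}:
        raise PermissionError("blocked by the constant constraint layer")
    return task

def method_loading(task: dict) -> dict:
    # Layer 2: Skills inject domain knowledge only when the task needs it.
    skills_by_domain = {"web": ["react_patterns"], "audio": ["daw_basics"]}
    task["skills"] = skills_by_domain.get(task["domain"], [])
    return task

def deterministic_control(task: dict) -> dict:
    # Layer 3: hooks handle steps unsuited to model judgement (e.g., lint gates).
    task["pre_checks"] = ["lint", "type_check"]
    return task

def risk_control(task: dict) -> dict:
    # Layer 5: a classifier decides which actions may run without a human.
    task["autonomy"] = "auto" if task["action"] == "edit_file" else "ask_user"
    return task

def run_layers(task: dict, layers: list[Layer]) -> dict:
    for layer in layers:
        task = layer(task)
    return task  # Layer 4, the Harness, would now orchestrate the long task

print(run_layers({"action": "edit_file", "domain": "web"},
                 [constant_constraints, method_loading,
                  deterministic_control, risk_control]))
```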
The overall goal is not to stack more agents but to shrink the portion of work that must rely on the model’s self‑awareness, using structured layers, scoring rubrics, and contracts to keep long‑running AI work reliable.