How Anthropic’s Multi‑Agent Harness Keeps Claude Running for Six Hours
Anthropic’s engineering blog details a multi‑agent harness that separates generation from evaluation, addresses Claude’s context‑anxiety and self‑assessment problems, and shows through front‑end design and full‑stack app experiments that the system can run continuously for hours with higher‑quality output.
Problem Statement
Anthropic engineer Prithvi Rajasekaran observed two blockers when using Claude Sonnet 4.5 for long‑running tasks: context anxiety, where the model anticipates the context‑window limit and truncates its work, and self‑evaluation distortion, where Claude over‑rates its own output, especially on subjective tasks.
The remedy proposed is to separate the generator (the working agent) from the evaluator (the critiquing agent), borrowing the adversarial loop idea from GANs.
GAN‑Inspired Framework
The generator agent produces code or designs, while the evaluator agent scores and critiques the output, feeding back improvements. By giving the evaluator a distinct, more critical persona, the system avoids forcing the generator to self‑criticize.
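The generate‑then‑critique loop described above can be sketched in a few lines. This is an illustrative skeleton, not Anthropic’s actual harness: the agent callables, the 0–10 scoring scale, and the stopping threshold are all assumptions, with toy functions standing in for real model calls.

```python
def adversarial_loop(generate, evaluate, task, rounds=10, threshold=9.0):
    """Run a generator/evaluator loop until the evaluator's score
    clears the threshold or the round budget is exhausted.

    generate(task, feedback) -> artifact
    evaluate(artifact)       -> (score, feedback)
    """
    artifact, feedback = None, ""
    for _ in range(rounds):
        artifact = generate(task, feedback)        # generator agent works
        score, feedback = evaluate(artifact)       # evaluator critiques
        if score >= threshold:                     # good enough: stop early
            break
    return artifact

# Toy stand-ins for the two agents: the "generator" revises the artifact
# each round and the "evaluator" scores it higher as revisions accumulate.
state = {"round": 0}

def toy_generate(task, feedback):
    state["round"] += 1
    return f"{task} v{state['round']}"

def toy_evaluate(artifact):
    version = int(artifact.rsplit("v", 1)[1])
    return version * 3.0, f"make {artifact} bolder"

result = adversarial_loop(toy_generate, toy_evaluate, "landing page")
# → "landing page v3" (first version whose toy score reaches 9.0)
```

The key design choice, per the article, is that the evaluator runs under a distinct, deliberately harsher persona, so the generator is never asked to grade its own work.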
Experiment 1: Front‑End Design
Claude tends to produce safe, bland layouts. Four scoring dimensions were defined:
Design quality: cohesive visual identity
Originality: intentional creative decisions
Craftsmanship: typography, spacing, color contrast
Functionality: usability and discoverability
Higher weight was given to design quality and originality. The evaluator used Playwright MCP to interact with the generated page, taking screenshots for assessment. Each run comprised 5–15 generator–evaluator iterations, with a full generation lasting up to four hours.
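The weighted rubric might look like the sketch below. The article says only that design quality and originality were weighted more heavily; the exact weights here are invented for illustration.

```python
# Hypothetical weights -- the source states the ordering, not the numbers.
WEIGHTS = {
    "design_quality": 0.35,   # cohesive visual identity
    "originality":    0.30,   # intentional creative decisions
    "craftsmanship":  0.20,   # typography, spacing, color contrast
    "functionality":  0.15,   # usability and discoverability
}

def weighted_score(scores: dict) -> float:
    """Combine per-dimension scores (assumed 0-10) into one weighted total."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

total = weighted_score({
    "design_quality": 8, "originality": 9,
    "craftsmanship": 7, "functionality": 6,
})  # a strong-but-safe design still loses points on originality weighting
```

Tilting the weights this way is what pushes the loop away from “safe, bland layouts”: a technically clean page with low originality cannot reach a top score.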
After nine rounds the design was a conventional dark‑theme site; in round ten the generator pivoted to a 3D‑space museum experience with CSS‑perspective chessboard flooring—an aesthetic leap not seen in single‑agent runs. Prompt wording such as “the best design is museum‑grade” directly steered the visual direction.
Experiment 2: Full‑Stack App Development
The same logic was extended to a three‑agent harness:
Planner: expands a brief requirement into a detailed spec without prescribing implementation details.
Generator: implements features sprint‑by‑sprint, self‑evaluating before handing off.
Evaluator: drives Playwright MCP to click through the app, checking UI, API endpoints, and database state against sprint contracts; any failure below a hard threshold triggers a redo.
Before each sprint the Generator proposes a plan; the Evaluator reviews and approves the contract via a hand‑off file, avoiding direct dialogue.
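A file‑based hand‑off like the one described might look as follows. The filename, JSON fields, and unconditional approval are all illustrative assumptions; the point is that the two agents communicate through a shared contract artifact rather than direct dialogue.

```python
import json
import pathlib

# Hypothetical contract file shared between Generator and Evaluator.
HANDOFF = pathlib.Path("sprint_contract.json")

def propose_plan(sprint: int, features: list) -> None:
    """Generator writes its sprint plan and waits for approval."""
    HANDOFF.write_text(json.dumps({
        "sprint": sprint,
        "features": features,     # what the Generator commits to ship
        "status": "proposed",
    }))

def review_plan(pass_threshold: float = 0.9) -> dict:
    """Evaluator reads the plan, records the bar the sprint must clear,
    and approves it (a real evaluator would check it against the spec)."""
    contract = json.loads(HANDOFF.read_text())
    contract["status"] = "approved"
    contract["pass_threshold"] = pass_threshold  # failures below this force a redo
    HANDOFF.write_text(json.dumps(contract))
    return contract

propose_plan(3, ["rectangle fill tool", "frame reordering"])
contract = review_plan()
```

Keeping the negotiation in a file rather than a conversation also gives each sprint an auditable record of what was promised versus what the evaluator later found.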
Using this harness, a “2D retro game maker” (RetroForge) was built. Compared with a single‑agent run:
Single agent: 20 min, $9
Three‑agent harness: 6 h, $200
The harness version produced a playable game with moving characters, jump platforms, and AI‑generated game elements, whereas the single‑agent version failed to run core gameplay and contained broken code links.
Detailed bug logs from Sprint 3 illustrate the evaluator’s precision (code line and function names):
fillRectangle function exists but not triggered on mouseUp.
Deletion requires both selection and selectedEntityId, but only the latter was set.
The PUT /frames/reorder route incorrectly parsed “reorder” as an integer frame_id, yielding a 422 error.
Calibrating the evaluator required multiple log reviews, prompt tweaks, and iterative runs to reach a strict enough standard.
Opus 4.6 Simplifies the Harness
With Opus 4.6, context anxiety largely disappears, and built‑in self‑review of code reduces the need for explicit sprint contracts. The three‑agent system (now without sprint segmentation) built a browser‑based DAW in 3 h 50 min for $124.70, delivering a full arrangement view, mixer, and transport, plus an AI‑controlled prompt interface.
Second‑round QA still uncovered substantive issues: a stubbed recording feature, a missing clip‑drag implementation, and UI sliders lacking graphical EQ curves, all fed back for correction.
Key Engineering Takeaways
The evaluator is most valuable when the generator operates near its capability limits; stronger models may render it optional.
Each harness component encodes an assumption about model limitations; these assumptions should be revisited as models improve.
Adopt a spec‑first, implementation‑agnostic planner to avoid cascading errors downstream.
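The spec‑first, implementation‑agnostic planner can be captured as a prompt template. This wording is a sketch of the idea, not the prompt Anthropic used.

```python
# Illustrative planner prompt: ask for requirements and acceptance
# criteria, and explicitly forbid implementation choices so the
# Generator agent stays free to pick its own approach.
PLANNER_TEMPLATE = """Expand the following brief into a detailed product spec.
List features, user flows, and acceptance criteria for each feature.
Do NOT prescribe frameworks, file layouts, or data models; the
implementing agent chooses those.

Brief: {brief}
"""

def planner_prompt(brief: str) -> str:
    """Fill the template with the user's one-line brief."""
    return PLANNER_TEMPLATE.format(brief=brief)

prompt = planner_prompt("2D retro game maker")
```

Pinning the planner to *what* rather than *how* is what prevents an early implementation guess from cascading into every downstream sprint.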
Community Feedback
Comments note that while the harness validates multi‑agent workflows, its architecture remains centralized. Future work could decentralize planner, generator, and evaluator agents with open protocols for dynamic discovery.
Another observation highlights that the evaluator does more than judge—it shapes the generated artifact through prompt language, effectively making prompt phrasing part of the system’s architecture.
The blog’s greatest value lies in exposing the gritty engineering details—evaluator calibration, sprint contract design, and context‑reset strategies—rather than merely stating that multi‑agent beats single‑agent.
ShiZhen AI
Tech blogger with over 10 years of experience at leading tech firms; an AI efficiency and delivery expert focused on AI productivity. Covers tech gadgets, AI‑driven efficiency, and leisure. 🛰 szzdzhp001