Anthropic’s Agent Harness: Six‑Hour Full‑Stack Build with Multi‑Agent Design
The article analyzes Anthropic’s “Agent harness” design, showing how separating generation and evaluation into distinct agents—drawing inspiration from GANs—overcomes context‑window limits and self‑evaluation bias, enabling a three‑agent planner‑generator‑evaluator pipeline that builds a full‑stack app in six hours.
Based on Anthropic’s blog post “Harness Design for Long‑Running Application Development” (Prithvi Rajasekaran, 2026‑03‑24), this article adds explanations of the underlying concepts.
Rajasekaran pursued two goals: improving Claude’s front‑end design quality and letting Claude develop complete applications without human intervention. Both hit bottlenecks, which he solved with a multi‑agent harness.
What is a harness?
In software engineering, a test harness wraps a program, controls its execution, and captures its output. An agent harness similarly wraps AI agents, orchestrating the order in which they are called, passing context between them, and coordinating their collaboration.
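As a concrete (and entirely hypothetical) sketch, the harness is simply the code that owns the control flow around the agents. The Agent and Harness classes below are a minimal Python illustration, not Anthropic's implementation:

```python
# Minimal, hypothetical harness sketch: the harness, not the agents,
# owns the call order and decides what context each agent sees.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    name: str
    run: Callable[[str], str]  # stands in for a real model call

@dataclass
class Harness:
    agents: list[Agent]
    transcript: list[str] = field(default_factory=list)

    def step(self, task: str) -> str:
        context = task
        for agent in self.agents:
            output = agent.run(context)
            self.transcript.append(f"[{agent.name}] {output}")
            context = output  # hand this agent's output to the next one
        return context
```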
Problems with a single long‑running agent
Context anxiety : The model’s context window is limited; as code, logs, and dialogue accumulate, the window fills and the model prematurely wraps up work. Two mitigation strategies are described:
Reset : clear the entire context and start over, losing previous work.
Compaction : summarize the existing context in place, keeping key information.
For Sonnet 4.5, reset proved more reliable than compaction.
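The trade-off between the two strategies can be written down as a simple policy. The sketch below is illustrative only; the token limits and the count_tokens and summarize helpers are assumptions, not Anthropic's implementation:

```python
# Hypothetical context-management policy. The limits, the count_tokens()
# and summarize() helpers, and the keep-the-first-message rule are all
# illustrative assumptions.
MAX_TOKENS = 200_000
SOFT_LIMIT = int(MAX_TOKENS * 0.8)  # act before the window actually fills

def manage_context(messages, count_tokens, summarize, strategy="reset"):
    if count_tokens(messages) < SOFT_LIMIT:
        return messages
    if strategy == "reset":
        # Clear everything except the original task; prior work is lost.
        return messages[:1]
    # "compaction": replace the history with an in-place summary
    # that preserves key information.
    return [messages[0], summarize(messages[1:])]
```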
Self‑evaluation bias : When an agent judges its own output it is overly optimistic, similar to humans reviewing their own code. This bias is especially harmful for subjective tasks such as design quality, where no hard correctness metric exists.
Solution: Separate generation and evaluation
The core idea is borrowed from GANs (Generative Adversarial Networks), in which a generator creates content and a discriminator evaluates it. In the agent system the two roles are played by independent agents; unlike a GAN, they are not trained adversarially.
Front‑end design practice
Two agents are used:
Generator Agent : creates a front‑end page from the requirement.
Evaluator Agent : uses Playwright to interact with the generated page, scoring it on four dimensions—Design Quality, Originality, Craft, and Functionality—and providing revision suggestions.
The wording of the scoring criteria influences generation; adding a phrase like “the best design is museum‑grade” steers the generator toward that visual style. Even without explicit feedback, embedding the four criteria in the generator’s prompt improves the first‑round output over a baseline.
The Evaluator runs the page in a real browser rather than relying on static code inspection, because rendering differences can hide bugs. Each full run typically takes 5‑15 iterations and about four hours.
Iteration is non‑linear. In a museum‑website case the generator abandoned a dark theme at iteration 10 and switched to a 3D room rendered with CSS perspective. Occasionally an intermediate iteration outperforms the final version, and the generator tends to propose increasingly ambitious designs.
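Put together, the refinement process looks roughly like the loop below. The four scoring dimensions come from the article; the function signatures, score threshold, and keep-the-best bookkeeping are illustrative assumptions:

```python
# Illustrative generator-evaluator refinement loop. generate() and
# evaluate() stand in for the two agents; evaluate() is assumed to
# drive the page with Playwright and return scores plus suggestions.
DIMENSIONS = ["design_quality", "originality", "craft", "functionality"]

def refine(generate, evaluate, requirement, max_iters=15, threshold=8.5):
    page = generate(requirement, feedback=None)
    best, best_score = page, float("-inf")
    for _ in range(max_iters):
        report = evaluate(page)  # e.g. {"scores": {...}, "suggestions": [...]}
        score = sum(report["scores"][d] for d in DIMENSIONS) / len(DIMENSIONS)
        if score > best_score:   # an intermediate iteration can be the best one
            best, best_score = page, score
        if score >= threshold:
            break
        page = generate(requirement, feedback=report["suggestions"])
    return best
```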
Three‑agent harness for full‑stack development
Planner
The Planner expands a brief user request (1‑4 sentences) into detailed product specifications, defining functional boundaries, user stories, and high‑level design direction. It deliberately avoids early technical decisions (e.g., fixing PostgreSQL for data storage) to keep later implementation flexible, and it adds visual design language for the Generator.
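One way to picture the Planner's output is as a structured specification. The sketch below is a hypothetical shape for that hand-off; what matters is the deliberate absence of technology fields:

```python
# Hypothetical shape of the Planner's hand-off; every field name here
# is an assumption. Note what is missing: no database, framework, or
# other early technical decisions.
from dataclasses import dataclass

@dataclass
class ProductSpec:
    summary: str              # expanded from the 1-4 sentence user request
    user_stories: list[str]   # functional boundaries, expressed as stories
    design_language: str      # visual direction for the Generator
    non_goals: list[str]      # explicitly out of scope
```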
Generator
The Generator implements the plan using React + Vite for the front end, FastAPI for the back end, and SQLite/PostgreSQL for the database, with git for version control. Before handing off, it performs a quick self‑evaluation to filter obvious errors (e.g., service not starting, wrong port), allowing the Evaluator to focus on higher‑value issues.
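That self-evaluation can be as lightweight as a smoke check before hand-off; this sketch assumes a Vite dev server on its default port:

```python
# Illustrative pre-handoff smoke check; the Vite dev-server URL and
# timeout are assumptions. It only filters obvious failures such as
# a service that never started or is listening on the wrong port.
import urllib.request

def smoke_check(base_url: str = "http://localhost:5173") -> bool:
    try:
        with urllib.request.urlopen(base_url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False
```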
Evaluator
The Evaluator likewise uses Playwright to operate the app as a real user (clicking through the UI, invoking APIs, checking database state) and verify functional correctness. Specific bugs it uncovered include:
Rectangle‑fill tool only placing blocks at the start and end points, leaving the middle empty.
Delete‑key logic requiring two conditions simultaneously when only one is needed.
FastAPI route mistakenly matching the path segment "reorder" as an integer parameter, causing a 422 error.
These issues are not obvious from static code but become evident through real‑world interaction.
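The third bug is a known FastAPI pitfall worth spelling out: routes are matched in declaration order, so a dynamic route declared before a static one will capture it. The routes below are a hypothetical reconstruction of the bug class, not the app's actual endpoints:

```python
# Hypothetical reconstruction of the 422 bug class (the real routes are
# not given in the source). Starlette matches routes in declaration
# order, so a dynamic route declared first would capture the static
# path /items/reorder and fail to parse "reorder" as an int -> 422.
from fastapi import FastAPI

app = FastAPI()

@app.get("/items/reorder")    # fix: register the static route first
def reorder_items():
    return {"status": "reordered"}

@app.get("/items/{item_id}")  # the dynamic route now matches the rest
def get_item(item_id: int):
    return {"item_id": item_id}
```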
Sprint Contract
Early versions introduced a “Sprint Contract” where Generator and Evaluator negotiate the deliverables and acceptance criteria for each sprint, preventing mismatched expectations about completion.
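The source does not show the contract's format; as a rough guess, it might be no more than a small shared artifact that both agents must ratify before work begins:

```python
# A guess at the Sprint Contract's shape; the field names are
# assumptions, not Anthropic's format. Work starts only once both
# agents have agreed to the same deliverables and acceptance criteria.
from dataclasses import dataclass

@dataclass
class SprintContract:
    deliverables: list[str]         # what the Generator commits to ship
    acceptance_criteria: list[str]  # what the Evaluator will test against
    agreed_by_generator: bool = False
    agreed_by_evaluator: bool = False

    def ratified(self) -> bool:
        return self.agreed_by_generator and self.agreed_by_evaluator
```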
Effect comparison: Single Agent vs. Harness
Using the same task—building a retro‑style game‑making tool—with Opus 4.5, the results were:
Single Agent : 20 minutes, $9, UI usable but core entities not correctly connected.
Full Harness : 6 hours, $200, polished UI, functional game mechanics, AI‑generated sprites and levels.
The cost‑to‑quality trade‑off shows that a prototype can be cheap, but a reliable product requires substantially more investment.
Evolution with a stronger model (Opus 4.6)
When switching to Opus 4.6, many early components became redundant. Two simplifications were made:
Remove the Sprint Contract because the newer model handles long‑task execution more stably.
Change the Evaluator from per‑iteration checks to a single post‑implementation QA, reducing calls while still catching significant bugs.
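In orchestration terms, the change pulls QA out of the inner loop. The sketch below (all names assumed) mirrors the implement-then-QA rounds visible in the benchmark that follows:

```python
# Illustrative orchestration after the simplification (all names are
# assumptions): QA runs once after each full implementation pass
# rather than inside every iteration, and the Generator re-enters
# only to fix what QA surfaced.
def build_with_post_hoc_qa(plan, generate, qa, max_rounds=3):
    app = generate(plan, feedback=None)  # full first implementation
    for _ in range(max_rounds):
        issues = qa(app)                 # single QA pass per round
        if not issues:
            break
        app = generate(plan, feedback=issues)
    return app
```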
A new benchmark using a digital‑audio‑workstation (DAW) task yielded the following timings and costs (rounded):
Planner : 4.7 min, $0.46
First implementation : 2 h 7 min, $71.08
First QA : 8.8 min, $3.24
Second implementation : 1 h 2 min, $36.89
Second QA : 6.8 min, $3.09
Third implementation : 10.9 min, $5.88
Third QA : 9.6 min, $4.06
Total : 3 h 50 min, $124.70
QA discovered missing audio capture, absent drag‑and‑drop/scale controls, and a missing graphics editor; the Generator fixed these and added Claude‑driven composition. Overall time and cost decreased while quality remained reliable.
Key insights
Each harness component encodes an assumption about model capability; as models improve, those assumptions may become obsolete.
Combining task decomposition (Planner → Generator → Evaluator) with specialized agents yields a whole greater than the sum of its parts, and the effect grows with stronger models.
Regularly revisit and ablate harness components after model upgrades; structures that once patched weaknesses may now add latency and cost.
More capable base models expand the space of effective agent combinations rather than diminishing their value.
In summary, harness design is an ongoing engineering practice that must evolve alongside model advances to maintain effectiveness.
