How Anthropic’s Multi‑Agent Harness Keeps Claude Running for Six Hours

Anthropic’s engineering blog details a multi‑agent harness that separates generation from evaluation, tackles Claude’s context anxiety and self‑assessment distortion, and shows through front‑end design and full‑stack app experiments how the system can run continuously for hours with higher‑quality output.


Problem Statement

Anthropic engineer Prithvi Rajasekaran observed two blockers when using Claude Sonnet 4.5 for long‑running tasks: context anxiety, where the model anticipates the context‑window limit and truncates its work, and self‑evaluation distortion, where Claude overrates its own output, especially on subjective tasks.

The proposed remedy is to separate the generator (the working agent) from the evaluator (the critiquing agent), borrowing the adversarial‑loop idea from generative adversarial networks (GANs).

GAN‑Inspired Framework

The generator agent produces code or designs, while the evaluator agent scores and critiques the output, feeding back improvements. By giving the evaluator a distinct, more critical persona, the system avoids forcing the generator to self‑criticize.
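
A minimal sketch of such a loop against the Anthropic Messages API follows; the system prompts, model alias, and round count are assumptions for illustration, not Anthropic's published harness code:

```python
# Minimal sketch of the generator/evaluator loop (prompts, model ID, and
# round count are illustrative; the blog does not publish the harness code).
import anthropic

client = anthropic.Anthropic()

GENERATOR_SYSTEM = "You are a senior front-end engineer producing the best possible design."
EVALUATOR_SYSTEM = ("You are a harsh design critic. Score the submission and list "
                    "concrete flaws; never praise work that merely meets the bar.")

def ask(system: str, prompt: str) -> str:
    """Single turn against the Messages API."""
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=4096,
        system=system,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

def run_loop(task: str, rounds: int = 10) -> str:
    """Generator produces, evaluator critiques, generator revises."""
    artifact = ask(GENERATOR_SYSTEM, task)
    for _ in range(rounds):
        critique = ask(EVALUATOR_SYSTEM, f"Task: {task}\n\nSubmission:\n{artifact}")
        artifact = ask(GENERATOR_SYSTEM,
                       f"Task: {task}\n\nPrevious attempt:\n{artifact}\n\n"
                       f"Critique to address:\n{critique}")
    return artifact
```

Keeping the two personas in separate API calls, rather than asking one agent to wear both hats, is the core of the design.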

Experiment 1: Front‑End Design

Claude tends to produce safe, bland layouts. Four scoring dimensions were defined:

Design quality: cohesive visual identity

Originality: intentional creative decisions

Craftsmanship: typography, spacing, color contrast

Functionality: usability and discoverability

Higher weight was given to design quality and originality. The evaluator used Playwright MCP to interact with the generated page, taking screenshots for assessment. Each generation ran for 5–15 rounds, with a full run lasting up to four hours.
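
The blog does not publish the rubric's weights; a sketch of what such weighted scoring could look like (the numbers here are assumed):

```python
# Illustrative weighted rubric; the blog names the four dimensions and says
# design quality and originality carry more weight, but these values are assumed.
WEIGHTS = {
    "design_quality": 0.35,
    "originality": 0.35,
    "craftsmanship": 0.15,
    "functionality": 0.15,
}

def overall_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (0-10) into one weighted score."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

print(overall_score({"design_quality": 8, "originality": 9,
                     "craftsmanship": 6, "functionality": 7}))  # ~7.9
```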

After nine rounds the design was a conventional dark‑theme site; in round ten the generator pivoted to a 3D‑space museum experience with CSS‑perspective chessboard flooring—an aesthetic leap not seen in single‑agent runs. Prompt wording such as “the best design is museum‑grade” directly steered the visual direction.

Experiment 2: Full‑Stack App Development

The same logic was extended to a three‑agent harness:

Planner: expands a brief requirement into a detailed spec without prescribing implementation details.

Generator: implements features sprint‑by‑sprint and self‑evaluates before handing off.

Evaluator: drives Playwright MCP to click through the app, checking UI, API endpoints, and database state against sprint contracts; any score below a hard threshold triggers a redo.
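
A rough sketch of the kind of check the Evaluator runs, written against Playwright's Python API directly rather than through MCP; the URL, selector, and endpoint below are invented for illustration:

```python
# Sketch of an evaluator-style check (the harness drives Playwright via MCP;
# the selector and endpoint here are hypothetical).
from playwright.sync_api import sync_playwright

def check_sprint(base_url: str) -> dict[str, bool]:
    """Click through the running app and probe UI and API state."""
    results: dict[str, bool] = {}
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(base_url)
        # UI check: the drawing toolbar should expose a rectangle tool.
        results["rect_tool_visible"] = page.locator("#tool-rectangle").is_visible()
        # API check: the frames endpoint should respond successfully.
        resp = page.request.get(f"{base_url}/api/frames")
        results["frames_api_ok"] = resp.ok
        browser.close()
    # Any pass rate below the harness's hard threshold triggers a redo.
    return results
```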

Before each sprint the Generator proposes a plan; the Evaluator reviews and approves the contract via a hand‑off file, avoiding direct dialogue.
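
A minimal sketch of such a file‑based hand‑off, assuming a JSON contract file; the blog describes the mechanism but not its format or threshold value:

```python
# Hypothetical Generator/Evaluator hand-off via a shared file (file name,
# JSON schema, and threshold are assumptions).
import json
from pathlib import Path

HANDOFF = Path("handoff/sprint_contract.json")
PASS_THRESHOLD = 0.9  # hard threshold below which the sprint is redone

def propose_contract(sprint: int, features: list[str]) -> None:
    """Generator writes its plan for the upcoming sprint."""
    HANDOFF.parent.mkdir(parents=True, exist_ok=True)
    HANDOFF.write_text(json.dumps(
        {"sprint": sprint, "features": features, "status": "proposed"}, indent=2))

def approve_contract() -> dict:
    """Evaluator reads and approves the contract; no direct dialogue occurs."""
    contract = json.loads(HANDOFF.read_text())
    contract["status"] = "approved"
    HANDOFF.write_text(json.dumps(contract, indent=2))
    return contract

def sprint_passes(check_results: dict[str, bool]) -> bool:
    """Compare the pass rate against the hard threshold."""
    return sum(check_results.values()) / len(check_results) >= PASS_THRESHOLD
```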

Using this harness, a “2D retro game maker” (RetroForge) was built. Compared with a single‑agent run:

Single agent: 20 min, $9
Three‑agent harness: 6 h, $200

The harness version produced a playable game with moving characters, jump platforms, and AI‑generated game elements, whereas the single‑agent version failed to run core gameplay and contained broken code links.

Detailed bug logs from Sprint 3 illustrate the evaluator’s precision, down to specific lines and function names:

The fillRectangle function exists but is never triggered on mouseUp.

Deletion requires both a selection state and selectedEntityId, but only the latter was set.

The PUT /frames/reorder route was matched incorrectly, parsing "reorder" as an integer frame_id and yielding a 422 error.
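
The third bug matches a well‑known route‑ordering pitfall; a plausible reconstruction, assuming a FastAPI backend (the blog does not name the stack):

```python
# Plausible reconstruction of the third bug, assuming FastAPI (assumption).
# If the parameterized route is declared first, PUT /frames/reorder matches
# /frames/{frame_id}, "reorder" fails int validation, and the client gets a 422.
from fastapi import FastAPI

app = FastAPI()

# FIX: register the static path before the parameterized one so that
# /frames/reorder is matched literally instead of as a frame_id.
@app.put("/frames/reorder")
def reorder_frames(order: list[int]):
    return {"order": order}

@app.put("/frames/{frame_id}")
def update_frame(frame_id: int):
    return {"frame_id": frame_id}
```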

Calibrating the evaluator required multiple log reviews, prompt tweaks, and iterative runs to reach a strict enough standard.

Opus 4.6 Simplifies the Harness

With Opus 4.6, context anxiety largely disappears, and the model’s built‑in code self‑review reduces the need for explicit sprint contracts. The three‑agent system (now without sprint segmentation) built a browser‑based DAW in 3 h 50 min for $124.70, delivering a full arrangement view, mixer, and transport, plus an AI‑controlled prompt interface.

A second round of QA still uncovered substantive issues: recording was only a stub, clip dragging was unimplemented, and the EQ sliders lacked graphical curves; all were fed back for correction.

Key Engineering Takeaways

The evaluator is most valuable when the generator operates near its capability limits; stronger models may render it optional.

Each harness component encodes an assumption about model limitations; these assumptions should be revisited as models improve.

Adopt a spec‑first, implementation‑agnostic planner to avoid cascading errors downstream.
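
To make the third takeaway concrete, a hypothetical planner prompt (wording assumed; the blog does not publish Anthropic's prompts):

```python
# Hypothetical planner system prompt illustrating the spec-first,
# implementation-agnostic principle; the actual wording is not published.
PLANNER_SYSTEM = """You are a product planner.
Expand the user's brief into a detailed, testable specification:
- enumerate features, user flows, and acceptance criteria
- define API contracts at the level of inputs and outputs
Do NOT prescribe frameworks, libraries, or file layouts; those
decisions belong to the Generator."""
```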

Community Feedback

Comments note that while the harness validates multi‑agent workflows, its architecture remains centralized. Future work could decentralize planner, generator, and evaluator agents with open protocols for dynamic discovery.

Another observation highlights that the evaluator does more than judge—it shapes the generated artifact through prompt language, effectively making prompt phrasing part of the system’s architecture.

The blog’s greatest value lies in exposing the gritty engineering details—evaluator calibration, sprint contract design, and context‑reset strategies—rather than merely stating that multi‑agent beats single‑agent.
