From Code Writing to Continuous Development: Anthropic’s Long‑Running Agent Harness Design
Anthropic’s article dissects a three‑role harness—planner, generator, evaluator—for building long‑running AI applications, explaining how structured specs, sprint contracts, iterative evaluation, and context management transform a single model into a reliable software‑engineering pipeline, with concrete front‑end and full‑stack case studies.
1. Core Question
The article asks: how should a harness be designed so that Claude can autonomously build a complete application over several hours and outperform a single agent's output? It identifies two typical failure modes on long tasks: losing focus and terminating prematurely, and being overly forgiving when judging its own output.
2. From Front‑End Design to Full‑Stack Development
Anthropic first experiments with front-end design, defining four rating dimensions (Design quality, Originality, Craft, Functionality) and showing that Claude already scores well on Craft and Functionality but lags on Design quality and Originality. An evaluator is equipped with a Playwright-based test harness to open pages, navigate, screenshot, and inspect layout.
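The article doesn't publish the harness code, but an inspection pass of this kind is straightforward to sketch with Playwright's Python sync API. In the sketch below, the URL and the "main > *" selector are illustrative assumptions, not Anthropic's implementation:

```python
# Minimal sketch of an evaluator inspection pass: open the page,
# screenshot it, and pull layout geometry for the critique.
# The URL and the "main > *" selector are illustrative assumptions.
from playwright.sync_api import sync_playwright

def inspect_page(url: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path="evaluator_snapshot.png", full_page=True)
        # Collect bounding boxes of top-level sections so the evaluator
        # can reason about spacing, alignment, and wasted screen area.
        boxes = page.eval_on_selector_all(
            "main > *",
            "els => els.map(e => e.getBoundingClientRect().toJSON())",
        )
        title = page.title()
        browser.close()
    return {"title": title, "section_boxes": boxes}

if __name__ == "__main__":
    print(inspect_page("http://localhost:5173"))  # Vite's default dev port
```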
3. Harness Architecture
The final harness adopts a three‑role architecture:
Planner: expands a short user request (1-4 sentences) into a full product specification, defining scope, features, user value, and high-level technical direction without committing to low-level implementation details.
Generator: implements the spec in incremental sprints, each producing a subset of features. The generator uses a typical full-stack toolchain (React, Vite, FastAPI, SQLite/PostgreSQL) and commits code to a Git repository.
Evaluator: acts as a QA engineer, executing real interactions (clicks, API calls, database checks) and scoring against explicit criteria. Failure of any hard threshold aborts the sprint and forces regeneration.
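In code terms, the split can be pictured as three narrow interfaces passing structured artifacts between them. The sketch below is an assumption about shape only; none of the class or field names come from Anthropic:

```python
# Sketch of the three-role split as plain interfaces.
# All names are illustrative, not Anthropic's schema.
from dataclasses import dataclass

@dataclass
class Spec:
    scope: str
    features: list[str]
    user_value: str
    tech_direction: str  # high-level only; no low-level implementation detail

@dataclass
class SprintResult:
    commit_sha: str               # the generator commits each sprint to Git
    features_delivered: list[str]

@dataclass
class QAReport:
    passed: bool         # any hard-threshold failure aborts the sprint
    findings: list[str]  # concrete, reproducible bug reports

class Planner:
    def expand(self, request: str) -> Spec: ...

class Generator:
    def run_sprint(self, spec: Spec, feedback: QAReport | None) -> SprintResult: ...

class Evaluator:
    def verify(self, result: SprintResult) -> QAReport: ...
```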
3.1 Sprint Contract
Before each sprint the generator and evaluator negotiate a contract that specifies the exact deliverables, acceptance criteria, verification method, and required test coverage. This contract bridges the high‑level spec and low‑level verification.
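As a concrete, entirely illustrative example, a contract for one sprint of the DAW case study below might carry fields like these (the names and criteria are assumptions, not Anthropic's exact schema):

```python
# Illustrative sprint contract; field names and criteria are assumptions
# made for the sake of example, not Anthropic's exact schema.
sprint_contract = {
    "sprint": 3,
    "deliverables": ["drag-and-drop clip placement on the timeline"],
    "acceptance_criteria": [
        "a clip dropped on track 2 persists after page reload",
        "overlapping clips are rejected with a visible error",
    ],
    "verification": [
        "Playwright: simulate the drag, reload, assert the clip position",
        "API: GET /api/clips returns the newly placed clip",
        "DB: the clips table contains one row per placed clip",
    ],
    "required_test_coverage": "one automated check per acceptance criterion",
}
```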
3.2 Generator Workflow
Generate HTML/CSS/JS for the requested feature.
Run a self‑check.
Hand the build off to the evaluator for QA.
Receive a detailed critique back from the evaluator.
Decide whether to continue in the current direction or switch styles.
Sprints typically run 5‑15 rounds and can last up to four hours.
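Condensed into a loop, the workflow looks roughly like this. The round and time limits come from the article; every function name is a placeholder standing in for a model call:

```python
import time

MAX_ROUNDS = 15                # sprints typically run 5-15 rounds
DEADLINE_SECS = 4 * 60 * 60    # and can last up to four hours

def run_sprint(feature, generator, evaluator):
    start = time.monotonic()
    direction = "initial"
    for _ in range(MAX_ROUNDS):
        build = generator.generate(feature, direction)  # HTML/CSS/JS
        if not generator.self_check(build):
            continue                         # fix before handing off to QA
        critique = evaluator.review(build)   # detailed critique comes back
        if critique.passed:
            return build
        # Decide: keep refining the current direction, or switch styles.
        direction = generator.choose_direction(critique)
        if time.monotonic() - start > DEADLINE_SECS:
            break
    return None  # sprint did not converge within budget
```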
4. Case Study 1: 2D Retro Game Maker
A single-agent approach took 20 minutes and cost ≈ $9, but produced a non-functional prototype with wasted layout space, an unclear workflow, and a broken play mode. The full harness took six hours and ≈ $200, producing a far richer product: 16 features, a proper UI, sprite animation, behavior templates, audio, and export links. Evaluator-generated bug reports were concrete enough for developers to act on directly.
5. Case Study 2: Browser‑Based DAW
The harness built a digital audio workstation in the browser over 3 h 50 min at a cost of $124.70. While the overall product looked complete, the evaluator identified critical "last-mile" bugs such as non-draggable clips, missing interactive panels, and stubbed-out recording. These issues highlight the evaluator's role in surfacing real-world usability problems that a naïve generator would miss.
6. Context Management
Two strategies are discussed:
Compaction: summarize the history and continue with the same agent, preserving continuity but risking lingering bias and "context anxiety".
Context reset: hand off to a fresh agent with a structured artifact, providing a clean slate at the cost of more complex orchestration.
Early Anthropic models (Sonnet 4.5) required resets; later models (Opus 4.5/4.6) became stable enough to rely on automatic compaction.
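The trade-off is easiest to see side by side. In the sketch below, the Agent class and summarize() are placeholder stand-ins for real model calls, not Anthropic's implementation:

```python
# Sketch of the two context strategies; everything here is a stand-in.

class Agent:
    def __init__(self, seed_context: str = ""):
        self.history: list[str] = [seed_context] if seed_context else []

    def write_handoff(self) -> str:
        # Structured artifact: spec, progress so far, open issues.
        return "\n".join(self.history)

def summarize(history: list[str]) -> str:
    # Placeholder for a model call that compresses the transcript.
    return "SUMMARY: " + " | ".join(line[:60] for line in history[-10:])

def compact(agent: Agent) -> Agent:
    """Compaction: the same agent continues on a summary of its history.
    Continuity is preserved, but a biased summary keeps the bias alive."""
    agent.history = [summarize(agent.history)]
    return agent

def reset(old_agent: Agent) -> Agent:
    """Context reset: a fresh agent starts from a structured artifact.
    Clean slate, at the cost of orchestrating the handoff."""
    return Agent(seed_context=old_agent.write_handoff())
```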
7. When New Models Arrive
Each harness component encodes an assumption about model limitations (need for sprinting, resets, strict evaluation, or expanded specs). After a model upgrade, teams should audit which components remain load‑bearing and remove unnecessary scaffolding to avoid wasted cost.
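One way to keep that audit cheap is to hold each piece of scaffolding behind a per-model toggle, so removing it becomes a config diff rather than a rewrite. The flag names below are illustrative; only the model names come from the article:

```python
# Illustrative scaffolding toggles keyed by model generation. Each flag
# encodes one assumed model limitation; an upgrade audit asks which
# flags can now be turned off.
HARNESS_FLAGS = {
    "sonnet-4.5": {"sprints": True, "context_resets": True,
                   "strict_eval": True, "spec_expansion": True},
    "opus-4.5":   {"sprints": True, "context_resets": False,  # compaction suffices
                   "strict_eval": True, "spec_expansion": True},
}
```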
8. Practical Takeaways
Long-running tasks benefit more from workflow structure than from longer prompts.
Separate generation and evaluation; a strict evaluator is a powerful performance lever.
Make evaluation criteria explicit (completeness, usability, design, code quality).
The evaluator should operate in a real environment (browser automation, API checks, DB verification); see the sketch after this list.
Use structured artifacts (spec, contract, bug report) to manage state across long runs.
Bridge high‑level planning and testable acceptance with a sprint contract.
QA agents need dedicated calibration via log replay and prompt iteration.
Complex scaffolding is only valuable while it addresses current model boundaries; prune it as models improve.
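For the real-environment point, here is what a single evaluator check can look like when it exercises the live API and then verifies the database directly. The endpoint, payload, and schema are assumptions for a FastAPI + SQLite app, not details from the article:

```python
# Sketch of an evaluator check against the running stack: POST through
# the real API, then confirm the write landed in the database. The
# /api/projects endpoint and the projects table are illustrative.
import json
import sqlite3
import urllib.request

def check_create_project(base_url: str, db_path: str) -> list[str]:
    findings = []
    body = json.dumps({"name": "qa-project"}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/projects", data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        if resp.status != 201:
            findings.append(f"expected 201 Created, got {resp.status}")
        created = json.load(resp)
    # Trust nothing the API says: verify the row actually exists.
    with sqlite3.connect(db_path) as db:
        row = db.execute(
            "SELECT name FROM projects WHERE id = ?", (created["id"],)
        ).fetchone()
    if row is None:
        findings.append("API reported success but no row exists in projects")
    return findings  # an empty list means the check passed
```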
9. Minimal Viable Harness for Your Project
Start with three agents and three artifacts:
spec.md – generated by the planner from a short user request.
acceptance.md – defines the contract for each sprint.
qa_report.md – the evaluator's findings.
Workflow:
User provides a brief requirement.
Planner expands it to spec.md.
Generator implements the spec.
Evaluator checks the implementation against acceptance.md and writes qa_report.md.
If criteria are unmet, the report is fed back to the generator; repeat until the sprint succeeds or budget is exhausted.
This lightweight loop can be scaled up with additional roles, sprint contracts, or context‑reset mechanisms as needed.
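Wired together, the basic loop fits in a short function. In this sketch the planner, generator, and evaluator are passed in as plain callables standing in for model calls, and max_rounds is a stand-in for a real cost budget:

```python
# Sketch of the minimal harness loop. All callables are stand-ins for
# model calls; the artifact file names follow the list above.
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional

@dataclass
class Report:
    passed: bool
    text: str

def run_harness(
    request: str,
    workdir: Path,
    plan: Callable[[str], str],                  # planner: request -> spec
    contracts_for: Callable[[str], list[str]],   # split spec into sprint contracts
    build: Callable[[str, str, Optional[Report]], None],  # generator
    verify: Callable[[str], Report],             # evaluator
    max_rounds: int = 10,                        # stand-in for a cost budget
) -> bool:
    spec = plan(request)
    (workdir / "spec.md").write_text(spec)
    for contract in contracts_for(spec):
        (workdir / "acceptance.md").write_text(contract)
        report: Optional[Report] = None
        for _ in range(max_rounds):
            build(spec, contract, report)   # generator sees the last QA report
            report = verify(contract)
            (workdir / "qa_report.md").write_text(report.text)
            if report.passed:
                break
        else:
            return False  # budget exhausted before criteria were met
    return True
```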
10. Conclusion
Anthropic’s harness does not offer a one‑size‑fits‑all template; it provides a methodology that treats AI agents like traditional software‑engineering teams—defining requirements, breaking work into sprints, rigorously testing, and iterating. As models become more capable, the competitive edge will shift from raw model power to the quality of the harness that orchestrates sustained, reliable output.
