How Anthropic’s Dual‑Agent Harness Overcomes Long‑Context Coding Limits

Anthropic’s harness engineering introduces a dual‑agent architecture, JSON‑based feature anchors, strict test contracts, incremental git commits, browser‑automation validation, and a token‑efficient startup script to prevent context‑window overflow and premature completion in long‑running AI‑driven coding tasks.

AI Waka

Context‑Window Boundary: The Core Challenge

Long‑running coding tasks exhaust the model’s context window (typically 128K‑200K tokens) because every file read, command output, error message, and dialogue turn consumes tokens; after a couple of hours, essential information is pushed out and the agent loses track of its own work.

Two Dominant Failure Modes

Mode 1 – Attempting to Build Everything at Once: An ambitious agent tries to implement all ten requirements from a spec in one go, writing hundreds of lines across many files before any test runs. When bugs appear, the agent cannot pinpoint their source, leading to a combinatorial debugging explosion and exhausting 80% of the token budget on interdependent errors.

Mode 2 – Premature Completion: The agent follows only the happy path, skips edge cases, and declares success after passing its own tests. Real‑world inputs then cause crashes because the agent never addressed failure scenarios, despite appearing to have completed the task.

Dual‑Agent Architecture

Anthropic separates responsibilities into an Initializer and a Coding Agent. The Initializer runs once per project, reads the specification, analyses the existing codebase, and produces a detailed JSON feature list with test definitions and skeleton files. Its context is broad but shallow, and its output is discarded after use.

The Coding Agent consumes the JSON list, works incrementally on one feature at a time, and maintains a narrow, deep context focused on the current file, test, and code changes. This separation prevents any single agent from having to retain both high‑level plans and low‑level implementation details.
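The split can be sketched in a few lines. This is an illustrative model of the handoff, not Anthropic’s actual harness code: the Initializer emits the feature list once, and the Coding Agent’s view is deliberately limited to the first unfinished feature.

```python
def initializer_output(spec_titles):
    """Illustrative Initializer: turn a spec into a feature list, once per project."""
    return {
        "features": [
            {"id": f"F{i + 1}", "title": t, "status": "pending", "tests": [], "files": []}
            for i, t in enumerate(spec_titles)
        ]
    }


def coding_agent_view(features):
    """Illustrative Coding Agent context: only the first non-complete feature."""
    for f in features["features"]:
        if f["status"] != "complete":
            return f  # narrow, deep context: one feature at a time
    return None


plan = initializer_output(["User authentication endpoint", "User profile CRUD"])
current = coding_agent_view(plan)
```

The Initializer’s broad context (the full spec) never needs to survive past this point; only the JSON artifact does.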

Feature List as a Cognitive Anchor

The JSON feature list looks like:

{
  "features": [
    {
      "id": "F1",
      "title": "User authentication endpoint",
      "status": "complete",
      "requirements": [
        "POST /auth/login accepts email and password",
        "Returns JWT token on success",
        "Returns 401 with error message on failure",
        "Rate limits to 5 attempts per minute per IP"
      ],
      "tests": ["test_auth_login_success", "test_auth_login_failure", "test_auth_rate_limit"],
      "files": ["src/auth/routes.py", "src/auth/middleware.py", "tests/test_auth.py"]
    },
    {
      "id": "F2",
      "title": "User profile CRUD",
      "status": "in_progress",
      "requirements": [
        "GET /profile returns current user data",
        "PUT /profile updates user fields",
        "DELETE /profile soft‑deletes the account"
      ],
      "tests": ["test_profile_get", "test_profile_update", "test_profile_delete"],
      "files": ["src/profile/routes.py", "tests/test_profile.py"]
    }
  ]
}

This structure tells the Coding Agent exactly which features are done, which are pending, the required tests, and the files involved, eliminating ambiguity.
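A harness built around this anchor can gate status changes on the test names listed in each entry. The helper below is a minimal sketch of that idea (the function name and the trimmed JSON are illustrative, not Anthropic’s code):

```python
import json


def mark_complete(doc, feature_id, passed_tests):
    """Flip a feature's status only when every test named in its anchor entry passed."""
    for f in doc["features"]:
        if f["id"] == feature_id and set(f["tests"]) <= set(passed_tests):
            f["status"] = "complete"
    return doc


doc = json.loads("""{"features": [
  {"id": "F1", "status": "complete", "tests": ["test_auth_login_success"]},
  {"id": "F2", "status": "in_progress", "tests": ["test_profile_get", "test_profile_update"]}
]}""")

# Passing only one of F2's two tests must not mark it complete.
mark_complete(doc, "F2", ["test_profile_get"])
```

Because the status field is derived from test results rather than the agent’s own judgment, the anchor stays trustworthy across sessions.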

Immutable Tests

Tests are authored during the initialization phase and are prohibited from being edited by the Coding Agent. Because the agent cannot modify its own success criteria, it must satisfy the full set of edge‑case tests before a feature can be marked complete, directly countering the premature‑completion failure mode.

Incremental Progress via Commits

At the end of each coding session the agent makes a small, test‑passing git commit. This provides:

Micro‑incremental changes: The agent targets work that can be verified within a single session.

Recoverable state: Commands like git log and git diff reveal what was accomplished and what remains.

Rollback safety: If a session goes awry, git reset restores the last good commit.

Progress visibility: Humans can review commit history to monitor the agent’s trajectory.

Browser Automation for End‑to‑End Validation

For web projects the agent integrates Playwright (or similar) to run UI‑level tests defined in the feature list. This catches integration bugs that unit tests miss, such as broken button actions or failed form submissions, providing ground‑truth feedback that pure code review cannot.
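A UI-level check of the F1 login feature might look like the following sketch using Playwright’s sync API. The URL, selectors, and credentials are placeholders for whatever the feature list specifies, and the import is deferred so the module loads even where Playwright is not installed:

```python
def check_login_flow(base_url="http://localhost:8000"):
    """UI-level validation of the login feature; selectors here are placeholders."""
    from playwright.sync_api import sync_playwright  # lazy import: optional dependency

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(f"{base_url}/login")
        page.fill("#email", "user@example.com")
        page.fill("#password", "secret")
        page.click("button[type=submit]")
        # The flow passes only if the post-login view actually renders.
        ok = page.wait_for_selector(".dashboard", timeout=5000) is not None
        browser.close()
        return ok
```

A check like this fails when the button’s handler is broken even though every unit test of the handler’s logic passes.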

Standardized Startup Sequence

Each new session begins with a deterministic script that consumes only 5–10% of the token budget (e.g., reading the JSON list, running git log --oneline -10, executing the test suite, opening the next unfinished file, and checking for TODO/FIXME markers). Compared to a naïve startup that can waste 20–30% of the window, this saves 15–20% of tokens for actual coding work.
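Expressed as code, the startup sequence is just an ordered, fixed list of cheap commands. The list below mirrors the steps named above; the file path argument is illustrative:

```python
import subprocess


def startup_commands(next_file="src/profile/routes.py"):
    """Deterministic session-start sequence; same order every session."""
    return [
        ["cat", "features.json"],               # reload the feature anchor
        ["git", "log", "--oneline", "-10"],     # recover recent progress
        ["pytest", "-q"],                       # establish current test status
        ["cat", next_file],                     # open the next unfinished file
        ["grep", "-rn", "TODO\\|FIXME", "src/"],  # surface leftover markers
    ]


def run_startup(commands):
    """Run each step, capturing output so it can be summarized into context."""
    return {
        " ".join(c): subprocess.run(c, capture_output=True, text=True).stdout
        for c in commands
    }
```

Because the sequence is fixed, its token cost is predictable, which is what makes the 5–10% budget claim possible.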

Full Harness Integration

The six patterns—dual‑agent split, JSON feature anchor, immutable tests, commit‑driven increments, browser automation, and scripted startup—reinforce each other. Removing any one degrades the system: without the feature list the agent loses its anchor; without commit checkpoints the session cannot be safely resumed; without immutable tests the premature‑completion problem resurfaces.

Practical Recommendations

Developers can adopt these practices without using Anthropic’s stack: separate planning and execution phases, encode tasks as JSON, lock tests before coding, enforce commit checkpoints via pre‑completion hooks, script deterministic startups, and add browser‑level tests for web workloads.

Future Outlook

The next article in the series will examine how LangChain applies similar principles to intra‑session effectiveness, showing that while Anthropic solves session‑boundary constraints, LangChain tackles the quality of work within each session, together defining the state‑of‑the‑art for coding agents.
