How Effective Harnesses Keep Long‑Running AI Agents Productive
The article analyzes why AI agents lose progress across discrete context windows, identifies two failure patterns, and presents a dual‑harness solution—an initialization agent and a coding agent—that uses init scripts, progress files, and Git to enable incremental, test‑driven development over hours or days.
Long‑Running Agent Challenges
AI agents that are asked to perform multi‑hour or multi‑day tasks lose state between discrete sessions because each new session starts without memory of prior work. Limited context windows prevent a single session from completing a full project, so agents need a mechanism to bridge the gap between coding sessions.
Dual‑Solution Harness
The solution consists of two agents built on the Claude Agent SDK:
Initialization agent runs once to set up the environment. It creates an init.sh script, a claude-progress.txt file that records work, and an initial Git commit that adds the repository files.
Coding agent runs in every subsequent session. It makes incremental progress on a single feature, commits the changes with a descriptive message, and updates the progress file so the next session can continue cleanly.
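The division of labor between the two agents can be sketched as a simple control loop. The `run_agent` function below is a hypothetical stand-in for an actual Claude Agent SDK invocation; here it only records the prompt so the flow is inspectable.

```python
import os

def run_agent(prompt: str, log: list) -> None:
    # Placeholder for a real Claude Agent SDK call; here it just records
    # the prompt so the control flow can be inspected. (Hypothetical.)
    log.append(prompt)

def harness(project_dir: str, sessions: int = 3) -> list:
    log: list = []
    # First session only: the initialization agent runs if no scaffold exists.
    if not os.path.exists(os.path.join(project_dir, "claude-progress.txt")):
        run_agent("Initialize: create init.sh, feature_list.json, "
                  "claude-progress.txt, and an initial git commit.", log)
    # Every later session: the coding agent makes incremental progress.
    for _ in range(sessions):
        run_agent("Resume: read claude-progress.txt and the git log, pick one "
                  "unfinished feature, implement and test it, then commit and "
                  "update the progress file.", log)
    return log
```

The key design choice is that the initialization prompt runs exactly once, gated on the presence of the scaffold, while every subsequent session receives the same resumable prompt.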
Environment Management
The updated Claude 4 prompt guide recommends using a distinct first‑session prompt to initialize the environment with everything the coding agent will need. Key components are:
A comprehensive feature‑list JSON file that enumerates all required functionality.
An init.sh script that can start the development server.
A claude-progress.txt file that the coding agent updates after each iteration.
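As a sketch of what the initialization agent produces, the scaffold could be laid down as below. File names follow the article; the init.sh contents and the sample feature are illustrative, not taken from the original harness.

```python
import json
import os
import stat

def write_scaffold(project_dir: str) -> None:
    os.makedirs(project_dir, exist_ok=True)

    # init.sh: one known-good way to start the dev server (contents illustrative).
    init_sh = os.path.join(project_dir, "init.sh")
    with open(init_sh, "w") as f:
        f.write("#!/bin/sh\nnpm install\nnpm run dev &\n")
    os.chmod(init_sh, os.stat(init_sh).st_mode | stat.S_IEXEC)

    # feature_list.json: every feature starts with "passes": false.
    features = [{
        "category": "functional",
        "description": "New chat button creates a fresh conversation",
        "steps": ["Navigate to main interface",
                  "Click the 'New Chat' button",
                  "Verify conversation appears in sidebar"],
        "passes": False,
    }]
    with open(os.path.join(project_dir, "feature_list.json"), "w") as f:
        json.dump(features, f, indent=2)

    # claude-progress.txt: free-form notes for the next session to read.
    with open(os.path.join(project_dir, "claude-progress.txt"), "w") as f:
        f.write("Session 1: environment initialized; no features done yet.\n")
```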
Feature List
To prevent the agent from trying to complete the whole application at once, the initialization agent writes a detailed JSON file. In the claude.ai clone example the list contains over 200 entries such as “user can open a new chat, type a query, press Enter and see an AI response”. All entries start with "passes": false so later coding agents have a clear outline of the expected final state.
```json
{
  "category": "functional",
  "description": "New chat button creates a fresh conversation",
  "steps": [
    "Navigate to main interface",
    "Click the 'New Chat' button",
    "Verify conversation appears in sidebar"
  ],
  "passes": false
}
```

The coding agent is instructed to edit only the "passes" field. Strong wording (e.g., “deleting or editing tests is unacceptable”) is used to keep the model from corrupting the JSON. Experiments showed JSON to be more robust than Markdown for this purpose.
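The "edit only the passes field" rule can also be enforced mechanically. The guard below is an illustrative addition, not part of the original harness: it accepts a new version of the feature list only if every field other than "passes" is unchanged.

```python
import json

def only_passes_changed(before: str, after: str) -> bool:
    # Compare two versions of the feature list; return True only if every
    # field except "passes" is identical. A harness could run this check
    # before accepting the coding agent's edit. (Illustrative guard.)
    old, new = json.loads(before), json.loads(after)
    if len(old) != len(new):
        return False
    for a, b in zip(old, new):
        a, b = dict(a), dict(b)
        a.pop("passes", None)
        b.pop("passes", None)
        if a != b:
            return False
    return True
```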
Incremental Progress
With the scaffold in place, each coding‑agent iteration is asked to handle a single feature. After making code changes the agent commits to Git with a descriptive message and writes a summary to the progress file. This enables the model to recover from bad changes via Git and keeps the repository in a clean, merge‑ready state.
These practices also reduce wasted tokens because the agent no longer has to guess the current state of the project.
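The end-of-iteration step described above can be sketched as a small helper, assuming plain `git` CLI calls (the function name and progress-line format are illustrative):

```python
import subprocess
from datetime import date

def finish_iteration(repo_dir: str, summary: str) -> None:
    # Append a one-line summary to the progress file, then commit everything
    # with a descriptive message so the next session can recover state from
    # the git log. (Sketch; names and format are assumptions.)
    with open(f"{repo_dir}/claude-progress.txt", "a") as f:
        f.write(f"{date.today()}: {summary}\n")
    subprocess.run(["git", "-C", repo_dir, "add", "-A"], check=True)
    subprocess.run(["git", "-C", repo_dir, "commit", "-m", summary], check=True)
```

Because every iteration ends in a commit, a bad change can be rolled back with an ordinary `git revert` instead of manual repair.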
Testing
A common failure mode is the agent marking a feature as complete without proper testing. When no explicit test instruction is given, Claude may run curl commands that miss end‑to‑end failures. By explicitly prompting the agent to use a browser‑automation tool (Puppeteer MCP) and run the same tests a human would, end‑to‑end verification improves dramatically.
Providing such testing tools lets Claude recognize and fix bugs that are not obvious from the code alone. Limitations remain, such as Claude’s inability to see native browser modal alerts, which can cause missed bugs.
Quick Start
Run pwd to see the working directory; only files in that directory may be edited.
Read the Git log and claude-progress.txt to understand recent work.
Open the feature‑list JSON and pick the highest‑priority unfinished feature.
Before implementing a new feature, run the init.sh script to start the development server and execute a basic end‑to‑end test. In the claude.ai clone this means launching the local server, using Puppeteer MCP to start a new chat, send a message, and verify the response, ensuring the app is not left in a broken state.
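The steps above can be sketched as one session-start routine. File names follow the article; the server start and browser test are stubbed out, and the helper name itself is an assumption.

```python
import json
import subprocess

def start_session(project_dir: str):
    # 1. Confirm the working directory; only files here may be edited.
    cwd = subprocess.run(["pwd"], cwd=project_dir, capture_output=True,
                         text=True).stdout.strip()
    # 2. Recover context from the git log and the progress file.
    git_log = subprocess.run(
        ["git", "-C", project_dir, "log", "--oneline", "-20"],
        capture_output=True, text=True).stdout
    with open(f"{project_dir}/claude-progress.txt") as f:
        notes = f.read()
    # 3. Pick the first unfinished feature from the list.
    with open(f"{project_dir}/feature_list.json") as f:
        features = json.load(f)
    next_feature = next(
        (item for item in features if not item["passes"]), None)
    # 4. Start the dev server and run a basic end-to-end smoke test before
    #    touching the new feature (stubbed here; the article uses init.sh
    #    plus Puppeteer MCP for the browser check):
    # subprocess.run(["sh", "init.sh"], cwd=project_dir, check=True)
    return next_feature
```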
Agent Failure Modes and Solutions
Problem: Claude declares the whole project finished too early.
Initialization agent behavior: Creates a structured JSON feature list based on the specification.
Coding agent behavior: Reads the feature list at session start and selects a single feature to work on.
Problem: Environment ends up with bugs or unrecorded progress.
Initialization agent behavior: Writes the initial Git repository and progress‑note file.
Coding agent behavior: Begins the session by reading the progress note and Git log, runs basic tests on the dev server, then writes a Git commit and progress update before ending.
Problem: Claude marks a feature as passed without verification.
Initialization agent behavior: Sets up the feature list.
Coding agent behavior: Self‑validates each feature and sets "passes": true only after careful testing.
Problem: Claude spends time figuring out how to run the app.
Initialization agent behavior: Provides an init.sh script that can start the development server.
Coding agent behavior: Executes init.sh at session start, then proceeds with feature work.
Future Work
The study demonstrates a viable set of solutions for long‑running agent harnesses, but several open questions remain. It is unclear whether a single general‑purpose coding agent performs best across contexts or whether a multi‑agent architecture (e.g., dedicated testing, QA, or cleanup agents) yields better results. Extending these findings beyond full‑stack web development—to scientific research, financial modeling, or other long‑duration AI tasks—is a promising direction.
[Assistant] I'll start by getting my bearings and understanding the current state of the project.
[Tool Use] <bash - pwd>
[Tool Use] <read - claude-progress.txt>
[Tool Use] <read - feature_list.json>
[Assistant] Let me check the git log to see recent work.
[Tool Use] <bash - git log --oneline -20>
[Assistant] Now let me check if there's an init.sh script to restart the servers.
<Starts the development server>
[Assistant] Excellent! Now let me navigate to the application and verify that some fundamental features are still working.
<Tests basic functionality>
[Assistant] Based on my verification testing, I can see that the fundamental functionality is working well. The core chat features, theme switching, conversation loading, and error handling are all functioning correctly. Now let me review the tests.json file more comprehensively to understand what needs to be implemented next.
<Starts work on a new feature>