How Effective Harnesses Keep Long‑Running AI Agents Productive
The article analyzes why AI agents lose progress across discrete context windows, identifies two failure patterns, and presents a dual‑harness solution—an initialization agent and a coding agent—that uses init scripts, progress files, and Git to enable incremental, test‑driven development over hours or days.
Long‑Running Agent Challenges
AI agents that are asked to perform multi‑hour or multi‑day tasks lose state between discrete sessions because each new session starts without memory of prior work. Limited context windows prevent a single session from completing a full project, so agents need a mechanism to bridge the gap between coding sessions.
Dual‑Solution Harness
The solution consists of two agents built on the Claude Agent SDK:
Initialization agent runs once to set up the environment. It creates an init.sh script, a claude-progress.txt file that records work, and an initial Git commit that adds the repository files.
Coding agent runs in every subsequent session. It makes incremental progress on a single feature, commits the changes with a descriptive message, and updates the progress file so the next session can continue cleanly.
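The division of labor between the two agents can be sketched as a simple control loop. The `run_agent` function below is a hypothetical stand-in for an actual Claude Agent SDK invocation; here it only records the prompt so the flow is inspectable.

```python
import os

def run_agent(prompt: str, log: list) -> None:
    # Placeholder for a real Claude Agent SDK call; here it just records
    # the prompt so the control flow can be inspected. (Hypothetical.)
    log.append(prompt)

def harness(project_dir: str, sessions: int = 3) -> list:
    log: list = []
    # First session only: the initialization agent runs if no scaffold exists.
    if not os.path.exists(os.path.join(project_dir, "claude-progress.txt")):
        run_agent("Initialize: create init.sh, feature_list.json, "
                  "claude-progress.txt, and an initial git commit.", log)
    # Every later session: the coding agent makes incremental progress.
    for _ in range(sessions):
        run_agent("Resume: read claude-progress.txt and the git log, pick one "
                  "unfinished feature, implement and test it, then commit and "
                  "update the progress file.", log)
    return log
```

The key design choice is that the initialization prompt runs exactly once, gated on the presence of the scaffold, while every subsequent session receives the same resumable prompt.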
Environment Management
The updated Claude 4 prompt guide recommends using a distinct first‑session prompt to initialize the environment with everything the coding agent will need. Key components are:
A comprehensive feature‑list JSON file that enumerates all required functionality.
An init.sh script that can start the development server.
A claude-progress.txt file that the coding agent updates after each iteration.
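As a sketch of what the initialization agent produces, the scaffold could be laid down as below. File names follow the article; the init.sh contents and the sample feature are illustrative, not taken from the original harness.

```python
import json
import os
import stat

def write_scaffold(project_dir: str) -> None:
    os.makedirs(project_dir, exist_ok=True)

    # init.sh: one known-good way to start the dev server (contents illustrative).
    init_sh = os.path.join(project_dir, "init.sh")
    with open(init_sh, "w") as f:
        f.write("#!/bin/sh\nnpm install\nnpm run dev &\n")
    os.chmod(init_sh, os.stat(init_sh).st_mode | stat.S_IEXEC)

    # feature_list.json: every feature starts with "passes": false.
    features = [{
        "category": "functional",
        "description": "New chat button creates a fresh conversation",
        "steps": ["Navigate to main interface",
                  "Click the 'New Chat' button",
                  "Verify conversation appears in sidebar"],
        "passes": False,
    }]
    with open(os.path.join(project_dir, "feature_list.json"), "w") as f:
        json.dump(features, f, indent=2)

    # claude-progress.txt: free-form notes for the next session to read.
    with open(os.path.join(project_dir, "claude-progress.txt"), "w") as f:
        f.write("Session 1: environment initialized; no features done yet.\n")
```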
Feature List
To prevent the agent from trying to complete the whole application at once, the initialization agent writes a detailed JSON file. In the claude.ai clone example the list contains over 200 entries such as “user can open a new chat, type a query, press Enter and see an AI response”. All entries start with "passes": false so later coding agents have a clear outline of the expected final state.
```json
{
  "category": "functional",
  "description": "New chat button creates a fresh conversation",
  "steps": [
    "Navigate to main interface",
    "Click the 'New Chat' button",
    "Verify conversation appears in sidebar"
  ],
  "passes": false
}
```

The coding agent is instructed to edit only the "passes" field. Strong wording (e.g., “deleting or editing tests is unacceptable”) is used to keep the model from corrupting the JSON. Experiments showed JSON to be more robust than Markdown for this purpose.
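The "edit only the passes field" rule can also be enforced mechanically. The guard below is an illustrative addition, not part of the original harness: it accepts a new version of the feature list only if every field other than "passes" is unchanged.

```python
import json

def only_passes_changed(before: str, after: str) -> bool:
    # Compare two versions of the feature list; return True only if every
    # field except "passes" is identical. A harness could run this check
    # before accepting the coding agent's edit. (Illustrative guard.)
    old, new = json.loads(before), json.loads(after)
    if len(old) != len(new):
        return False
    for a, b in zip(old, new):
        a, b = dict(a), dict(b)
        a.pop("passes", None)
        b.pop("passes", None)
        if a != b:
            return False
    return True
```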
Incremental Progress
With the scaffold in place, each coding‑agent iteration is asked to handle a single feature. After making code changes the agent commits to Git with a descriptive message and writes a summary to the progress file. This enables the model to recover from bad changes via Git and keeps the repository in a clean, merge‑ready state.
These practices also reduce wasted tokens because the agent no longer has to guess the current state of the project.
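The end-of-iteration step described above can be sketched as a small helper, assuming plain `git` CLI calls (the function name and progress-line format are illustrative):

```python
import subprocess
from datetime import date

def finish_iteration(repo_dir: str, summary: str) -> None:
    # Append a one-line summary to the progress file, then commit everything
    # with a descriptive message so the next session can recover state from
    # the git log. (Sketch; names and format are assumptions.)
    with open(f"{repo_dir}/claude-progress.txt", "a") as f:
        f.write(f"{date.today()}: {summary}\n")
    subprocess.run(["git", "-C", repo_dir, "add", "-A"], check=True)
    subprocess.run(["git", "-C", repo_dir, "commit", "-m", summary], check=True)
```

Because every iteration ends in a commit, a bad change can be rolled back with an ordinary `git revert` instead of manual repair.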
Testing
A common failure mode is the agent marking a feature as complete without proper testing. When no explicit test instruction is given, Claude may run curl commands that miss end‑to‑end failures. By explicitly prompting the agent to use a browser‑automation tool (Puppeteer MCP) and run the same tests a human would, end‑to‑end verification improves dramatically.
Providing such testing tools lets Claude recognize and fix bugs that are not obvious from the code alone. Limitations remain, such as Claude’s inability to see native browser modal alerts, which can cause missed bugs.
Quick Start
Run pwd to see the working directory; only files in that directory may be edited.
Read the Git log and claude-progress.txt to understand recent work.
Open the feature‑list JSON and pick the highest‑priority unfinished feature.
Before implementing a new feature, run the init.sh script to start the development server and execute a basic end‑to‑end test. In the claude.ai clone this means launching the local server, using Puppeteer MCP to start a new chat, send a message, and verify the response, ensuring the app is not left in a broken state.
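The steps above can be sketched as one session-start routine. File names follow the article; the server start and browser test are stubbed out, and the helper name itself is an assumption.

```python
import json
import subprocess

def start_session(project_dir: str):
    # 1. Confirm the working directory; only files here may be edited.
    cwd = subprocess.run(["pwd"], cwd=project_dir, capture_output=True,
                         text=True).stdout.strip()
    # 2. Recover context from the git log and the progress file.
    git_log = subprocess.run(
        ["git", "-C", project_dir, "log", "--oneline", "-20"],
        capture_output=True, text=True).stdout
    with open(f"{project_dir}/claude-progress.txt") as f:
        notes = f.read()
    # 3. Pick the first unfinished feature from the list.
    with open(f"{project_dir}/feature_list.json") as f:
        features = json.load(f)
    next_feature = next(
        (item for item in features if not item["passes"]), None)
    # 4. Start the dev server and run a basic end-to-end smoke test before
    #    touching the new feature (stubbed here; the article uses init.sh
    #    plus Puppeteer MCP for the browser check):
    # subprocess.run(["sh", "init.sh"], cwd=project_dir, check=True)
    return next_feature
```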
Agent Failure Modes and Solutions
Problem: Claude declares the whole project finished too early.
Initialization agent behavior: Creates a structured JSON feature list based on the specification.
Coding agent behavior: Reads the feature list at session start and selects a single feature to work on.
Problem: Environment ends up with bugs or unrecorded progress.
Initialization agent behavior: Writes the initial Git repository and progress‑note file.
Coding agent behavior: Begins the session by reading the progress note and Git log, runs basic tests on the dev server, then writes a Git commit and progress update before ending.
Problem: Claude marks a feature as passed without verification.
Initialization agent behavior: Sets up the feature list.
Coding agent behavior: Self‑validates each feature and sets "passes": true only after careful testing.
Problem: Claude spends time figuring out how to run the app.
Initialization agent behavior: Provides an init.sh script that can start the development server.
Coding agent behavior: Executes init.sh at session start, then proceeds with feature work.
Future Work
The study demonstrates a viable set of solutions for long‑running agent harnesses, but several open questions remain. It is unclear whether a single general‑purpose coding agent performs best across contexts or whether a multi‑agent architecture (e.g., dedicated testing, QA, or cleanup agents) yields better results. Extending these findings beyond full‑stack web development—to scientific research, financial modeling, or other long‑duration AI tasks—is a promising direction.
[Assistant] I'll start by getting my bearings and understanding the current state of the project.
[Tool Use] <bash - pwd>
[Tool Use] <read - claude-progress.txt>
[Tool Use] <read - feature_list.json>
[Assistant] Let me check the git log to see recent work.
[Tool Use] <bash - git log --oneline -20>
[Assistant] Now let me check if there's an init.sh script to restart the servers.
<Starts the development server>
[Assistant] Excellent! Now let me navigate to the application and verify that some fundamental features are still working.
<Tests basic functionality>
[Assistant] Based on my verification testing, I can see that the fundamental functionality is working well. The core chat features, theme switching, conversation loading, and error handling are all functioning correctly. Now let me review the tests.json file more comprehensively to understand what needs to be implemented next.
<Starts work on a new feature>