Can Agents Self‑Improve Their Harness? Designing a Self‑Harness Architecture
This article presents Self‑Harness, an engineering‑focused framework in which AI agents analyze their own execution traces, propose bounded harness edits, and keep only the changes that pass regression tests. It reports held‑out pass‑rate gains for three models and stresses reliable fact sources and staged adoption.
Problem Overview
Many long‑running agents hit the same failures again and again: missing artifacts, endless retries, lost environment variables. The root cause is often the harness, meaning the combination of task protocol, tool descriptions, state management, permission boundaries, validation rules, retry strategy, and artifact policy.
Self‑Harness Concept
Self‑Harness, introduced by Shanghai AI Lab, treats the harness as a change‑managed system rather than a static prompt. The architecture holds the model, evaluator, tools, and budget fixed, so that only the harness evolves.
Fixed model
Fixed evaluator
Fixed tools and budget
Only the harness may change
Four‑Layer Architecture
1. Task Execution Layer – the agent reads input, calls tools, writes files, runs tests, and delivers results.
2. Evidence Layer – after a failure the system records which step failed, which artifact was missing, why the verifier rejected the result, and which harness version was active.
3. Proposal Layer – the agent can edit only the declared harness surface (task protocol, tool description, retry policy, etc.) and must explain the target failure, edited surface, expected effect, and regression risk.
4. Promotion Layer – candidate edits are re‑evaluated; a modification is accepted only if at least one held‑out split improves and no split degrades.
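The proposal layer's two constraints (edits confined to a declared surface, and a required explanation for each edit) can be sketched as a small validation step. This is a minimal illustration, not the authors' implementation: the names `EDITABLE_SURFACES`, `HarnessProposal`, and `validate_proposal` are hypothetical, and the surface identifiers simply mirror the examples given above.

```python
from dataclasses import dataclass

# Hypothetical allowlist; the entries mirror the surfaces named in the article.
EDITABLE_SURFACES = {"task_protocol", "tool_description", "retry_policy"}


@dataclass(frozen=True)
class HarnessProposal:
    """One candidate harness edit plus the explanation the agent must supply."""
    target_failure: str
    edited_surface: str
    expected_effect: str
    regression_risk: str


def validate_proposal(p: HarnessProposal) -> list[str]:
    """Return a list of problems; an empty list means the proposal is well formed."""
    problems = []
    if p.edited_surface not in EDITABLE_SURFACES:
        problems.append(f"surface not editable: {p.edited_surface}")
    for name in ("target_failure", "expected_effect", "regression_risk"):
        if not getattr(p, name).strip():
            problems.append(f"missing explanation: {name}")
    return problems


# Illustrative proposals (invented for this example).
ok = HarnessProposal(
    target_failure="artifact written too late to be verified",
    edited_surface="task_protocol",
    expected_effect="artifact exists before exploration starts",
    regression_risk="less exploration on open-ended tasks",
)
bad = HarnessProposal("timeouts", "permissions", "", "low")

print(validate_proposal(ok))
print(validate_proposal(bad))
```

Rejecting malformed proposals before evaluation keeps the promotion layer from spending budget on edits that touch surfaces outside the declared boundary.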
The recorded evidence should answer these questions:
Which harness version was used this round?
Where did the task stop?
Where were the tool calls and file changes?
Did the artifact actually persist?
Which verifier reported the failure reason?
Which historical evidence will the next proposal see?
Experimental Evaluation
The experiments run on the Terminal‑Bench‑2.0 benchmark with three model families: MiniMax M2.5, Qwen3.5‑35B‑A3B, and GLM‑5. The held‑out pass‑rate improvements are:
MiniMax M2.5: 40.5 % → 61.9 % (+21.4 pp)
Qwen3.5‑35B‑A3B: 23.8 % → 38.1 % (+14.3 pp)
GLM‑5: 42.9 % → 57.1 % (+14.2 pp)
All gains stem from harness edits; the model, tools, and budget remain unchanged. Moreover, no accepted edit caused degradation on either the held‑in or the held‑out split, which addresses a common pitfall of local optimization: fixing one failure while quietly breaking another.
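The reported gains are percentage‑point differences, not relative improvements. The short check below recomputes them from the before and after figures quoted above; the relative gain is derived here for contrast and is not a figure reported by the article. Function names are illustrative.

```python
# Held-out pass rates quoted in the article: (before, after), in percent.
results = {
    "MiniMax M2.5": (40.5, 61.9),
    "Qwen3.5-35B-A3B": (23.8, 38.1),
    "GLM-5": (42.9, 57.1),
}


def gain_pp(before: float, after: float) -> float:
    """Absolute gain in percentage points."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Gain relative to the starting pass rate, in percent (derived, not reported)."""
    return round(100 * (after - before) / before, 1)


for model, (before, after) in results.items():
    print(f"{model}: +{gain_pp(before, after)} pp, "
          f"{relative_gain(before, after)}% relative")
```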
Model‑Specific Failure Modes
Analysis of the accepted edits reveals distinct “diseases”:
MiniMax M2.5 tended to produce late artifacts and over‑explore the dataset. The new harness forces early artifact creation and switches to verification once tool calls become costly.
Qwen3.5 repeatedly failed on file creation and overwriting. The harness adds dependency pre‑checks, aborts futile retries, and introduces artifact‑recovery logic.
GLM‑5 struggled with environment persistence and long downloads. The harness ensures cross‑shell state retention and redirects prolonged exploration toward implementation and testing.
These cases illustrate that a single initial harness exposes different weaknesses across models, reinforcing that self‑harnessing is more than a universal “stronger prompt”.
Reliance on Reliable Fact Sources
Self‑Harness requires complete execution traces, artifact persistence, harness version logging, and checkpoint storage. If any of these is missing, weaknesses can no longer be mined from evidence, and the system falls back on human recollection.
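One way to make this requirement concrete is a trace record that refuses to count as evidence when key facts are absent. The sketch below is an assumption‑laden illustration: `TraceRecord`, `REQUIRED_FACTS`, and `missing_facts` are hypothetical names, and the fields follow the trace contents listed in the rollout plan in this article.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TraceRecord:
    """One task run, as the evidence layer might store it (hypothetical schema)."""
    task_input: str
    harness_version: Optional[str] = None
    tool_calls: list = field(default_factory=list)
    file_changes: list = field(default_factory=list)
    test_output: Optional[str] = None
    verifier_result: Optional[str] = None
    # May legitimately stay None: a missing artifact is itself a failure signal.
    final_artifact: Optional[str] = None
    stop_reason: Optional[str] = None


# Facts assumed here to be mandatory before a run can be mined for weaknesses.
REQUIRED_FACTS = ("harness_version", "verifier_result", "stop_reason")


def missing_facts(record: TraceRecord) -> list[str]:
    """List the required facts that were not recorded for this run."""
    return [name for name in REQUIRED_FACTS if not getattr(record, name)]


incomplete = TraceRecord(task_input="fix failing CI job")
complete = TraceRecord(
    task_input="fix failing CI job",
    harness_version="v3",
    verifier_result="rejected: output file not found",
    stop_reason="retry budget exhausted",
)

print(missing_facts(incomplete))
print(missing_facts(complete))
```

A run with missing facts can still be archived, but under this sketch it would be excluded from weakness mining rather than patched up from memory.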
The article also outlines a practical rollout plan for teams:
Start with low‑risk tasks (CI failure classification, documentation link repair, test artifact checks, dependency pre‑checks, duplicate error clustering).
Record full traces: input, harness version, tool calls, file changes, test output, verifier result, final artifact, stop reason.
Classify failures into concrete categories (missing artifact, repeated command failure, dependency missing, exploration without output, test failure not fixed).
Expose a limited harness surface (task protocol, tool description, middleware, verifier guidance, retry policy, artifact policy, state ledger format) while keeping permissions, production config, billing, and external connectors manual.
Require each candidate edit to include target failure, edited surface, expected effect, and regression risk.
Automate promotion with regression gates: at least one split must improve, no split may degrade, artifacts must exist, budget must not worsen, and high‑risk changes go through human review.
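The regression gate in the last step can be sketched as a single decision function. This is a minimal sketch under stated assumptions, not the published implementation: `EvalResult`, `Decision`, and `gate` are hypothetical names, the budget is reduced to a single number, and the pass rates in the example are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    REJECT = "reject"
    HUMAN_REVIEW = "human_review"
    PROMOTE = "promote"


@dataclass
class EvalResult:
    pass_rates: dict        # split name -> pass rate in percent
    artifacts_exist: bool   # did every required artifact persist?
    budget_used: float      # e.g. tool calls or tokens (unit is an assumption)


def gate(baseline: EvalResult, candidate: EvalResult, high_risk: bool = False) -> Decision:
    """Apply the promotion rules: improve somewhere, degrade nowhere, keep artifacts and budget."""
    if set(candidate.pass_rates) != set(baseline.pass_rates):
        return Decision.REJECT  # splits must match to be comparable
    splits = baseline.pass_rates
    improves = any(candidate.pass_rates[s] > splits[s] for s in splits)
    degrades = any(candidate.pass_rates[s] < splits[s] for s in splits)
    if not improves or degrades:
        return Decision.REJECT
    if not candidate.artifacts_exist:
        return Decision.REJECT
    if candidate.budget_used > baseline.budget_used:
        return Decision.REJECT
    # High-risk changes are never promoted automatically.
    return Decision.HUMAN_REVIEW if high_risk else Decision.PROMOTE


# Illustrative numbers only.
base = EvalResult({"held_in": 50.0, "held_out": 40.0}, True, 100.0)
better = EvalResult({"held_in": 50.0, "held_out": 45.0}, True, 100.0)
traded = EvalResult({"held_in": 48.0, "held_out": 45.0}, True, 100.0)

print(gate(base, better))
print(gate(base, traded))
print(gate(base, better, high_risk=True))
```

Note that the `traded` candidate is rejected even though its held‑out rate rises, because the gate refuses any edit that buys a gain on one split with a loss on another.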
Relation to Prior Work
Self‑Harness builds on Loop Engineering (feedback loop of trigger‑execute‑verify‑state‑stop), How We Claude Code (pre‑move specifications and verification), and Fable 5 (harness as a first‑class component). Compared with Meta‑Harness, Agentic Harness Engineering, and Adaptive Auto‑Harness, Self‑Harness emphasizes a strict promotion gate and observable fact sources.
Takeaways
Self‑Harness demonstrates that agents can safely self‑improve when the system records reliable evidence, proposes bounded edits, and validates them through regression testing. The approach turns the harness into a versioned, auditable component that can be rolled back, a capability that will become essential for long‑running AI‑driven workflows.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
