Can Harnesses Self‑Evolve? Fudan & Peking University’s Agentic Harness Engineering Breakthrough
The paper introduces Agentic Harness Engineering (AHE), a method built on three observability pillars. Ten rounds of evolution raise a coding agent’s pass@1 on Terminal‑Bench 2 from 69.7% to 77.0%, ahead of Codex‑CLI, and the evolved harness transfers zero‑shot to SWE‑bench and to multiple model families.
1. Overlooked Harness Engineering Bottleneck
A coding agent’s performance depends not only on the base LLM but also on the surrounding harness: system prompts, tools, middleware, skills/sub‑agents, and long‑term memory. Current harness design follows a manual workshop model in which developers read massive logs, identify failure patterns, and hand‑tweak prompts or tools. That process cannot keep pace with rapid model upgrades (e.g., GPT‑5.4, DeepSeek‑V4, Qwen‑3.6).
2. AHE Core Design: Three Observability Pillars
AHE’s insight is that the bottleneck lies in observability, not agent intelligence. By providing structured context and a clear action space, an evolution agent can reliably converge to better harness designs.
2.1 Component Observability – File‑Level Decoupling
AHE builds on the NexAU framework, which exposes the harness as seven orthogonal file‑based components: System Prompt, Tool Description, Tool Implementation, Middleware, Skill, Sub‑agent Configuration, and Long‑term Memory. Because each failure maps to a single component, fixes can be committed, diffed, and rolled back git‑style without touching unrelated parts.
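As a rough illustration of file‑level decoupling, the sketch below maps changed file paths to harness components and checks whether an edit stays within one of them. The directory names are hypothetical (this summary does not describe NexAU's actual layout), and counting tool descriptions and tool implementations separately to reach seven components is also an assumption.

```python
from pathlib import PurePosixPath

# Hypothetical layout: top-level directory -> harness component.
# These names are illustrative assumptions, not NexAU's real structure.
COMPONENT_DIRS = {
    "system_prompt": "System Prompt",
    "tool_descriptions": "Tool Description",
    "tool_impls": "Tool Implementation",
    "middleware": "Middleware",
    "skills": "Skill",
    "subagents": "Sub-agent Configuration",
    "memory": "Long-term Memory",
}


def components_touched(changed_paths):
    """Return the set of components that a list of changed files belongs to."""
    touched = set()
    for path in changed_paths:
        top = PurePosixPath(path).parts[0]
        if top not in COMPONENT_DIRS:
            raise ValueError(f"{path} is outside the known harness components")
        touched.add(COMPONENT_DIRS[top])
    return touched


def is_single_component_edit(changed_paths):
    """True when every changed file maps to the same component."""
    return len(components_touched(changed_paths)) == 1
```

With a layout like this, a commit that changes only files under `middleware/` can be reverted on its own, which is the property the git‑style workflow relies on.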
2.2 Experience Observability – Hierarchical Trace Distillation
Raw traces (millions of tokens) are treated as navigable files. The Agent Debugger parses them with generic shell/script tools and produces two reports: per‑task root‑cause analysis and a benchmark‑level overview, allowing progressive disclosure that saves tokens while keeping decisions evidence‑based.
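The two report levels can be pictured with a small sketch: per‑task reports are collapsed into a compact overview, and the reader drills into individual tasks only for a chosen root cause. The `TaskReport` fields and the function names are assumptions made for illustration, not the Agent Debugger's actual output format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TaskReport:
    """Per-task root-cause summary (fields are illustrative assumptions)."""
    task_id: str
    passed: bool
    root_cause: str  # short label such as "timeout"; empty for passing tasks


def benchmark_overview(reports):
    """Collapse per-task reports into a compact benchmark-level summary."""
    failures = [r for r in reports if not r.passed]
    return {
        "total": len(reports),
        "failed": len(failures),
        # Root causes ordered from most to least frequent.
        "root_causes": Counter(r.root_cause for r in failures).most_common(),
    }


def drill_down(reports, root_cause):
    """Second level: list only the failed tasks that share one root cause."""
    return [
        r.task_id
        for r in reports
        if not r.passed and r.root_cause == root_cause
    ]
```

Reading the overview first and calling `drill_down` only for the dominant root cause is one way to get progressive disclosure: most per‑task detail is never loaded into context.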
2.3 Decision Observability – Verifiable Edit Contracts
Each evolution step reads the hierarchical evidence and decides which components to edit. AHE enforces two contracts:
Controllability: edits are confined to the harness workspace; the seed system prompt is read‑only, preventing shortcuts such as disabling validators or swapping models.
Self‑declared prediction: every edit includes a manifest recording failure evidence, inferred root cause, proposed fix, and predicted impact (tasks to be repaired and possible regressions). After rollout, predicted impacts are intersected with actual task‑level deltas to produce a verdict, replacing self‑justification with cross‑round empirical validation.
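The self‑declared prediction contract can be sketched as a manifest plus a verdict function that intersects predictions with observed task‑level changes. Field names are assumptions, and each task is reduced to a single pass/fail flag per round, which simplifies the multi‑rollout setting used in the experiments.

```python
from dataclasses import dataclass, field


@dataclass
class EditManifest:
    """Self-declared prediction attached to one edit (field names assumed)."""
    component: str
    evidence: str
    root_cause: str
    proposed_fix: str
    predicted_fixes: set = field(default_factory=set)
    predicted_regressions: set = field(default_factory=set)


def verdict(manifest, before, after):
    """Compare predictions with observed task-level deltas.

    `before` and `after` map task id -> bool (passed) for consecutive rounds.
    """
    fixed = {t for t in before if not before[t] and after.get(t, False)}
    regressed = {t for t in before if before[t] and not after.get(t, False)}
    return {
        "confirmed_fixes": manifest.predicted_fixes & fixed,
        "missed_fixes": manifest.predicted_fixes - fixed,
        "unpredicted_fixes": fixed - manifest.predicted_fixes,
        "confirmed_regressions": manifest.predicted_regressions & regressed,
        "unpredicted_regressions": regressed - manifest.predicted_regressions,
    }
```

Because the verdict is computed from rollout results rather than from the agent's own explanation, an edit whose predicted fixes never materialize shows up as `missed_fixes` in the next round.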
3. Experimental Results
3.1 RQ1 – Does AHE surpass human‑designed and automated baselines?
Running 10 rounds on Terminal‑Bench 2 (89 tasks, k=2 rollouts per task) took ~32 hours. AHE raised pass@1 from 69.7% to 77.0%, beating human‑crafted Codex‑CLI (71.9%) and automatic baselines ACE (68.9%) and TF‑GRPO (72.3%). On Easy and Medium splits AHE leads; on Hard it trails Codex‑CLI due to non‑additive component interactions.
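For readers unfamiliar with the metric, here is a minimal sketch of how pass@1 can be estimated from several rollouts per task. It assumes the common estimator (the per‑task success fraction, averaged over tasks); this summary does not state exactly how the paper computes it.

```python
def pass_at_1(rollouts):
    """Estimate pass@1 as the mean per-task success rate.

    `rollouts` maps task id -> list of booleans, one entry per rollout.
    """
    if not rollouts:
        raise ValueError("no tasks given")
    per_task = [sum(results) / len(results) for results in rollouts.values()]
    return sum(per_task) / len(per_task)
```

With two rollouts per task, a task that succeeds once contributes 0.5, so the score reflects flaky tasks instead of rounding them up or down.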
3.2 RQ2 – Does the evolved harness overfit?
The evolved harness, trained on GPT‑5.4 high + Terminal‑Bench 2, was zero‑shot transferred to two settings:
SWE‑bench‑verified: AHE achieved the highest overall success rate while using 12% fewer tokens than the seed.
Cross‑model transfer: applying the harness across several base models (GPT‑5.4 medium, high, xhigh; Gemini‑3.1‑flash‑lite; DeepSeek‑v4‑flash; Qwen‑3.6‑plus) yielded consistent gains ranging from +2.3 to +10.1 percentage points, with larger improvements on less‑saturated models. This suggests that AHE captures a general coordination pattern rather than model‑specific prompt tricks.
3.3 RQ3 – Where do the gains come from? Is self‑attribution reliable?
Component ablation (re‑injecting each component into the seed harness) shows:
Long‑term Memory: +5.6 pp
Tools: +3.3 pp
Middleware: +2.2 pp
System Prompt: –2.3 pp (regression)
The three positive contributions sum to +11.1 pp, yet the full AHE harness adds only +7.3 pp because the interactions are non‑additive: combined, the components increase verification loops, which causes redundant retries on Hard tasks.
Fix prediction reaches 33.7% precision and 51.4% recall (random baselines: 6.5% and 10.6%), roughly five times better than random, which indicates that the evolution agent’s edits are evidence‑driven. Regression prediction is weaker (precision 11.8%, recall 11.1%, versus 5.6% and 5.4% for random); this is identified as the main limitation and a target for future work.
Paper: Agentic Harness Engineering: Observability‑Driven Automatic Evolution of Coding‑Agent Harnesses, https://arxiv.org/pdf/2604.25850
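The arithmetic behind the non‑additivity and "about five times" claims can be checked directly. The snippet below uses only figures quoted in this article; the variable names and the `lift` helper are illustrative.

```python
# Ablation deltas in percentage points, as quoted above.
ablation_pp = {
    "Long-term Memory": 5.6,
    "Tools": 3.3,
    "Middleware": 2.2,
    "System Prompt": -2.3,
}
seed_pass, evolved_pass = 69.7, 77.0  # pass@1 on Terminal-Bench 2, in percent

positive_sum = sum(v for v in ablation_pp.values() if v > 0)
full_gain = evolved_pass - seed_pass
shortfall = positive_sum - full_gain  # gain lost to component interactions


def lift(observed, random_baseline):
    """How many times better than the random baseline a score is."""
    return observed / random_baseline


fix_lift = (lift(33.7, 6.5), lift(51.4, 10.6))         # (precision, recall)
regression_lift = (lift(11.8, 5.6), lift(11.1, 5.4))   # (precision, recall)

print(f"positive sum {positive_sum:.1f} pp, full gain {full_gain:.1f} pp, "
      f"shortfall {shortfall:.1f} pp")
print("fix lift: %.1fx precision, %.1fx recall" % fix_lift)
print("regression lift: %.1fx precision, %.1fx recall" % regression_lift)
```

The positive deltas total 11.1 pp against a 7.3 pp overall gain, a 3.8 pp shortfall. Fix prediction is about 5.2x and 4.8x the random baseline for precision and recall, while regression prediction is only about 2.1x on both.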