Agentic Harness Engineering Enables Agents to Self‑Evolve and Outperform Codex in 10 Rounds
The Agentic Harness Engineering (AHE) framework lets coding agents automatically read massive execution traces, identify failure patterns, and iteratively modify harness components—prompt, tools, middleware, and memory—achieving a pass@1 increase from 69.7% to 77.0% and surpassing human‑tuned Codex‑CLI after ten automated evolution rounds.
Agentic Harness Engineering (AHE) Overview
Real‑world software‑engineering tasks produce execution logs that run to millions of tokens and involve a large, heterogeneous action space, which makes manual debugging of coding‑agent harnesses difficult.
AHE first decomposes a harness into editable components—system prompt, tool implementations, middleware, and long‑term memory—using the NexAU framework. Each component resides in a separate file, providing a clear modification entry point.
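A minimal sketch of what "one file per component" could look like is shown here. The file names and the HarnessComponent type are hypothetical and chosen for illustration; they are not NexAU's actual layout or API.

```python
from dataclasses import dataclass
from pathlib import Path

# Hypothetical layout: one file per editable component.
# These names are illustrative, not NexAU's real structure.
COMPONENT_FILES = {
    "system_prompt": "system_prompt.md",
    "tools": "tools.py",
    "middleware": "middleware.py",
    "memory": "memory.md",
}


@dataclass(frozen=True)
class HarnessComponent:
    name: str   # logical component name
    path: Path  # the single file an edit to this component would touch


def list_components(root: Path) -> list[HarnessComponent]:
    """Return one editable entry point per harness component."""
    return [HarnessComponent(name, root / rel) for name, rel in COMPONENT_FILES.items()]


components = list_components(Path("harness"))
for component in components:
    print(component.name, "->", component.path)
```

Keeping each component in its own file means an edit, and any later rollback, can be scoped to a single path.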
To make massive trajectories tractable, AHE employs an Agent Debugger that compresses raw logs into a hierarchical evidence corpus, enabling the evolution agent to query structured failure causes without scanning the full log.
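The idea of a hierarchical evidence corpus can be sketched as a two-level grouping of failed steps. This is only an illustration under assumed inputs: the event fields (task, status, category, message) and both function names are hypothetical, and the real Agent Debugger may organize evidence differently.

```python
from collections import defaultdict


def build_evidence_corpus(events):
    """Group failed steps as category -> task -> short evidence snippets."""
    corpus = defaultdict(lambda: defaultdict(list))
    for event in events:
        if event["status"] != "error":
            continue  # keep only steps that failed
        snippet = event["message"][:80]  # truncate long log lines
        corpus[event["category"]][event["task"]].append(snippet)
    # Convert nested defaultdicts to plain dicts for stable printing and lookup.
    return {category: dict(tasks) for category, tasks in corpus.items()}


def summarize(corpus):
    """Top level of the hierarchy: number of affected tasks per failure category."""
    return {category: len(tasks) for category, tasks in corpus.items()}


# Made-up events standing in for a raw execution log.
events = [
    {"task": "t1", "status": "ok", "category": "none", "message": "compiled"},
    {"task": "t1", "status": "error", "category": "timeout", "message": "command exceeded 60s"},
    {"task": "t2", "status": "error", "category": "timeout", "message": "command exceeded 60s"},
    {"task": "t2", "status": "error", "category": "bad_path", "message": "no such file: build/out"},
]

corpus = build_evidence_corpus(events)
print(summarize(corpus))  # {'timeout': 2, 'bad_path': 1}
```

A consumer can read the summary first and then drill into a single category and task, instead of scanning every log line.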
When proposing a change, the agent must emit a change list that specifies the tasks the edit is intended to fix and any tasks that might regress, allowing file‑level rollback after evaluation.
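A change list of this kind might be represented as follows. The ChangeManifest fields and the rollback rule (revert when regressions outnumber realized fixes) are assumptions made for illustration; the source does not specify AHE's actual rollback criterion.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeManifest:
    file: str  # harness file the edit touches
    expected_fixes: set[str] = field(default_factory=set)        # tasks the edit should repair
    possible_regressions: set[str] = field(default_factory=set)  # tasks the edit might break


def evaluate_change(manifest, before, after):
    """Compare per-task pass/fail results before and after an edit.

    Hypothetical rule: roll the file back when regressions outnumber realized fixes.
    """
    fixed = {t for t in manifest.expected_fixes if after.get(t, False) and not before.get(t, False)}
    regressed = {t for t, passed in before.items() if passed and not after.get(t, False)}
    return {
        "fixed": fixed,
        "regressed": regressed,
        "unforeseen": regressed - manifest.possible_regressions,
        "roll_back": len(regressed) > len(fixed),
    }


manifest = ChangeManifest("tools.py", expected_fixes={"a"}, possible_regressions={"b"})
before = {"a": False, "b": True, "c": True}
after = {"a": True, "b": True, "c": False}
report = evaluate_change(manifest, before, after)
print(report)
```

Because the manifest names a single file, a failed evaluation can revert just that file rather than the whole harness.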
Experimental Evaluation on Terminal‑Bench 2
Starting from the NexAU seed, AHE performed 10 automated evolution iterations.
Pass@1 improved from 69.7 % to 77.0 %, surpassing the human‑engineered Codex‑CLI harness (71.9 %) and the ACE (68.9 %) and TF‑GRPO (72.3 %) baselines.
Component ablation (re‑injecting each evolved component into the original seed) showed:
Updating long‑term memory contributed +5.6 %.
Updating tools contributed +3.3 %.
Updating middleware contributed +2.2 %.
Replacing only the system prompt reduced pass@1 to 67.4 %, indicating that structural changes, not longer prompts, drive the gains.
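The ablation figures can be cross-checked with a few lines of arithmetic. The sketch assumes the reported component contributions are pass@1 deltas in percentage points relative to the 69.7 % seed, which this summary does not state explicitly; the variable names are illustrative. Under that reading, the three single-component gains sum to more than the full improvement, which would suggest the components do not combine additively.

```python
# Figures reported in the text. Treating the contributions as
# percentage-point deltas against the seed is an assumption.
SEED_PASS_AT_1 = 69.7
EVOLVED_PASS_AT_1 = 77.0
component_gain = {"memory": 5.6, "tools": 3.3, "middleware": 2.2}
prompt_only_pass_at_1 = 67.4

total_gain = round(EVOLVED_PASS_AT_1 - SEED_PASS_AT_1, 1)
sum_of_single_gains = round(sum(component_gain.values()), 1)
prompt_only_delta = round(prompt_only_pass_at_1 - SEED_PASS_AT_1, 1)

print("full harness gain:", total_gain)                 # 7.3
print("sum of single gains:", sum_of_single_gains)      # 11.1
print("prompt-only delta:", prompt_only_delta)          # -2.3
```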
Transfer to SWE‑bench Verified
With the evolved harness frozen, AHE was evaluated on SWE‑bench Verified without further tuning. In that setting, it:
Achieved the highest overall success rate among compared methods.
Reduced average token consumption by 12 % relative to the seed harness.
Improved cost‑efficiency (Succ/Mtok) compared with ACE and TF‑GRPO.
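The cost-efficiency metric can be read as successes per million tokens. The sketch below assumes that definition of Succ/Mtok (solved tasks divided by total tokens consumed, in millions); the function name and the input numbers are made up for illustration and are not results from the evaluation.

```python
def succ_per_mtok(successes: int, total_tokens: int) -> float:
    """Solved tasks per million tokens consumed (assumed meaning of Succ/Mtok)."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return successes / (total_tokens / 1_000_000)


# Illustrative inputs only: the same number of solved tasks,
# with the second run using 12 % fewer tokens.
baseline = succ_per_mtok(300, 150_000_000)
leaner = succ_per_mtok(300, 132_000_000)
print(baseline, round(leaner, 3))
```

With success held constant, a 12 % drop in tokens raises the ratio, so the metric rewards a harness that solves the same tasks with less context.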
When the same harness was paired with various foundation models (GPT‑5.4, DeepSeek‑v4‑flash, Qwen‑3.6‑plus, Gemini‑3.1‑flash‑lite), pass@1 rose by 2.3 % to 10.1 % for each model, demonstrating that the learned harness transfers across models.
Predictive Capability of the Evolution Agent
During evolution, the agent predicted the effect of its edits:
Repair‑task prediction: precision 33.7 %, recall 51.4 % (well above a random baseline).
Regression‑risk prediction: recall 11.1 %, indicating limited foresight of potential regressions and explaining occasional score fluctuations.
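Precision and recall for these predictions follow the standard set-based definitions. The sketch below scores a predicted task set against the tasks that actually changed; the task IDs are invented for illustration, and the function is not taken from the AHE code base.

```python
def precision_recall(predicted: set[str], actual: set[str]) -> tuple[float, float]:
    """Precision = hits / predicted, recall = hits / actual (0.0 when undefined)."""
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical example: the agent predicts three repaired tasks,
# two tasks are actually repaired, and one of them was predicted.
predicted_fixes = {"t1", "t2", "t3"}
actual_fixes = {"t2", "t4"}
precision, recall = precision_recall(predicted_fixes, actual_fixes)
print(round(precision, 3), recall)
```

Low recall on regression risk means that most tasks that actually regressed were never listed as at-risk in the change list.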
Key Takeaways
AHE turns a static prompt into a learnable, observable artifact. By making the evolution process traceable, verifiable, and rollback‑able, it demonstrates that coding‑agent performance can be improved through systematic harness engineering rather than merely extending prompts.
Code repository:
https://github.com/china-qijizhifeng/agentic-harness-engineering
