Self‑Evolving Harness Engineering Propels GPT‑5.4 to a 7‑Point Gain, Securing a Global Top‑3 Spot

The paper introduces Agentic Harness Engineering (AHE), an observability‑driven framework that automatically evolves coding‑agent harnesses, boosting GPT‑5.4's pass@1 score on Terminal‑Bench 2 from 69.7% to 77.0% (+7.3 points), achieving a worldwide top‑three ranking and demonstrating strong cross‑task and cross‑model generalization.

Machine Heart

Agentic Harness Engineering (AHE)

Agentic Harness Engineering (AHE) is an observability‑driven automatic optimization framework that covers the entire Harness Engineering workflow and is designed to give the model as much agency as possible. It is implemented in the open‑source repository https://github.com/china-qijizhifeng/agentic-Harness-engineering and described in the paper "Agentic Harness Engineering: Observability‑Driven Automatic Evolution of Coding‑Agent Harnesses" (https://arxiv.org/abs/2604.25850).

Motivation

Model capabilities change rapidly, roughly on a monthly cadence, while task distributions are long‑tailed, so iterating on a Harness by hand is costly. The core question is which parts of the Harness Engineering loop can be automated and how the Harness can learn from experience.

Observability Stack

AHE defines three roles that operate on a three‑layer observability stack:

Coding Agent – executes tests on a target model.

Agent Debugger – aggregates raw execution traces.

Evolve Agent – modifies the Coding Agent’s Harness based on evidence.

The stack consists of:

NexAU – provides decoupled Harness components with built‑in observability.

Agent Debugger – compresses ~10 M token raw traces into a hierarchical feedback report of ~10 K tokens for the Evolve Agent.

Evolve Agent – uses git‑tracked component history and the feedback report to construct evidence‑driven modification chains.

Component Observability

The Coding Agent runs on the NexAU framework. AHE splits the Harness into seven orthogonal file‑level components: System Prompt, Tool Description, Tool Implementation, Middleware, Skill, Sub‑agent Config, and Long‑term Memory. Each component resides in its own file with a clear mount point, enabling precise attribution of failures. All changes are version‑controlled with Git, making each commit traceable, auditable, and reversible. The initial Coding Agent starts from a minimal “zero‑prior” state containing only a run_shell_command tool.
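To illustrate how file‑level components make failures attributable, here is a minimal sketch that maps a changed file back to the component it belongs to. The seven component names come from the text above; every path, along with the names `COMPONENT_MOUNTS` and `component_for`, is a hypothetical choice for this example and not the layout used by NexAU or the AHE repository.

```python
from pathlib import PurePosixPath
from typing import Optional

# Component names follow the article; all mount points are hypothetical.
COMPONENT_MOUNTS = {
    "system_prompt": "harness/system_prompt.md",
    "tool_description": "harness/tools/descriptions",
    "tool_implementation": "harness/tools/impl",
    "middleware": "harness/middleware",
    "skill": "harness/skills",
    "sub_agent_config": "harness/sub_agents",
    "long_term_memory": "harness/memory",
}


def component_for(changed_path: str) -> Optional[str]:
    """Return the component whose mount point contains changed_path, if any."""
    path = PurePosixPath(changed_path)
    for name, mount in COMPONENT_MOUNTS.items():
        mount_path = PurePosixPath(mount)
        # A file belongs to a component if it is the mount point itself
        # or lives somewhere beneath it.
        if path == mount_path or mount_path in path.parents:
            return name
    return None


if __name__ == "__main__":
    for changed in ("harness/middleware/retry.py", "eval/runner.py"):
        print(changed, "->", component_for(changed))
```

Because each component has exactly one mount point, a changed file resolves to at most one component, which is the property that lets a regression be traced to a single part of the Harness.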

Experience Observability

Each evaluation can generate tens of millions of tokens. Directly feeding this to the Evolve Agent would exceed its context window. Agent Debugger implements a three‑layer pipeline:

Bottom layer records the raw trace.

Middle layer runs a Cleaner that removes duplicate tool outputs.

Top layer uses a QA Sub‑agent that adapts its questions across multiple rollouts and produces a ~10 K token overview report for the Evolve Agent.

This progressive disclosure turns massive raw data into consumable, auditable experience assets.
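The middle layer of this pipeline can be sketched as a deduplication pass over trace events. The article says only that the Cleaner removes duplicate tool outputs, so the event schema (`type`, `content`) and the choice to leave a short back‑reference in place of each repeat are assumptions made for illustration.

```python
import hashlib
from typing import Dict, List


def clean_trace(events: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Replace repeated tool outputs with a short back-reference.

    Assumed schema: each event is a dict with a "type" and a "content" key.
    """
    first_seen: Dict[str, int] = {}  # content hash -> index of first occurrence
    cleaned: List[Dict[str, str]] = []
    for index, event in enumerate(events):
        if event.get("type") != "tool_output":
            cleaned.append(event)
            continue
        digest = hashlib.sha256(event["content"].encode("utf-8")).hexdigest()
        if digest in first_seen:
            # Keep the event's position in the trace but drop the bulky payload.
            cleaned.append({
                "type": "tool_output",
                "content": f"[duplicate of event {first_seen[digest]}]",
            })
        else:
            first_seen[digest] = index
            cleaned.append(event)
    return cleaned


if __name__ == "__main__":
    trace = [
        {"type": "tool_call", "content": "ls"},
        {"type": "tool_output", "content": "a.txt\nb.txt"},
        {"type": "tool_output", "content": "a.txt\nb.txt"},
    ]
    for event in clean_trace(trace):
        print(event)
```

Keeping a back‑reference rather than deleting the event preserves the order of the trace, so the report stays auditable against the raw record in the bottom layer.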

Decision Observability

Modifications are restricted to Harness component files inside the workspace; the evaluation framework, LLM configuration, and original System Prompt are read‑only, which prevents the Evolve Agent from gaming the evaluation instead of improving the Harness.
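One way to enforce such a boundary is a path check applied before any edit is accepted. The sketch below is an assumption about how this could look, not the mechanism AHE uses: `WRITABLE_ROOT`, `READ_ONLY_PATHS`, and `is_edit_allowed` are hypothetical names, and the paths are placeholders.

```python
import posixpath

# Hypothetical layout: only files under this directory may be edited.
WRITABLE_ROOT = "workspace/harness"
# Hypothetical location of the original System Prompt, kept read-only.
READ_ONLY_PATHS = ("workspace/harness/seed_system_prompt.md",)


def is_edit_allowed(path: str) -> bool:
    """Return True only for editable Harness files inside the workspace."""
    # Normalize first so that "../" segments cannot escape the writable root.
    normalized = posixpath.normpath(path)
    if normalized in READ_ONLY_PATHS:
        return False
    return normalized.startswith(WRITABLE_ROOT + "/")


if __name__ == "__main__":
    for candidate in (
        "workspace/harness/middleware/retry.py",
        "workspace/harness/../eval/run.py",
        "workspace/harness/seed_system_prompt.md",
    ):
        print(candidate, "->", is_edit_allowed(candidate))
```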

Each change must include a “change list” detailing the failure evidence, the inferred root cause, the targeted fix, and a self‑declared impact prediction. The next evaluation validates the prediction: edits whose predictions hold are retained, and those whose predictions fail are rolled back.

This turns every Harness change into a testable hypothesis, shifting evolution from art to engineering.
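The change list can be read as a small record plus a keep‑or‑rollback rule. The four fields mirror the ones named above; the class name `ChangeRecord`, the function `decide`, and the specific acceptance rule (at least one predicted task flips from fail to pass and no previously passing task regresses) are assumptions for illustration, since the article does not state the exact criterion.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class ChangeRecord:
    failure_evidence: str              # what was observed in the traces
    root_cause: str                    # the inferred explanation
    fix: str                           # the targeted modification
    predicted_fixes: FrozenSet[str]    # tasks the edit is predicted to fix


def decide(record: ChangeRecord,
           before: Dict[str, bool],
           after: Dict[str, bool]) -> str:
    """Return "keep" or "rollback" under the assumed acceptance rule."""
    fixed = {
        task for task in record.predicted_fixes
        if not before.get(task, False) and after.get(task, False)
    }
    regressed = {
        task for task, passed in before.items()
        if passed and not after.get(task, False)
    }
    return "keep" if fixed and not regressed else "rollback"


if __name__ == "__main__":
    record = ChangeRecord(
        failure_evidence="command timed out with no retry",
        root_cause="middleware drops the timeout error",
        fix="surface timeout errors to the agent",
        predicted_fixes=frozenset({"task_a"}),
    )
    before = {"task_a": False, "task_b": True}
    after = {"task_a": True, "task_b": True}
    print(decide(record, before, after))
```

In a Git‑tracked workspace, a "rollback" verdict would correspond to reverting the commit that carried the change.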

Experimental Results

Using GPT‑5.4 as the underlying model, AHE raised the pass@1 score on Terminal‑Bench 2 from 69.7 % to 77.0 % (+7.3 percentage points, +10.5 % relative). This surpasses OpenAI’s official Codex‑CLI (71.9 %) and baselines such as ACE and Training‑Free‑GRPO.
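The absolute and relative gains quoted above follow directly from the two pass@1 scores. The short check below reproduces the arithmetic; the function name `gains` is only an illustrative helper.

```python
from typing import Tuple


def gains(baseline: float, evolved: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = evolved - baseline
    relative = absolute / baseline * 100.0
    return absolute, relative


if __name__ == "__main__":
    absolute, relative = gains(69.7, 77.0)
    # prints: +7.3 points, +10.5% relative
    print(f"+{absolute:.1f} points, +{relative:.1f}% relative")
```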

Cross‑Task Generalization

When the evolved Harness was frozen and applied to SWE‑Bench Verified, AHE achieved higher success rates with fewer tokens than ACE and TF‑GRPO, demonstrating transferable engineering knowledge.

Cross‑Model Generalization

Applying the same Harness to Qwen‑3.6‑Plus, Gemini‑3.1‑Flash, and DeepSeek‑V4 yielded improvements of +5.1 to +10.1 percentage points, with larger gains on weaker models, indicating that the Harness captures model‑agnostic structural principles.

Insights and Failure Analyses

Early experiments on only 30 hard Terminal‑Bench 2 problems caused over‑fitting: the Evolve Agent introduced task‑specific hacks (e.g., Golden Gate splice‑offset detection). Expanding to the full 89‑problem set and adding explicit methodological prompts mitigated over‑fitting but introduced a performance ceiling at 75.3 % and concentrated edits in Middleware.

The final successful version made two key changes:

Each test was run twice; a partial‑pass diff identified precise diagnostic signals.

All handcrafted behavioral guidance was removed, leaving only evidence‑driven requirements and rollback rules.
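The first change can be pictured as comparing two runs of the same tasks and keeping those whose outcomes disagree. This is one plausible reading of "partial‑pass diff"; the article does not specify how the diff is computed, so the function `partial_pass_tasks` and the pass/fail dictionaries are hypothetical.

```python
from typing import Dict, List


def partial_pass_tasks(run_a: Dict[str, bool], run_b: Dict[str, bool]) -> List[str]:
    """Return tasks that passed in exactly one of the two runs.

    Tasks that always pass or always fail carry less diagnostic signal than
    tasks whose outcome flips between otherwise identical runs.
    """
    shared = run_a.keys() & run_b.keys()
    return sorted(task for task in shared if run_a[task] != run_b[task])


if __name__ == "__main__":
    first = {"task_a": True, "task_b": False, "task_c": True}
    second = {"task_a": True, "task_b": True, "task_c": False}
    print(partial_pass_tasks(first, second))
```

Comparing the traces of a passing and a failing rollout of the same task isolates what differed between them, which is a narrower signal than comparing across different tasks.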

Component‑wise ablations on the evolved Harness showed that the evolved Memory component alone recovered >95 % of the total gain and that Tool improvements helped medium‑difficulty tasks, whereas the evolved System Prompt alone caused a performance drop. This suggests that factual components (Memory, Tools) transfer better than strategic prompts.

Conclusion

When models are sufficiently capable, constructing a structured, observable evolution environment is more critical than hand‑crafting Harnesses. Providing the Evolve Agent with a clear workspace, explicit modification interfaces, and high‑quality feedback enables automatic convergence toward engineering‑level solutions without prescribing specific methodologies.
