First Survey of Agent Harnesses: What Powers Agents Beyond the Model?
This article summarizes a recent survey of Agent Harness engineering. The survey argues that real‑world agent instability stems from system‑level factors beyond model capability, introduces the seven‑layer ETCLOVG architecture, reports benchmark gains from harness changes, maps open‑source projects onto the framework, and outlines five open research directions.
Agent systems often become unstable when deployed on real‑world tasks, and the instability cannot be explained solely by limited model capability.
On standard benchmarks, models produce high‑quality answers and demonstrate strong tool use, planning, and coding abilities. In production environments, however, tasks are longer, tool sets grow, state accumulates across rounds, and execution results depend on file systems, browsers, terminals, code repositories, and permission systems.
Benchmark Improvements from Harness Modifications
Without changing model weights, adjusting the edit‑tool format and the surrounding harness is reported to yield up to a 10× improvement on coding benchmarks.
With the GPT‑5.2‑Codex model held fixed, system‑prompt reconstruction, middleware context injection, and self‑verification raise Terminal‑Bench 2.0 scores from 52.8% to 66.5%, a gain of 13.7 percentage points.
Meta‑Harness automates harness optimization and reaches 76.4% on Terminal‑Bench 2.0.
These results show that system‑level design can dramatically change benchmark performance for executable agents.
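As a rough illustration of how self‑verification and middleware context injection can sit around a fixed model, the sketch below retries a task and feeds verifier feedback back into the next attempt. It is not the setup used in the cited experiments: `run_with_self_verification`, `toy_model`, and `toy_verifier` are hypothetical names, and the "model" is a deterministic stand‑in.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    answer: str
    passed: bool
    feedback: str


def run_with_self_verification(
    task: str,
    propose: Callable[[str, List[str]], str],
    verify: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> List[Attempt]:
    """Propose an answer, verify it, and inject failure feedback into the next round."""
    injected_context: List[str] = []  # feedback added by the harness, not by the model
    attempts: List[Attempt] = []
    for _ in range(max_rounds):
        answer = propose(task, injected_context)
        passed, feedback = verify(answer)
        attempts.append(Attempt(answer, passed, feedback))
        if passed:
            break
        injected_context.append(feedback)
    return attempts


# Hypothetical stand-ins so the sketch runs without any model or terminal.
def toy_model(task: str, context: List[str]) -> str:
    # Only produces the better command once the harness has injected feedback.
    return "ls -la" if context else "ls"


def toy_verifier(answer: str) -> Tuple[bool, str]:
    if "-la" in answer:
        return True, "ok"
    return False, "output must include hidden files"


attempts = run_with_self_verification("list all files", toy_model, toy_verifier)
for a in attempts:
    print(a)
```

The model function is identical in both rounds; only the context supplied by the harness changes, which is the kind of system‑level lever this section describes.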
Shift in Engineering Focus
Agent engineering has migrated through three stages: prompt engineering (optimizing single inputs), context engineering (managing multi‑step information flow), and harness engineering, which integrates execution environments, tool interfaces, state maintenance, feedback validation, and governance into a unified system.
Three Core Themes of the Survey
Agent Harness should be treated as an independent system layer; real reliability is shaped by execution control, feedback loops, governance, evaluation, and operations design.
The ETCLOVG seven‑layer architecture (Execution, Tooling, Context, Lifecycle, Observability, Verification, Governance) places these components in a single framework.
An open‑source sample repository maps over a hundred public projects and technical items onto this architecture, revealing dense and sparse areas of the current ecosystem.
ETCLOVG Seven‑Layer Architecture
Execution (E): Sandboxes provide security isolation, reproducibility, and continuous execution. Types include generic hosted sandboxes, computer environments, code‑specific sandboxes, framework runtimes, browser evaluation environments, OS‑level permission sandboxes, and abstract sandbox layers.
Tooling (T): Defines how agents discover, describe, and invoke external capabilities. Standards such as MCP, A2A, function calling, OpenAPI, and AGENTS.md address different interface concerns and jointly support tool usage, external capability integration, cross‑agent collaboration, and repository‑level constraints.
Context (C): Addresses long‑task challenges beyond window size, including context decay and drift. Discussed mechanisms include KV caches, structured notes, file‑based planning, and vector or graph memory systems.
Lifecycle (L): Manages task flow across multiple model and tool calls, maintaining execution state. Explores trade‑offs between stateless replay and stateful execution, and extends orchestration from single‑agent loops to multi‑agent collaboration and full issue‑to‑pull‑request pipelines.
Observability (O): Captures execution traces, monitoring, cost statistics, and reliability engineering. Systems such as Langfuse, OpenTelemetry, OpenLLMetry, Phoenix, and MLflow illustrate how agent runs are turned into traceable, measurable, and diagnosable signals.
Verification (V): Defines a five‑stage task‑to‑feedback lifecycle: benchmark alignment, pre‑execution checks, controlled execution, multi‑level judgment (result, trace, evaluator), and continuous regression. Verification itself becomes part of the feedback loop.
Governance (G): Constrains agent behavior at model, system, and organization layers, covering permission and identity management, lifecycle hooks, component hardening, declarative constitutions, audit logs, and human‑in‑the‑loop mechanisms.
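To make the layering more concrete, here is a deliberately tiny sketch in which a single step passes through a governance gate, a tool call, a verification check, a trace record, and a note. It is an assumption‑laden toy rather than the survey's reference design: `MiniHarness` and its fields are hypothetical, and the "execution" is a plain function call with no real sandbox.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class MiniHarness:
    tools: Dict[str, Callable[[str], str]]                     # Tooling: named capabilities
    allowed: Set[str]                                          # Governance: permission list
    notes: List[str] = field(default_factory=list)             # Context: structured notes
    trace: List[Dict[str, str]] = field(default_factory=list)  # Observability: run trace

    def step(self, tool: str, arg: str, check: Callable[[str], bool]) -> str:
        if tool not in self.allowed:                           # Governance gate
            self.trace.append({"tool": tool, "status": "denied"})
            return "denied"
        result = self.tools[tool](arg)                         # Execution (no real sandbox here)
        ok = check(result)                                     # Verification
        self.trace.append({"tool": tool, "status": "ok" if ok else "failed"})
        self.notes.append(f"{tool}({arg}) -> {result}")
        return result

    def run(self, plan: List[Tuple[str, str]], check: Callable[[str], bool]) -> List[str]:
        # Lifecycle: carry notes and trace across several steps of one task.
        return [self.step(tool, arg, check) for tool, arg in plan]


harness = MiniHarness(
    tools={"upper": str.upper, "delete": lambda path: f"deleted {path}"},
    allowed={"upper"},  # "delete" is registered but not permitted
)
results = harness.run([("upper", "hello"), ("delete", "/tmp/x")], check=lambda out: out != "")
print(results)
print(harness.trace)
```

Even at this scale, the outcome of a step depends on the permission list, the verifier, and the recorded state, not just on what the model asked for.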
Open‑Source Project Sample Library
The authors compiled a public sample library at https://github.com/Picrew/awesome-agent-harness containing projects drawn from GitHub, papers, curated lists, package registries, and engineering blogs. Each project is classified into one of the seven layers, with the following counts: Execution 20, Tooling 12, Context 9, Lifecycle 47, Observability 15, Verification 21, Governance 14. Lifecycle is by far the densest layer and Context the sparsest.
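The per‑layer counts above can be turned into shares with a few lines of arithmetic. The snippet assumes the seven counts can simply be summed; the source does not say whether a project may appear in more than one layer, so the total is best read as a count of classifications.

```python
# Per-layer counts as reported in the article.
counts = {
    "Execution": 20,
    "Tooling": 12,
    "Context": 9,
    "Lifecycle": 47,
    "Observability": 15,
    "Verification": 21,
    "Governance": 14,
}

total = sum(counts.values())
shares = {layer: round(100 * n / total, 1) for layer, n in counts.items()}
densest = max(counts, key=counts.get)
sparsest = min(counts, key=counts.get)

print(f"total classifications: {total}")
for layer, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{layer:<14}{counts[layer]:>3}  {pct:>5}%")
print(f"densest: {densest}, sparsest: {sparsest}")
```

Under that assumption the total is 138, with Lifecycle accounting for roughly a third of the entries and Context for under 7%.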
Cross‑Layer Challenges and Open Directions
Harness design must balance cost, quality, and speed, and trade capability against control. A component optimized in isolation (a prompt, tool, sandbox, verifier, or monitor) can change overall system behavior once it is placed in a full execution loop. The survey links this to the shift from framework‑centric to platform‑centric agent ecosystems, in which platforms add persistent workspaces, identity management, observability, evaluation, governance, and human hand‑off mechanisms.
The paper proposes five open research directions:
Strengthen and extend the execution base to address sandbox escape, prompt injection, multi‑tool risks, and consistency across deployment modes.
Maintain reliable state for long‑running tasks, mitigating compression, retrieval, forgetting, and update drift.
Diagnose failures based on execution traces, distinguishing model, tool, environment, and evaluation failures.
Standardize hand‑off mechanisms between agents, tools, and humans to reliably transfer context, state, permissions, artifacts, and unresolved decisions.
Adapt harnesses as model capabilities evolve, identifying which mechanisms remain necessary and which become redundant overhead.
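One way to picture the hand‑off direction is as an explicit, serializable record that carries the five items named above. The schema below is purely illustrative: `Handoff` and its field names are hypothetical and do not correspond to any existing standard.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Handoff:
    sender: str
    receiver: str
    context: str                                             # summary of what has happened
    state: Dict[str, str] = field(default_factory=dict)      # e.g. branch, working directory
    permissions: List[str] = field(default_factory=list)     # what the receiver may do
    artifacts: List[str] = field(default_factory=list)       # files, patches, logs
    unresolved_decisions: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "Handoff":
        return cls(**json.loads(payload))

    def needs_human(self) -> bool:
        # Open decisions are a simple signal that a person should review the hand-off.
        return bool(self.unresolved_decisions)


handoff = Handoff(
    sender="coding-agent",
    receiver="human-reviewer",
    context="Patch drafted for failing test; two approaches considered.",
    state={"branch": "fix/flaky-test"},
    permissions=["read_repo"],
    artifacts=["patch.diff"],
    unresolved_decisions=["Keep retry logic or remove it?"],
)
restored = Handoff.from_json(handoff.to_json())
print(restored.needs_human())
```

Making unresolved decisions a first‑class field means they survive the transfer instead of being buried in free‑text context.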
For long‑task agents, the underlying model remains important, but the engineering quality of the harness has become a critical factor for real‑world reliability. Understanding trade‑offs across layers and reassessing which mechanisms are still necessary as models improve are both essential for building production‑grade agent systems.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
