Why Code Is the Core of Agent Harness: Deep Insights from UIUC, Meta, and Stanford

This article explains how code serves as the executable, inspectable, and stateful medium that links reasoning, action, feedback, verification, and collaboration in long‑horizon AI agents. It covers the harness interface, the plan‑execute‑verify loop, multi‑agent coordination, and open research challenges.

Machine Learning Algorithms & Natural Language Processing

Jun 10, 2026

Why a Harness Needs Code as Its Core Carrier

Large‑language‑model coding agents have moved beyond merely writing code; they must read repositories, plan, modify files, run commands, inspect errors, and maintain context over long horizons. The execution system that makes such agents reliable is called an Agent Harness.

A recent 102‑page survey, Code as Agent Harness, by researchers from UIUC, Meta, and Stanford, asks: when an agent operates in a long‑horizon task environment, what object ties together reasoning, action, feedback, verification, and collaboration? The answer is code.

Code as the Central Medium

In this context, code is not just the final program generated by the model, nor the code that implements the harness itself. It refers to the series of code‑based artifacts the agent continuously creates, runs, modifies, saves, and shares: Plan.md, test scripts, shell commands, patches, execution logs, workflows, skill libraries, simulators, validators, and even the shared repository state.

Traditional code generation treats code as the end product. In an Agent Harness, code becomes part of the execution loop, carrying plans, execution, feedback, verification, and state management, thus becoming the most stable and manipulable state carrier.

Why Code Fits the Harness Interface

A pure LLM is stateless: it generates the next token from its context but does not preserve task progress or external state. The harness connects the model to a real execution environment.

Code possesses three properties that natural language lacks:

Executable: Model intent can be turned into real actions such as shell commands, patches, or test scripts.

Inspectable: Execution produces objective feedback, including compile errors, runtime errors, test results, logs, and traces.

Stateful: Progress can be persisted in repositories, file systems, configurations, test results, commit histories, and skill libraries.

Therefore, this survey differs from previous harness overviews by placing code at the center: code is the most stable, operable state carrier in a harness.
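
The three properties can be illustrated with a minimal sketch. It is not taken from the survey: the function name `run_step` and the file name `harness_state.json` are hypothetical, and the "model intent" is hard-coded as two small commands. Each step runs a real command (executable), records its exit code and output (inspectable), and appends the record to a file on disk (stateful).

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_step(command: list[str], state_file: Path) -> dict:
    """Run one command, capture its feedback, and append the record to a state file."""
    # Executable: the intent is expressed as a command that actually runs.
    result = subprocess.run(command, capture_output=True, text=True)

    # Inspectable: the exit code and output are objective feedback.
    record = {
        "command": command,
        "exit_code": result.returncode,
        "stdout": result.stdout.strip(),
        "stderr": result.stderr.strip(),
    }

    # Stateful: progress is persisted outside the model's context window.
    history = json.loads(state_file.read_text()) if state_file.exists() else []
    history.append(record)
    state_file.write_text(json.dumps(history, indent=2))
    return record


# Hypothetical usage: one step that succeeds and one that fails.
workdir = Path(tempfile.mkdtemp())
state = workdir / "harness_state.json"
ok = run_step([sys.executable, "-c", "print(2 + 2)"], state)
bad = run_step([sys.executable, "-c", "raise SystemExit(3)"], state)
```

After both calls, the state file holds two records that a later step, or a different agent, could read back without relying on the model's memory.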

How Code Bridges the Harness Interface

Reasoning becomes executable: Earlier agents relied on natural‑language chain‑of‑thought reasoning, which is hard to verify. Methods such as Program of Thoughts (PoT) and Program‑Aided Language models (PAL), along with proof assistants such as Lean and Coq, externalize reasoning as programs that an interpreter or proof checker can execute.

Action becomes concrete: Coding agents such as Claude Code and Codex translate language goals into file modifications, test runs, error inspection, and patch generation, while SayCan, Code as Policies, and Voyager ground language goals in robot skills, policy code, and reusable in‑game skills.

Environment becomes modelable: Agents need a representation of the external world. Repositories, test outcomes, execution logs, DOM trees, simulators, and data‑analysis scripts serve as structured views of the environment. Benchmarks like SWE‑bench and AgentBench embody this principle.

Once code enters the harness interface, reasoning, action, and environment are no longer abstract text but executable, inspectable, and updatable state.
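
As a rough illustration of reasoning externalized as a program, in the spirit of PoT and PAL, the sketch below runs a model-written snippet in a separate interpreter and returns either its answer or its error. The `MODEL_PROGRAM` string and the `execute_reasoning` helper are hypothetical stand-ins. No model is called, and a subprocess with a timeout is not a security sandbox.

```python
import subprocess
import sys

# Hypothetical model output: the reasoning is written as a program, not as prose.
MODEL_PROGRAM = """
price = 120
discount = 0.25
tax = 0.10
total = price * (1 - discount) * (1 + tax)
print(round(total, 2))
"""


def execute_reasoning(program: str, timeout_s: float = 5.0) -> tuple[bool, str]:
    """Run a model-written program in a separate interpreter; return (ok, output)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if result.returncode != 0:
        # The traceback is the feedback a harness could hand back to the model.
        return False, result.stderr.strip()
    return True, result.stdout.strip()


ok, answer = execute_reasoning(MODEL_PROGRAM)
failed, error = execute_reasoning("print(1 / 0)")
```

The point of the design is that a wrong step surfaces as a concrete interpreter error rather than as a plausible-sounding sentence.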

Managing State and Feedback with Code

Real tasks rarely finish in a single step; bug fixing may require multiple locate‑modify‑test‑rollback cycles, and complex workflows span many tools. The key is not a stronger model but an agent whose steps are organized into a controllable execution loop.

The core loop is Plan‑Execute‑Verify:

Planning becomes concrete artifacts such as Plan.md, workflows, or executable task graphs.

Memory extends beyond larger context windows to include stored repository evidence, logs, failure experiences, and compressed patches.

Tool use moves from simple API calls to terminal commands, sandboxes, test frameworks, and static analyzers that modify the external world.

Systems like SWE‑agent and OpenHands illustrate this loop by repeatedly performing “write code → run → fail → fix” as a state transition process. Errors, test failures, and logs act as feedback sensors that guide the agent toward convergence.
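
The loop can be sketched in a few lines under strong simplifying assumptions. The candidate patches below are pre-written stand-ins for model proposals, and `verify`, `LoopState`, and `plan_execute_verify` are hypothetical names rather than the API of SWE‑agent or OpenHands. The code also uses in-process `exec` only for brevity; a real harness would run the code in a sandbox and feed the feedback string back to the model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LoopState:
    plan: list[str]
    attempts: list[dict] = field(default_factory=list)


# Stand-ins for successive model proposals (assumption: no model is called here).
CANDIDATE_PATCHES = [
    "def median(xs):\n    return xs[len(xs) // 2]\n",  # forgets to sort
    "def median(xs):\n    s = sorted(xs)\n    return s[len(s) // 2]\n",  # wrong for even length
    "def median(xs):\n    s = sorted(xs)\n    m = len(s) // 2\n"
    "    return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2\n",
]


def verify(source: str) -> Optional[str]:
    """Return None if every check passes, otherwise a feedback message."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # execute the candidate patch
        median = namespace["median"]
        for xs, expected in [([3, 1, 2], 2), ([4, 1, 3, 2], 2.5)]:
            got = median(xs)
            if got != expected:
                return f"median({xs}) returned {got}, expected {expected}"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def plan_execute_verify(patches: list[str], max_steps: int = 5) -> LoopState:
    state = LoopState(plan=["locate bug", "patch median", "run checks"])
    for step, source in enumerate(patches[:max_steps], start=1):
        feedback = verify(source)
        # Each attempt and its feedback are recorded as explicit state.
        state.attempts.append({"step": step, "feedback": feedback})
        if feedback is None:
            break  # verification passed, so the loop has converged
    return state


final_state = plan_execute_verify(CANDIDATE_PATCHES)
```

Here the first two attempts fail with specific messages and the third passes, so the recorded attempts form the kind of state transition history described above.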

Code as the Shared Foundation for Multi‑Agent Collaboration

When a task exceeds a single agent’s capability, multiple agents assume roles such as manager, planner, coder, tester, and reviewer. The challenge is not merely “more models talking” but how they share a common world state.

If agents only exchange chat logs, their understanding of the codebase diverges, leading to inconsistent views of modifications, test outcomes, and execution traces. Shared code artifacts—repositories, tests, pull requests, CI logs, review comments, and execution traces—provide a stable common ground.

Thus, the shared language of a multi‑agent system should be executable, versioned code rather than natural‑language dialogue.
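
One way to picture a shared, versioned world state is the toy workspace below. It is an illustrative assumption, not a design from the survey: `SharedWorkspace` and `StaleRevisionError` are hypothetical, and a real system would more likely rely on a version control system and CI. Every change must name the revision it was based on, so an agent working from an outdated view is rejected instead of silently overwriting another agent's work.

```python
from dataclasses import dataclass, field


class StaleRevisionError(Exception):
    """Raised when an agent commits against an outdated view of the workspace."""


@dataclass
class SharedWorkspace:
    files: dict[str, str] = field(default_factory=dict)
    revision: int = 0
    log: list[tuple[int, str, str]] = field(default_factory=list)

    def commit(self, agent: str, path: str, content: str, base_revision: int) -> int:
        # Reject changes that were prepared against an older revision.
        if base_revision != self.revision:
            raise StaleRevisionError(
                f"{agent} based its change on revision {base_revision}, "
                f"but the workspace is at revision {self.revision}"
            )
        self.files[path] = content
        self.revision += 1
        self.log.append((self.revision, agent, path))
        return self.revision


ws = SharedWorkspace()
coder_view = ws.revision   # both agents read revision 0
tester_view = ws.revision

ws.commit("coder", "app.py", "def add(a, b):\n    return a + b\n", coder_view)

try:
    ws.commit("tester", "test_app.py", "assert add(1, 2) == 3\n", tester_view)
    conflict = None
except StaleRevisionError as exc:
    conflict = str(exc)

# The tester refreshes its view and retries against the current revision.
ws.commit("tester", "test_app.py", "assert add(1, 2) == 3\n", ws.revision)
```

The commit log then gives every agent the same ordered record of who changed what, which chat transcripts alone cannot guarantee.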

From Claude Code to General Agent Operating Systems

Code‑centric harnesses first emerged in coding agents because software is naturally executable, testable, reversible, and easy to log. The same principle extends to other domains: GUI/OS agents, where DOM trees, accessibility trees, and Playwright scripts make the interface programmable; robotics, where intents map to skill libraries, control scripts, and simulation feedback; and scientific discovery, where hypotheses, simulations, data analysis, and experiment records form code pipelines.

Future agents, even those not explicitly called “coding agents,” will likely run on some form of code‑centric harness.

Open Problems: Evaluating the Next Generation of Agents

Long‑horizon agents require evaluation metrics that go beyond final outcomes. An agent may pass a test yet make dangerous modifications or introduce hidden regressions; another may fail the task but leave a clear, recoverable execution trace. Benchmarks must therefore assess planning, tool invocation, state transitions, and feedback usage.

Key open questions include:

How to perform harness‑level evaluation that measures intermediate plans, tool calls, and state changes?

How to handle incomplete feedback where passing tests does not guarantee correctness?

How to achieve regression‑free self‑evolution without introducing new failure modes?

How to resolve semantic conflicts when multiple agents share state?

How to make human‑in‑the‑loop interactions auditable, accountable, and verifiable?
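
To make the first two questions concrete, here is a small sketch of trace-level scoring. Everything in it is a hypothetical illustration: the `Step` record, the `PROTECTED_PREFIXES` policy, and the `acceptable` rule are assumptions, not metrics proposed by the survey. It shows how a run that passes its tests can still be flagged because of what it did along the way.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str      # "plan", "tool_call", or "edit"
    target: str    # plan file, command, or edited path
    ok: bool = True


# Hypothetical policy: the agent should not edit tests or CI configuration.
PROTECTED_PREFIXES = ("tests/", ".github/")


def evaluate_trace(steps: list[Step], tests_passed: bool) -> dict:
    """Score a run on its intermediate steps, not only on the final test result."""
    edits = [s for s in steps if s.kind == "edit"]
    tool_calls = [s for s in steps if s.kind == "tool_call"]
    protected_edits = [s.target for s in edits if s.target.startswith(PROTECTED_PREFIXES)]
    return {
        "tests_passed": tests_passed,
        "has_plan": any(s.kind == "plan" for s in steps),
        "failed_tool_calls": sum(1 for s in tool_calls if not s.ok),
        "protected_edits": protected_edits,
        # Passing tests is necessary but not sufficient under this policy.
        "acceptable": tests_passed and not protected_edits,
    }


# A run that "passes" by weakening a test instead of fixing the code.
suspicious_run = [
    Step("plan", "Plan.md"),
    Step("tool_call", "pytest", ok=False),
    Step("edit", "tests/test_api.py"),
    Step("tool_call", "pytest", ok=True),
]
report = evaluate_trace(suspicious_run, tests_passed=True)
```

In this example the final outcome alone would count as a success, while the trace-level report marks the run as unacceptable.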

The next step for AI agents is not merely answering better but making the entire code‑driven execution process more inspectable, recoverable, and governable.