Agent Harness: A Deep Dive into AI Agent Architecture
The article defines Agent Harness as the full software infrastructure that wraps LLMs to enable stateful, tool‑using agents, breaks it down into twelve concrete components, compares implementations from Anthropic, OpenAI, LangChain and others, and outlines key engineering decisions that affect performance, safety and scalability.
What Is an Agent Harness?
Agent Harness is the complete software layer that surrounds a large language model (LLM) and turns a stateless model into a capable, autonomous agent. Anthropic’s Claude Code documentation calls the SDK “the agent harness that drives Claude Code,” and OpenAI’s Codex team treats the terms “agent” and “harness” as synonymous for the non‑model infrastructure that makes the model useful.
Vivek Trivedy of LangChain succinctly states, “If you are not the model itself, you are the harness.” In this view, the agent is the emergent behavior (goal‑directed, tool‑using, self‑correcting entity), while the harness is the machinery that produces that behavior.
Analogy and Motivation
Beren Millidge (2023) likens a bare LLM to a CPU without memory, storage, or I/O. The context window is fast but limited memory, an external database is the hard‑disk, tool integrations are device drivers, and the harness is the operating system. As models improve, the harness should become thinner, but it will never disappear because every powerful model still needs context management, tool execution, state persistence, and verification.
Three Engineering Layers Around the Model
Prompt‑engineering : designs the instructions the model receives.
Context‑engineering : decides what the model sees and when, preventing “context decay” where key information in the middle of the window drops performance by >30% (Chroma study, cited by Stanford’s “Lost in the Middle”).
Harness‑engineering : combines the first two layers with full application infrastructure—tool orchestration, state persistence, error handling, safety guardrails, and lifecycle management.
Production‑Grade Harness: Twelve Core Components
Orchestration Loop : implements the Think‑Act‑Observe (TAO) or ReAct cycle—assemble prompt, call LLM, parse output, invoke tools, feed results back, repeat. Anthropic describes this as a “simple while loop” where all complexity lives in the loop management.
Tools : defined by schema (name, description, parameters) and injected into the LLM context. Claude Code offers six tool categories (file ops, search, execution, web access, code intelligence, sub‑agent generation); OpenAI’s Agents SDK provides function tools (@function_tool), hosted tools (WebSearch, CodeInterpreter, FileSearch), and MCP server tools.
Memory : short‑term (dialogue history) and long‑term (persistent across sessions). Anthropic stores long‑term data in CLAUDE.md and MEMORY.md files; LangGraph uses a namespaced JSON store; OpenAI supports SQLite or Redis‑backed sessions.
Context Management : mitigates context corruption. Production strategies include compression (summarizing near‑limit history), observation masking (hiding old tool output while keeping calls visible), on‑demand retrieval (grep, glob, head/tail instead of loading full files), and sub‑agent delegation (returning 1‑2 k token summaries).
Prompt Construction : layers system prompt, tool schemas, memory files, dialogue history, and current user message. OpenAI’s Codex prioritises system messages, then tool definitions, developer instructions, user input (capped at 32 KiB), and finally conversation history.
Output Parsing : modern harnesses rely on the LLM returning a structured tool_calls object; if absent, the output is final. Structured responses can be constrained with Pydantic schemas; legacy parsers like RetryWithErrorOutputParser remain for edge cases.
State Management : LangGraph models state as a typed dictionary flowing through graph nodes, with reducers merging updates and checkpoints enabling resume and time‑travel debugging. OpenAI offers four mutually exclusive strategies (in‑app memory, SDK session, Conversations API, lightweight previous_response_id linking). Claude Code uses git commits as checkpoints and progress files as structured drafts.
Error Handling : a 10‑step flow with 99 % per‑step success yields ~90 % end‑to‑end success, illustrating error accumulation. Harnesses classify errors as instant (with back‑off), LLM‑recoverable (returned as tool messages), user‑fixable, or unexpected (bubbled up for debugging). Anthropic returns failed tool calls as error results; Stripe caps retries at two.
Safety Guardrails : OpenAI SDK implements input, output, and tool guardrails plus a circuit‑breaker that aborts the agent immediately. Anthropic separates permission execution from model reasoning, with three phases: trust establishment, per‑call permission check, and high‑risk user confirmation.
Verification Loop : the boundary between toy demos and production agents. Anthropic recommends rule‑based feedback, visual checks (Playwright screenshots), and LLM‑as‑judge sub‑agents. Claude Code’s creator Boris Cherny notes that giving the model a way to verify its own work can improve quality 2‑3×.
Sub‑Agent Orchestration : supports three execution models—Fork (byte‑level copy of parent context), Teammate (independent terminal panel with file‑based mailbox), and Worktree (each agent has its own git worktree). OpenAI treats agents as tools or as hand‑off targets; LangGraph nests sub‑agents as state‑graph nodes.
How a Full Loop Operates (Step‑by‑Step)
Prompt Assembly : Harness builds the full input—system prompt, tool schemas, memory files, dialogue history, and the new user message. Critical context is placed at the beginning and end to avoid “lost in the middle.”
LLM Inference : The assembled prompt is sent to the model API, which returns text, tool calls, or both.
Output Classification : If only text is returned, the loop ends; if a tool call is present, execution proceeds; if a hand‑off is requested, the current agent is swapped and the loop restarts.
Tool Execution : Harness validates parameters, checks permissions, runs the tool in a sandbox, and captures the result. Read‑only calls may run concurrently; write calls are serialized.
Result Packaging : Tool results are formatted as LLM‑readable messages; errors are returned as error results so the model can self‑correct.
Context Update : Results are appended to the dialogue history; when the context window nears its limit, compression is triggered.
Loop Continuation : Return to step 1 until a termination condition is met (no tool call, max rounds, token budget exhausted, guardrail trigger, user interrupt, or safety refusal).
For long‑running tasks that span multiple windows, Anthropic’s two‑stage “Ralph Loop” first establishes the environment (initial script, progress file, feature list, git commit) and then repeatedly reads the git log to locate the next high‑priority unfinished feature, committing a summary after each step.
Framework Implementations
Anthropic Claude Agent SDK : exposes the harness via a single query() function that returns an async iterator of streamed messages. The runtime is a “simple loop” with all intelligence inside the model.
OpenAI Agents SDK : provides a Runner class supporting async, sync, and streaming modes. The Codex harness adds three layers—Core (agent code + runtime), App Server (JSON‑RPC API), and UI (CLI/VS Code/web).
LangGraph : models the harness as an explicit state graph with nodes for LLM calls and tool nodes, replacing the deprecated AgentExecutor from LangChain. LangChain’s “Deep Agents” explicitly use the term “agent harness.”
CrewAI : implements role‑based multi‑agent orchestration (Agent, Task, Crew) with a “deterministic skeleton” that manages routing and verification.
AutoGen (Microsoft Agent Framework) : offers a three‑layer architecture (Core, AgentChat, Extensions) and five orchestration modes (sequential, concurrent fan‑out/in, group chat, hand‑off, and magentic task‑board coordination).
Key Design Decisions for Every Harness
Single vs. Multi‑Agent : Both Anthropic and OpenAI advise perfecting a single agent first; multi‑agent adds routing overhead and context loss. Split only when >10 overlapping tools or clearly independent task domains exist.
ReAct vs. Plan‑Execute : ReAct interleaves reasoning and action (flexible but higher per‑step cost). Plan‑Execute separates planning from execution; LLMCompiler reports a 3.6× speedup over sequential ReAct.
Context Window Management : Five production strategies—time‑based eviction, dialogue summarisation, observation masking, structured notes, sub‑agent delegation. ACON research shows that keeping reasoning traces while discarding raw tool output reduces token usage by 26‑54 % with >95 % accuracy.
Verification Loop Design : Compute‑based verification (tests, static checkers) offers deterministic ground truth; inference‑based verification (LLM as judge) captures semantic issues but adds latency. Martin Fowler’s ThoughtWorks framework separates “guides” (pre‑action prompts) from “sensors” (post‑action feedback).
Permission & Security Model : Choose between permissive (fast, higher risk) and restrictive (slow, safer) based on deployment scenario.
Tool Scope Strategy : More tools generally degrade performance. Vercel removed 80 % of tools and saw improvements; Claude Code uses lazy loading to cut context by 95 %.
Harness Thickness : Decides how much logic resides in the harness versus the model. Anthropic bets on a thin harness that shrinks as models improve; graph‑based frameworks bet on explicit control. Anthropic periodically removes planning steps from Claude Code as newer models internalise those capabilities.
Performance Impact
Changing only the harness—leaving the model weights untouched—moved an agent’s ranking from outside the top 30 to #5 on TerminalBench 2.0. Another research project let the LLM optimise its own infrastructure, achieving a 76.4 % success rate, surpassing manually engineered systems.
Conclusion
Agent Harness is the hard‑core engineering layer that makes autonomous agents reliable, scalable, and safe. Even as LLMs become more capable, a well‑designed harness remains essential for managing scarce context, orchestrating tool calls, persisting state, and verifying work. When an agent fails, the first place to look is not the model but the harness.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
