Why Identical LLMs Behave So Differently: Inside the Agent Harness Architecture
The article dissects the Agent Harness concept—covering its definition, three engineering layers, twelve production‑grade components, detailed orchestration loops, context‑management tricks, verification strategies, and how frameworks like Anthropic, OpenAI, LangChain, CrewAI and AutoGen implement these patterns, revealing why the same model can yield wildly different results.
What Is an Agent Harness?
First coined in early 2026, an Agent Harness is the complete software infrastructure that wraps an LLM, handling orchestration loops, tools, memory, context management, state persistence, error handling and safety guardrails. Anthropic describes its SDK as “the Agent Harness that drives Claude Code,” while OpenAI’s Codex team equates the terms “Agent” and “Harness” to refer to the non‑model infrastructure that makes an LLM usable.
Vivek Trivedy of LangChain puts it succinctly: “If you’re not the model, you’re the Harness.” The distinction is that an Agent is the emergent, goal‑directed behavior, whereas the Harness is the machinery that enables that behavior.
Three Engineering Layers
Prompt Engineering: crafting the exact instructions the model receives.
Context Engineering: deciding what the model sees and when.
Harness Engineering: combining both of the above and adding tool orchestration, state persistence, error recovery, validation loops, safety, and lifecycle management.
The Harness is not a simple wrapper around prompts; it is the full system that makes autonomous agents possible.
12 Core Components of a Production‑Grade Harness
The Orchestration Loop: Implements the Think‑Act‑Observe (TAO) or ReAct cycle—assemble prompt → call LLM → parse output → execute tool → feed result back → repeat. The complexity lies in managing the loop, not in the loop itself (a runnable sketch follows the loop walk‑through below).
Tools: Defined by name, description, and parameter types, and injected into the LLM context. The Harness handles registration, schema validation, sandboxed execution, and result capture and formatting. Claude Code offers six tool categories (file ops, search, exec, web, code intelligence, sub‑agent generation); OpenAI’s SDK supports function tools, hosted tools, and MCP server tools.
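To make the shape concrete, here is a minimal sketch of registration and schema exposure; `ToolRegistry` and the `read_file` tool are illustrative names, not any particular SDK’s API:

```python
import json
from typing import Callable

class ToolRegistry:
    """Hypothetical registry: holds tools and renders their schemas for the LLM."""
    def __init__(self):
        self._tools: dict[str, tuple[Callable, dict]] = {}

    def register(self, name: str, fn: Callable, description: str, parameters: dict):
        # The schema follows the JSON-Schema shape most tool-calling APIs expect.
        self._tools[name] = (fn, {
            "name": name,
            "description": description,
            "parameters": parameters,
        })

    def schemas(self) -> str:
        """Serialized schemas, injected into the LLM context each turn."""
        return json.dumps([schema for _, schema in self._tools.values()])

    def execute(self, name: str, args: dict):
        fn, _ = self._tools[name]
        return fn(**args)  # real harnesses validate args and sandbox this call

registry = ToolRegistry()
registry.register(
    "read_file",
    lambda path: open(path).read(),
    "Read a UTF-8 text file and return its contents.",
    {"type": "object",
     "properties": {"path": {"type": "string"}},
     "required": ["path"]},
)
```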
Memory: Operates on short‑term (single‑session dialogue) and long‑term (cross‑session persistence). Anthropic stores CLAUDE.md and MEMORY.md files; LangGraph uses namespaced JSON stores; OpenAI can back memory with SQLite or Redis.
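A minimal sketch of the two tiers, using an in‑process list for the session and SQLite for persistence; the class and schema here are illustrative:

```python
import sqlite3

class Memory:
    """Short-term: in-process message list. Long-term: SQLite key-value store."""
    def __init__(self, db_path: str = "agent_memory.db"):
        self.history: list[dict] = []  # discarded when the session ends
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS facts (key TEXT PRIMARY KEY, value TEXT)"
        )

    def add_turn(self, role: str, content: str):
        self.history.append({"role": role, "content": content})

    def remember(self, key: str, value: str):
        # Survives process restarts, like a CLAUDE.md file or a Redis backend.
        self.db.execute("INSERT OR REPLACE INTO facts VALUES (?, ?)", (key, value))
        self.db.commit()

    def recall(self, key: str) -> str | None:
        row = self.db.execute(
            "SELECT value FROM facts WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None
```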
Context Management: Prevents “context decay,” where critical information buried in the middle of the window can degrade model performance by more than 30% (Chroma study, corroborated by Stanford’s “Lost in the Middle”). Production strategies include compaction, observation masking, just‑in‑time retrieval, and sub‑agent delegation.
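As one hedged illustration, compaction might be triggered like this; the token heuristic and `summarize` callback are stand‑ins, not a framework API:

```python
def maybe_compact(history: list[dict], summarize, budget: int = 100_000) -> list[dict]:
    """If the history nears the token budget, replace older middle turns with
    a single summary message, keeping the head (system/instructions) and the
    most recent turns verbatim to avoid losing critical edges of the window."""
    def rough_tokens(msgs):  # crude ~4 chars/token heuristic
        return sum(len(m["content"]) for m in msgs) // 4

    if rough_tokens(history) < budget:
        return history
    head, middle, tail = history[:2], history[2:-10], history[-10:]
    summary = summarize(middle)  # e.g., an LLM call that condenses the old turns
    return head + [{"role": "system",
                    "content": f"Summary of earlier work: {summary}"}] + tail
```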
Prompt Construction: Hierarchical assembly of system prompt, tool schemas, memory files, dialogue history, and the current user message. OpenAI’s Codex uses a strict priority stack (system message → tool definitions → developer instructions → user message → history).
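A sketch of hierarchical assembly under assumed message shapes; the function name and ordering details are illustrative, not Codex’s actual internals:

```python
def assemble_prompt(system: str, tool_schemas: str, memory_files: list[str],
                    history: list[dict], user_message: str) -> list[dict]:
    """Assemble messages in a fixed, deterministic order: system prompt,
    tool schemas, memory/developer context, prior dialogue, current user turn."""
    messages = [{"role": "system", "content": system}]
    messages.append({"role": "system", "content": f"Available tools:\n{tool_schemas}"})
    for doc in memory_files:  # e.g., the contents of CLAUDE.md-style files
        messages.append({"role": "system", "content": doc})
    messages.extend(history)
    messages.append({"role": "user", "content": user_message})
    return messages
```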
Output Parsing: Modern Harnesses rely on structured tool_calls objects rather than free‑text parsing. If no tool call is present, the loop terminates; otherwise, the tool is executed and the cycle continues. Pydantic models can enforce response schemas.
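A minimal sketch of schema‑enforced parsing with Pydantic; the `Reply` and `ToolCall` models are hypothetical:

```python
from pydantic import BaseModel, ValidationError

class ToolCall(BaseModel):
    name: str
    arguments: dict

class Reply(BaseModel):
    text: str | None = None
    tool_calls: list[ToolCall] = []

def parse_reply(raw: dict) -> Reply | None:
    """Validate the model's structured output; on failure the harness can feed
    the error back to the model for self-correction instead of crashing."""
    try:
        return Reply.model_validate(raw)
    except ValidationError as e:
        print(f"Schema violation fed back to model: {e}")
        return None

reply = parse_reply({"tool_calls": [{"name": "read_file",
                                     "arguments": {"path": "a.txt"}}]})
if reply:
    if reply.tool_calls:
        print(f"Execute: {reply.tool_calls[0].name}")  # loop continues
    else:
        print("No tool call: loop terminates.")
```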
State Management: LangGraph models state as typed dictionaries flowing through graph nodes, with reducers merging updates. Checkpoints are created at “super‑step” boundaries for resume and time‑travel debugging. OpenAI offers four mutually exclusive strategies (SDK Sessions, Conversations API, etc.). Claude Code uses git commits as checkpoints.
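A framework‑agnostic sketch of the reducer idea; the types and merge rules are illustrative, not LangGraph’s API:

```python
from typing import TypedDict

class AgentState(TypedDict):
    messages: list[dict]
    files_touched: set[str]

def reduce(state: AgentState, update: dict) -> AgentState:
    """Merge a node's partial update into the state instead of overwriting it,
    so concurrent nodes can each contribute without clobbering one another."""
    return AgentState(
        messages=state["messages"] + update.get("messages", []),
        files_touched=state["files_touched"] | update.get("files_touched", set()),
    )

checkpoints: list[AgentState] = []  # snapshot per "super-step" for resume/time-travel
state = AgentState(messages=[], files_touched=set())
state = reduce(state, {"messages": [{"role": "user", "content": "refactor utils.py"}]})
checkpoints.append(state)
```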
Error Handling: A 10‑step process with 99% per‑step success yields only ~90.4% end‑to‑end success (0.99¹⁰ ≈ 0.904), so failures compound quickly. LangGraph classifies errors into transient (retried with back‑off), LLM‑recoverable (returned as tool messages), user‑fixable (awaiting input), and unexpected (bubbled up). Anthropic returns failures as error results; Stripe limits retries to two.
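A sketch of that taxonomy with a Stripe‑style retry cap; the exception names and back‑off details are illustrative:

```python
import random
import time

class TransientError(Exception): ...       # retry with back-off
class LLMRecoverableError(Exception): ...  # return to the model as a tool message
class UserFixableError(Exception): ...     # pause the run and ask the user

def run_with_retries(fn, max_retries: int = 2):
    """Cap retries at two (per the Stripe example); back off only for the
    transient class. Recoverable errors become tool messages for the model;
    user-fixable and unexpected errors bubble up to the caller."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise
            time.sleep((2 ** attempt) + random.random())  # exponential back-off + jitter
        except LLMRecoverableError as e:
            return {"role": "tool", "content": f"Error: {e}"}  # model self-corrects
```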
Guardrails and Safety: OpenAI’s SDK implements input, output, and tool guardrails; a “tripwire” aborts the Agent instantly. Anthropic separates permission enforcement from model reasoning, managing ~40 discrete tool capabilities through staged trust, pre‑call checks, and explicit user confirmation for high‑risk actions.
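A toy tripwire sketch, loosely modeled on the input/output/tool split; the checks and class names are hypothetical, not the SDK’s types:

```python
class TripwireTriggered(Exception):
    """Raised to abort the agent run immediately."""

def input_guardrail(user_message: str):
    # Toy substring check; production guardrails may be model- or policy-based.
    if "DROP TABLE" in user_message.upper():
        raise TripwireTriggered("suspected injection in user input")

def tool_guardrail(tool_name: str, approved: set[str]):
    # High-risk tools require explicit confirmation before execution.
    if tool_name not in approved:
        raise TripwireTriggered(f"tool '{tool_name}' requires user confirmation")

try:
    input_guardrail("please drop table users")
except TripwireTriggered as e:
    print(f"Run aborted: {e}")
```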
Verification Loops: Critical for production agents. Anthropic recommends rule‑based feedback, visual UI checks (Playwright screenshots), and LLM‑as‑judge sub‑agents. Claude Code’s creator Boris Cherny notes that giving the model a way to verify its work can improve quality 2‑3×.
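One deterministic variant, sketched under the assumption that the project uses pytest; the wiring is illustrative:

```python
import subprocess

def verify_with_tests() -> tuple[bool, str]:
    """Run the project's test suite; the harness feeds failures back to the
    model as an observation so it can fix its own work before finishing."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        capture_output=True, text=True, timeout=300,
    )
    return result.returncode == 0, result.stdout + result.stderr

ok, report = verify_with_tests()
if not ok:
    observation = {"role": "tool", "content": f"Tests failed:\n{report[-2000:]}"}
    # ...append to history and loop again instead of declaring success
```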
Sub‑Agent Orchestration: Claude Code supports three execution models—Fork (byte‑level copy), Team (independent terminal panels with file‑based mailbox), and Worktree (each Agent has its own git worktree). OpenAI’s SDK adds agents‑as‑tools and handoffs; LangGraph treats sub‑agents as nested state graphs.
Termination Conditions: Hierarchical checks—no tool call, max rounds, token budget exhausted, guardrail tripwire, user interrupt, or safety refusal. Simple queries may finish in 1‑2 rounds; complex refactoring can span dozens of rounds and many tool calls.
Full Loop Walk‑through
1. Prompt Assembly: The Harness builds the full input (system prompt, tool schemas, memory files, history, user message), placing critical context at the start and end to avoid “lost in the middle.”
2. LLM Inference: The assembled prompt is sent to the model API, which returns text, tool calls, or both.
3. Output Classification: Pure text ends the loop; a tool call triggers execution; a handoff updates the active Agent and restarts.
4. Tool Execution: Parameters are validated, permissions checked, sandboxed execution runs, and results are captured. Read‑only calls may run concurrently; mutating calls are serialized.
5. Result Wrapping: Tool results are formatted for LLM consumption; errors are returned as error messages for self‑correction.
6. Context Update: Results are appended to dialogue history; if the context window nears its limit, the Harness triggers compaction.
7. Loop: Return to step 1 until a termination condition is met.
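Tied together, the seven steps map onto a loop roughly like this sketch; every collaborator (`assemble_prompt`, `call_llm`, `execute_tool`, `maybe_compact`) is injected as a placeholder for the machinery described above, not a real SDK call:

```python
def run_agent(user_message: str, history: list[dict], *,
              assemble_prompt, call_llm, execute_tool, maybe_compact,
              max_rounds: int = 50) -> str:
    """One harness run. Collaborators are dependency-injected placeholders
    for the components sketched in the sections above."""
    for _ in range(max_rounds):                             # termination: round budget
        messages = assemble_prompt(history, user_message)   # step 1: prompt assembly
        reply = call_llm(messages)                          # step 2: LLM inference
        if not reply.tool_calls:                            # step 3: classification
            return reply.text                               # pure text ends the loop
        for call in reply.tool_calls:                       # step 4: tool execution
            try:
                result = execute_tool(call.name, call.arguments)
            except Exception as exc:                        # step 5: result wrapping
                result = f"Error: {exc}"                    # fed back for self-correction
            history.append({"role": "tool", "name": call.name,
                            "content": str(result)})        # step 6: context update
        history[:] = maybe_compact(history)                 # compact near the window limit
    return "Stopped: round budget exhausted."               # hard termination condition
```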
Framework Implementations
Anthropic’s Claude Agent SDK exposes a query() function that returns an async iterator streaming messages. Its runtime is a “dumb loop” with all intelligence residing in the model; the Harness manages collect‑act‑verify cycles.
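The quick‑start shape looks roughly like this, based on the SDK’s published example; parameter names may differ across versions, so check the current docs:

```python
import anyio
from claude_agent_sdk import query

async def main():
    # query() returns an async iterator; the harness loop runs behind it.
    async for message in query(prompt="What is 2 + 2?"):
        print(message)

anyio.run(main)
```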
OpenAI’s Agents SDK provides a Runner class supporting async, sync and streaming modes. The Codex Harness adds three layers: Core (Agent code + runtime), App Server (JSON‑RPC API) and client UI (CLI, VS Code, web). This shared Harness explains why Codex performs better in its native UI than in generic chat windows.
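The basic synchronous entry point looks roughly like the SDK’s published hello‑world; `Runner.run` and `Runner.run_streamed` cover the async and streaming modes:

```python
from agents import Agent, Runner

agent = Agent(name="Assistant", instructions="You are a helpful assistant")

# Runner drives the orchestration loop until the agent produces final output.
result = Runner.run_sync(agent, "Write a haiku about recursion in programming.")
print(result.final_output)
```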
LangGraph models the Harness as an explicit state graph with nodes for LLM calls and tool execution, evolving from LangChain’s AgentExecutor (deprecated in v0.2). LangChain’s Deep Agents explicitly use the “Agent Harness” term.
CrewAI implements role‑based multi‑Agent architecture (Agent + Task + Crew) with a Flows layer that adds deterministic routing and verification.
AutoGen (evolving into Microsoft Agent Framework) introduces a three‑layer architecture (Core, AgentChat, Extensions) supporting sequential, concurrent, group‑chat, handoff and “magentic” orchestration modes.
Scaffold Metaphor
A scaffold is temporary infrastructure that lets workers reach otherwise inaccessible parts of a building; it does not do the building itself. As models improve, Harness complexity should shrink—Manus was rewritten five times in six months to strip unnecessary layers, replacing complex tool definitions with generic shell execution and simplifying agent handoffs.
This leads to the “co‑evolution principle”: models are fine‑tuned on the specific Harness they run on, so changing tool implementations can degrade performance.
Seven Design Decisions for Every Harness
Single Agent vs Multi‑Agent – start with a single Agent; split only when tools overload the context (roughly ten or more overlapping tools) or clearly independent task domains appear.
ReAct vs Plan‑Execute – ReAct interleaves reasoning and action (flexible but costlier); Plan‑Execute separates planning from execution, yielding up to a 3.6× speedup (LLMCompiler).
Context‑Window Strategy – five production methods (time‑based eviction, dialogue summarization, observation masking, structured notes, sub‑Agent delegation) reduce tokens by 26‑54% while keeping >95% accuracy (ACON study); a masking sketch follows this list.
Verification Loop Design – deterministic test/linter checks vs. LLM‑as‑judge; ThoughtWorks’ Fowler frames them as “guidance” (pre‑action) and “sensors” (post‑action).
Permission & Security – trade‑off between permissive fast execution and strict approval workflows, chosen per deployment context.
Tool‑Scope Strategy – more tools often hurt performance; Vercel removed 80% of tools and saw gains; Claude Code’s lazy loading cuts context by 95%.
Harness Thickness – balance logic between Harness and model. Anthropic favors thin Harnesses and lets model improvements absorb functionality; graph‑based frameworks keep more explicit control.
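For instance, the observation‑masking method from the context‑window strategies above might look like this sketch; the placeholder text and `keep_last` threshold are illustrative:

```python
def mask_old_observations(history: list[dict], keep_last: int = 5) -> list[dict]:
    """Replace the bodies of stale tool results with a short placeholder.
    The model still sees *that* a step happened without paying tokens for
    bulky output it no longer needs; recent observations stay intact."""
    tool_turns = [i for i, m in enumerate(history) if m.get("role") == "tool"]
    stale = set(tool_turns[:-keep_last])
    masked = []
    for i, msg in enumerate(history):
        if i in stale:
            masked.append({**msg,
                           "content": f"[output of {msg.get('name', 'tool')} elided]"})
        else:
            masked.append(msg)
    return masked
```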
Conclusion
Two products using the same LLM can differ dramatically in performance solely because of Harness design; TerminalBench shows that swapping the Harness alone can move an Agent up 20+ ranking positions. The Harness remains the hardest engineering problem: managing scarce context, designing verification loops before errors accumulate, building reliable memory, and deciding how much scaffolding to retain versus offload to the model.
As models get stronger, Harnesses will become thinner but never disappear—every powerful model still needs infrastructure to manage its context window, execute tools, persist state and verify work.