12 Core Components of a Production-Grade Agent Harness and Framework Comparison
The article explains why production issues often stem from the agent harness rather than the model, defines the harness concept, breaks down its twelve essential components, shows a full execution loop, compares Anthropic, OpenAI, LangChain and other frameworks, and discusses key design trade‑offs for building robust AI agents.
Why problems are usually not in the model
When moving a chatbot or a simple ReAct loop into production, failures such as forgotten steps, tool‑call errors, and noisy context windows appear, and the common mistake is to blame the model. In reality, the surrounding infrastructure, known as the agent harness, is often the root cause. Improving only the harness while keeping the model unchanged can dramatically raise benchmark rankings, as shown by a TerminalBench 2.0 entry moving from outside the top 30 to 5th place.
What is an Agent Harness
It is more than a prompt wrapper
The term appeared officially in early 2026, but the idea existed earlier. Anthropic describes its Claude Code SDK as the “agent harness” that drives Claude Code, and OpenAI’s Codex team uses the same wording. As Vivek Trivedy (LangChain) puts it, “If you’re not the model, you’re the harness.”
Agent vs. Harness
An Agent is the observable behavior—goal‑directed actions, tool usage, self‑correction. The Harness is the underlying machinery that orchestrates loops, registers tools, manages context, persists state, enforces guardrails, and runs verification. In practice, saying “I built an agent” really means “I built a harness and attached a model.”
Think of it as an operating system
The LLM is the CPU, the context window its RAM, and external databases and files its disk. The harness functions as the OS, handling memory, I/O, and scheduling. Three concentric engineering layers surround the model: prompt engineering, context engineering, and harness engineering.
12 Components of a Production‑Grade Harness
Orchestration Loop – the heartbeat, typically a Thought‑Action‑Observation (ReAct) cycle: assemble prompt → invoke model → parse output → execute tool → inject result → repeat.
Tools – schemas describing name, description, and parameters; runtime concerns include registration, validation, sandboxed execution, and result formatting (a schema sketch follows this list).
Memory – short‑term (session history) and long‑term (indexed stores, databases, files). Claude Code uses a three‑layer structure: lightweight index, topic files, raw records.
Context Management – prevents “lost in the middle” degradation by using compaction, observation masking, just‑in‑time retrieval, or sub‑agent delegation.
Prompt Assembly – stacks system prompt, tool definitions, memory files, conversation history, and user message, with OpenAI’s Codex prioritizing server‑side system messages.
Output Parsing – modern harnesses prefer native tool_calls instead of free‑form text, routing based on the presence of tool_calls.
State Persistence & Checkpoint – LangGraph models state as typed dictionaries, OpenAI offers SDK sessions and response chaining, Claude Code treats git commits as checkpoints.
Error Handling & Retry – accounts for compounding failure (at 99% per‑step success, a ten‑step task finishes end‑to‑end only ~90% of the time, since 0.99¹⁰ ≈ 0.904; a worked example follows this list) and classifies errors into instant retries, LLM‑recoverable, user‑fixable, and unexpected failures.
Permissions & Guardrails – separate model intent from tool permission checks; Claude Code uses three‑stage checks (trust boundary, per‑call check, human confirmation), OpenAI splits guardrails into input, output, and tool layers.
Verification Loop – combines rule‑based tests, visual checks (e.g., Playwright screenshots), and LLM‑as‑judge to catch semantic errors; Claude Code reports 2‑3× quality gains with strong verification paths.
Sub‑Agent & Execution Models – fork, teammate, and worktree strategies for scaling beyond a single context window; OpenAI treats specialist agents as tools, LangGraph nests sub‑agents in state graphs.
Termination & Lifecycle – stops when no tool call, max rounds exceeded, token budget exhausted, tripwire triggered, user abort, or safety refusal.
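To make component 2 concrete, here is a minimal tool schema in the OpenAI function‑calling convention (one common format; Anthropic’s differs slightly in field names). The read_file tool itself is a hypothetical example:

```python
# A minimal tool schema in the OpenAI function-calling format.
# "read_file" is a hypothetical example tool, not part of any SDK.
read_file_tool = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a UTF-8 text file and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "Path relative to the workspace root.",
                },
            },
            "required": ["path"],
        },
    },
}
```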
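And for component 8, a worked example of why per‑step reliability compounds, with one plausible mapping of the four error classes to retry policies (the bucket names are illustrative, not a specific framework’s API):

```python
# Compounding reliability: 99% per step over ten steps is only ~90% overall.
per_step_success = 0.99
steps = 10
print(f"end-to-end success: {per_step_success ** steps:.1%}")  # ~90.4%

# One plausible policy table for the four error classes named above.
ERROR_POLICY = {
    "transient_network": "instant_retry",     # retry immediately, no LLM involved
    "bad_tool_arguments": "llm_recoverable",  # feed the error back to the model
    "missing_credentials": "user_fixable",    # surface to the human operator
    "unknown": "fail_loudly",                 # stop and report, don't guess
}
```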
How a Full Loop Runs
The seven steps are: (1) Prompt Assembly – combine system prompt, tool schema, memory, history, and user input; (2) LLM Inference – get text and/or tool_calls; (3) Output Classification – decide whether to continue, execute a tool, or hand off; (4) Tool Execution – validate, check permissions, run in sandbox, collect result; (5) Result Packaging – wrap result as an observation; (6) Context Update – append to history and trigger compaction if needed; (7) Loop – repeat.
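A minimal sketch of that loop, assuming a chat‑completions‑style API with native tool_calls; call_model, TOOLS, and execute_tool are hypothetical helpers, not a specific framework’s API:

```python
import json

MAX_ROUNDS = 20  # termination guard alongside "no tool call"

def run_agent(system_prompt: str, user_message: str) -> str:
    # (1) Prompt assembly: system prompt + tool schemas + history + user input
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]
    for _ in range(MAX_ROUNDS):
        # (2) LLM inference
        reply = call_model(messages, tools=TOOLS)
        messages.append(reply)
        # (3) Output classification: no tool_calls means the agent is done
        if not reply.get("tool_calls"):
            return reply["content"]
        for call in reply["tool_calls"]:
            # (4) Tool execution: validate args, check permissions, sandbox
            result = execute_tool(call["function"]["name"],
                                  json.loads(call["function"]["arguments"]))
            # (5)+(6) Result packaging and context update
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": str(result)})
        # (7) Loop; a real harness would also trigger compaction here
    raise RuntimeError("max rounds exceeded")
```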
Mainstream Frameworks Do the Same Thing
Anthropic
Provides a thin harness where the query() async iterator drives the agentic loop, keeping most intelligence inside the model.
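A minimal sketch of that interface, based on the SDK’s published quickstart (the package has shipped as claude_code_sdk and later claude_agent_sdk, so the import may vary by version):

```python
import anyio
from claude_agent_sdk import query

async def main():
    # query() returns an async iterator; each message is one step of the loop
    async for message in query(prompt="Summarize the README in this repo"):
        print(message)

anyio.run(main)
```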
OpenAI
Centers on a Runner that supports async, sync, and streamed modes; the harness is code‑first, exposing workflow logic directly in Python. Codex adds three layers: Core (agent code + runtime), App Server (JSON‑RPC API), and Client Surfaces (CLI, VS Code, Web).
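A minimal sketch using the OpenAI Agents SDK’s synchronous entry point (Runner.run() and Runner.run_streamed() cover the async and streamed modes; the instructions text is illustrative):

```python
from agents import Agent, Runner  # pip install openai-agents

agent = Agent(
    name="Assistant",
    instructions="You are a concise coding assistant.",
)

result = Runner.run_sync(agent, "Explain what an agent harness does in one sentence.")
print(result.final_output)
```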
LangGraph / LangChain
Explicitly models the harness as a state graph with nodes like llm_call and tool_node, allowing conditional routing based on the presence of tool_calls. Deep Agents now describe themselves as full harnesses with built‑in tools, planning, context, sub‑agents, and persistence.
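A skeleton of that graph; the node bodies are placeholders, with model_with_tools and execute_tool_calls standing in for a tool‑bound chat model and a tool dispatcher:

```python
from langgraph.graph import StateGraph, MessagesState, START, END

def llm_call(state: MessagesState):
    # model_with_tools is a hypothetical tool-bound chat model
    return {"messages": [model_with_tools.invoke(state["messages"])]}

def tool_node(state: MessagesState):
    # execute_tool_calls is a hypothetical dispatcher for requested tools
    return {"messages": execute_tool_calls(state["messages"][-1])}

def route(state: MessagesState):
    # Conditional routing: continue to tools only if the model requested them
    last = state["messages"][-1]
    return "tool_node" if getattr(last, "tool_calls", None) else END

builder = StateGraph(MessagesState)
builder.add_node("llm_call", llm_call)
builder.add_node("tool_node", tool_node)
builder.add_edge(START, "llm_call")
builder.add_conditional_edges("llm_call", route)
builder.add_edge("tool_node", "llm_call")
graph = builder.compile()
```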
CrewAI / AutoGen
CrewAI emphasizes role‑based decomposition (Agent, Task, Crew) with a deterministic flow layer, while AutoGen (Microsoft Agent Framework) offers various orchestration modes such as sequential, concurrent, group chat, and handoff.
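A minimal sketch of CrewAI’s role‑based decomposition; the role, goal, and task text are illustrative:

```python
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Researcher",
    goal="Collect facts about agent harnesses",
    backstory="A meticulous technical analyst.",
)
summary_task = Task(
    description="List three responsibilities of an agent harness.",
    expected_output="A numbered list of three items.",
    agent=researcher,
)
crew = Crew(agents=[researcher], tasks=[summary_task])
print(crew.kickoff())
```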
All frameworks ultimately solve the same problem: making the model, tools, state, and verification loop work together reliably.
Why It’s Like Scaffolding
The harness does not produce intelligence itself but enables the model to act safely and consistently, much like scaffolding lets workers reach higher levels without being the building.
Design Principles
The harness should become thinner as models improve; if adding a stronger model forces more harness complexity, the design is flawed.
Models and harnesses co‑evolve; changing tool implementations can degrade performance because models often learn a specific harness during fine‑tuning.
7 Choices Every Harness Architect Faces
Single vs. Multi‑Agent – start with a single agent; split only when tool count or task domain justifies the routing overhead.
ReAct vs. Plan‑and‑Execute – ReAct offers flexibility at each step, while Plan‑and‑Execute separates planning and execution, yielding up to 3.6× speed gains.
Context Window Management – options include periodic clearing, summarization, observation masking, structured notes, or sub‑agent delegation; the best choice depends on cost vs. information loss (a minimal compaction sketch follows this list).
Verification Loop Design – combine deterministic tests/linters with LLM‑as‑judge for semantic checks.
Permission Strategy – looser permissions increase throughput but raise risk; stricter policies add safety but may require human confirmation.
Tool Exposure – expose only the minimal set needed for the current task; lazy‑load additional tools as needed.
Harness Thickness – balance how much control logic is hard‑coded versus delegated to the model; trends show thinner harnesses as models get stronger.
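As referenced in the context‑management item above, a minimal compaction pass might look like the following; count_tokens and summarize are hypothetical helpers, and the budget numbers are placeholders:

```python
def compact(messages, count_tokens, summarize, budget=100_000, keep_recent=10):
    # Leave history untouched while it fits the token budget.
    if sum(count_tokens(m) for m in messages) <= budget:
        return messages
    # Summarize older turns (one LLM call) and keep recent turns verbatim.
    head, tail = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(head)
    return [{"role": "system",
             "content": f"Summary of earlier work:\n{summary}"}] + tail
```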
Conclusion
Even with the same underlying model, two products can differ dramatically because of their harnesses. Production‑grade harnesses are still evolving; they must manage scarce context, enforce verification before failures cascade, provide reliable memory without hallucination, and find the right trade‑off between scaffolding and model autonomy. As models improve, harnesses will likely become thinner but will remain essential for robust AI agents.