Why Agent Harnesses Outperform Models: The Power of Scaffolding in AI Agents

This article examines how the design of agent harnesses—simple loops with atomic tools and progressive disclosure—determines the performance ceiling of AI agents, showing that optimized scaffolding can double success rates, cut token usage by up to 47%, and outweigh model selection.


What Is an Agent Harness?

According to LangChain founder Harrison Chase, a framework is abstract and neutral, while a harness is a plug‑and‑play solution that includes a full stack of capabilities. An agent harness comprises everything that surrounds a model to make it useful: execution loops, tool definitions, error recovery, state management, and information flow. The model decides what to do; the harness decides what the model can see, which tools it can use, and how to handle failures.

All production‑grade agents converge on a core loop:

while (model returns tool call):
    execute tool → capture result → append to context → call model again

This simple loop powers Claude Code, Cursor, and Manus agents; the engineering challenge lies in what is built around it.
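
A minimal sketch of that loop in Python makes the point concrete (call_model and execute_tool are hypothetical stand-ins for a model API and a tool dispatcher, not any vendor's actual interface); note that errors are simply fed back as tool results, a pattern the harnesses below all share:

```python
# Minimal sketch of the core agent loop; call_model and execute_tool are
# hypothetical stand-ins, not a specific vendor API.
def run_agent(messages, tools, call_model, execute_tool, max_steps=50):
    for _ in range(max_steps):
        reply = call_model(messages, tools)      # model picks: final answer or tool calls
        messages.append(reply.as_message())      # keep the assistant turn in context
        if not reply.tool_calls:
            return reply.text                    # no tool call -> the task is done
        for call in reply.tool_calls:
            try:
                result = execute_tool(call.name, call.arguments)
            except Exception as exc:             # errors go back as tool results;
                result = f"ERROR: {exc}"         # the model decides how to recover
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": str(result)})
    raise RuntimeError("step budget exhausted")
```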

How Major Companies Build Their Harnesses

1. Claude Code – Model‑Control Loop

Loop Mechanism: a flat message list driven by a while(tool_call) loop (internally named nO), with no complex DAG orchestration.

Atomic Tools: roughly 18 primitives grouped into command-line discovery, file interaction, web access, and orchestration. Primitives are preferred over integrated solutions (e.g., regex search instead of a vector database for code search).

Information Layering: six layers are loaded at session start – organization policy, project-level CLAUDE.md, user settings, auto-learned MEMORY.md, session history, and Git status. A key pattern is injecting a system reminder after each tool execution.

Error Recovery: errors are returned to the model as ordinary tool results; the model decides how to respond.

TodoWrite Trick: a no-op tool that records the plan, letting the model re-read its TODO list when errors occur and keeping the agent on track.
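
A hedged sketch of this pattern (the function names and reminder wording are illustrative, not Claude Code internals): the tool does no real work, it just records the plan so the harness can re-surface it after each tool execution.

```python
# Illustrative TodoWrite-style no-op tool plus system-reminder injection;
# names and wording are assumptions, not Claude Code internals.
_todo: list[str] = []

def todo_write(items: list[str]) -> str:
    """The 'tool' only records the plan; it has no side effects."""
    _todo[:] = items
    return "Todo list updated:\n" + "\n".join(f"- {item}" for item in items)

def post_tool_reminder() -> str:
    """Appended after every tool result to keep the plan in recent attention."""
    if not _todo:
        return ""
    return "<system-reminder>Current plan:\n" + "\n".join(_todo) + "</system-reminder>"
```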

2. Cursor – Files as Primitives

Model-Specific Harness Tuning: each frontier model gets a custom harness with tailored tool names, prompts, and behavior (e.g., rg for Codex, different summary formats for Claude).

Files as Primitives: all components map to files, enabling powerful search (ripgrep, jq), natural grouping, and versioning. The team states, “Files are a simple yet powerful primitive, safer than adding another abstraction layer.”

Custom Semantic Search: embeddings are trained on agent trajectories to predict which files should be retrieved early, improving search accuracy by 12.5% and code retention in large codebases by 2.6%.

3. Manus – KV‑Cache‑First Design

Logit Masking Over Tool Removal: instead of dynamically adding and removing tools (which invalidates the KV-cache), Manus loads all ~29 tools permanently and uses logit masking during decoding to control availability.
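
The idea can be sketched as follows (illustrative only, not Manus internals): every tool schema stays in the prompt so the KV-cache remains valid, and availability is enforced at decode time by masking the logits of the tokens that would select a disallowed tool.

```python
import numpy as np

# Illustrative logit mask over tool choices; tool_token_ids maps each tool name
# to the token id that begins its function-call header (an assumed encoding).
def mask_tool_logits(logits: np.ndarray, tool_token_ids: dict[str, int],
                     allowed: set[str]) -> np.ndarray:
    masked = logits.copy()
    for tool, token_id in tool_token_ids.items():
        if tool not in allowed:
            masked[token_id] = -np.inf   # this tool can never be sampled this step
    return masked
```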

Hierarchical Action Space: three levels – Level 1 (≈20 atomic functions), Level 2 (sandboxed Bash utilities and MCP tools), Level 3 (dynamic scripts with pre-installed libraries). This keeps tool definitions out of the context window while preserving capability.

Key Lesson: simpler scaffolding (a plain shell rather than elaborate tool definitions) scales better as models become stronger.

4. SWE‑Agent – Agent‑Computer Interface

Linter-Guarded Code Editing: when the agent issues an edit, a linter runs first; if the code is syntactically invalid, the edit is rejected and the agent must retry, improving reliability by ~3%.
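
A sketch of that gate for Python files, with ast.parse standing in for SWE-agent's real linter: the file is only written if the new source parses, and otherwise the rejection message becomes the tool result the agent has to react to.

```python
import ast

# Linter-guarded edit sketch; ast.parse stands in for a real linter and only
# covers Python source files.
def apply_edit(path: str, new_source: str) -> str:
    try:
        ast.parse(new_source)
    except SyntaxError as exc:
        return (f"EDIT REJECTED: syntax error on line {exc.lineno}: {exc.msg}. "
                "The file was not modified; fix the edit and retry.")
    with open(path, "w", encoding="utf-8") as f:
        f.write(new_source)
    return "Edit applied."
```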

Observation Compression: only the last five observations are kept verbatim; older observations are compressed into a single line, implementing progressive disclosure inside the loop.
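
A minimal sketch of that compression step (the five-observation cutoff follows the description above; the stub format is an assumption):

```python
# Keep the last N tool observations verbatim; collapse older ones to one line.
def compress_observations(messages: list[dict], keep_last: int = 5) -> list[dict]:
    tool_idx = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    stale = set(tool_idx[:-keep_last]) if len(tool_idx) > keep_last else set()
    compressed = []
    for i, msg in enumerate(messages):
        if i in stale:
            head = (msg.get("content") or "").splitlines()[:1]
            summary = head[0][:80] if head else ""
            msg = {**msg, "content": f"[old observation elided: {summary}]"}
        compressed.append(msg)
    return compressed
```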

5. Other Players

Devin: runs in isolated cloud VMs and uses a Playbook system plus knowledge management; its PR merge rate rose to 67% from 34% a year earlier.

Windsurf: dual-agent architecture with a planning agent that runs continuously and a short-term execution model; its memory system generates cross-session observations.

Aider: builds a PageRank-based repository map using tree-sitter, selecting the most important symbols to fill a token-budgeted context.
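
A rough sketch of the ranking step (Aider extracts definitions and references with tree-sitter; here they are assumed to be pre-extracted dictionaries, and networkx supplies PageRank as a convenient stand-in):

```python
import networkx as nx

# defines[file] = symbols defined there; references[file] = symbols it uses.
def rank_files(defines: dict[str, set[str]], references: dict[str, set[str]]) -> list[str]:
    graph = nx.DiGraph()
    for ref_file, used in references.items():
        for def_file, defined in defines.items():
            if def_file != ref_file and used & defined:
                graph.add_edge(ref_file, def_file)   # "ref_file depends on def_file"
    if graph.number_of_nodes() == 0:
        return []
    ranks = nx.pagerank(graph)
    return sorted(ranks, key=ranks.get, reverse=True)  # most-depended-on files first
```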

Replit Agent: evolved from a single agent into a three-agent stack (Manager, Editor, Verifier) with a self-repair loop: generate → execute → test with Playwright → fix → rerun.
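
That self-repair loop reduces to a small sketch (generate, execute, run_tests, and fix are hypothetical callables; the Manager/Editor/Verifier split is omitted):

```python
# Generate -> execute -> test -> fix -> rerun, up to a fixed repair budget.
def self_repair(generate, execute, run_tests, fix, max_rounds: int = 3):
    code = generate()
    for _ in range(max_rounds):
        output = execute(code)
        failures = run_tests(output)       # e.g. browser-driven checks in Replit's case
        if not failures:
            return code                    # verified: every check passed
        code = fix(code, failures)         # feed the failures back for a targeted fix
    return code                            # best effort after the repair budget runs out
```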

Progressive Disclosure: The Unnamed Pattern

Borrowed from UI/UX design, progressive disclosure shows only what is needed now and reveals complexity on demand. For agents, this means layered context loading that reduces attention fragmentation.

Production Implementations

Claude Code SKILL.md: skills are stored in .claude/skills/ and are loaded only when Claude detects they are relevant, avoiding up-front loading and context bloat.

Cursor Delayed MCP Tool Loading: tool definitions live in a folder; only the names are provided as static context, and full definitions are fetched on demand, cutting token usage by 46.9%.
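
A sketch of the pattern (the on-disk layout and function names are assumptions, not Cursor's implementation): only the tool names enter the static prompt, and a full schema is read and cached the first time the model asks for that tool.

```python
import json
import pathlib

TOOL_DIR = pathlib.Path("tools")            # assumed layout: one <name>.json schema per tool
_cache: dict[str, dict] = {}

def tool_names() -> list[str]:
    """Cheap static context: just the names, no schemas."""
    return sorted(p.stem for p in TOOL_DIR.glob("*.json"))

def tool_definition(name: str) -> dict:
    """Full definition fetched on demand, then cached for the session."""
    if name not in _cache:
        _cache[name] = json.loads((TOOL_DIR / f"{name}.json").read_text(encoding="utf-8"))
    return _cache[name]
```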

Manus File-System Unloading: agents read and write files as needed; bulky fetched content can be dropped as long as its URL or file path is kept, and todo.md recitation pushes the global plan into the model’s recent attention window, combating “lost in the middle” effects.
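
A sketch of restorable unloading (the stub text and field names are illustrative): the body of a large observation is dropped from the message list, but its file path or URL stays so the agent can re-read it later.

```python
# Drop bulky content but keep a pointer the agent can follow to restore it.
def unload_observation(message: dict, max_chars: int = 500) -> dict:
    content = message.get("content", "")
    source = message.get("source", "unknown source")   # file path or URL, if recorded
    if len(content) <= max_chars:
        return message
    stub = f"[content unloaded to save context; re-read {source} if it is needed again]"
    return {**message, "content": stub}
```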

Quantitative Cases: The Power of the Harness

Token Efficiency

Claude‑Mem static loading injects 25,000 tokens for 0.8% relevance; progressive disclosure reduces this to 955 tokens, a 26× efficiency gain.

Cursor’s delayed loading saves 46.9% of tokens.

Vercel removed 80% of tools, dropping token count from 145,463 to 67,483, steps from 100 to 19, and latency from 724 s to 141 s, turning task failure into success.

Harness Beats Model

CORE‑Bench: Claude Opus 4.5 scores 42% with one harness, 78% with another.

Sonnet 4: 33% vs 47%; Sonnet 4.5: 44% vs 62% across different harnesses.

LangChain deep‑agents CLI improves from 52.8% to 66.5% on TerminalBench 2.0 by changing only the harness.

Why Harnesses Matter

Liu et al. (TACL 2024) show LLM performance follows a U‑shaped curve: information at the beginning or end of the prompt yields the best results, while middle information degrades performance. Progressive disclosure keeps inputs concise and places newly retrieved information at the end, preserving high attention.

Industry Consensus and Divergence

Consensus: single flat loops beat complex orchestration; file systems act as extended memory; error logs are worth retaining; pseudo-planning tools (TodoWrite, todo.md) help; primitives (bash, grep, the file system) outperform custom integrations.

Tool-Overload Divergence: Manus prefers loading all tools and masking logits; Cursor prefers on-demand loading. Both are effective; the choice depends on token economics.

Management-Intensity Divergence: Google bets on giving the model everything (a 2M-token window), while others build harnesses that filter and route information. Evidence suggests “heavy harness” approaches currently win in practice.

Open Questions: no standard benchmark exists to compare harness quality head-to-head; evaluation metrics and when to share sub-agent state remain matters of practical experience.

Takeaways for Building Agents

Invest engineering effort in the harness rather than chasing the newest model.

Progressive disclosure is not optional; it is a required architectural pattern.

Design your harness to become simpler, not more complex, as capabilities grow.

As Dex Horthy (creator of the “12‑Factor Agents” methodology) notes, once model input exceeds roughly 40% of its context window, agents enter a “dumb zone” where signal‑to‑noise drops and errors appear—not because the model is weak, but because the harness overloads the context.
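
That heuristic translates into a tiny guard (the 40% threshold comes from the quote above; compact() and the token counter are assumptions about your own harness):

```python
# Trigger compaction once the prompt crosses a fraction of the context window.
def maybe_compact(prompt_tokens: int, context_window: int, compact, threshold: float = 0.40):
    if prompt_tokens / context_window > threshold:
        return compact()   # e.g. summarize old turns, unload stale tool output
    return None
```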

[Figure: Agent Harness Diagram]
Tags: AI agents, Progressive Disclosure, LLM performance, Harness Engineering, tool optimization