Big Model vs. Big Harness: Who Really Powers AI Agents?

This article examines whether the success of AI agents stems from ever-stronger large language models or from the surrounding harness (context management, tool orchestration, and reliability engineering), weighing both camps' arguments and empirical evaluations before closing with practical guidance for developers.

What is a Harness?

A harness originally meant horse tack, the gear that directs a horse's power; in electrical engineering, a wiring harness bundles cables so current goes where it should. Applied to LLM agents, the model is the horse and the harness is everything around it: context handling, tool integration, retry logic, and state storage. At its core, nearly every agent harness drives a loop like this:

def run_agent(model, context):
    # Call the model, execute any requested tool calls, feed the results
    # back into the context, and repeat until a final answer arrives.
    while True:
        response = model.call(context)
        if response.is_final_answer:
            return response.content
        for call in response.tool_calls:
            result = execute_tool(call.name, call.args)
            context.append(tool_result=result)  # tool output re-enters the loop

Claude Code, Cursor Agent, and Manus all run this loop; the difference lies in how the outer harness is built. Proponents of the “Big Model” view argue for a thin harness, while “Big Harness” advocates treat the harness as the product’s core.

Arguments from the Big Model Camp

Boris Cherny, creator of Claude Code, repeatedly says the secret lies in the model. The Claude Code harness is deliberately kept thin and is rewritten every three to four weeks; Cherny invokes the Ship of Theseus to describe replacing harness components piece by piece as the model grows more capable.

Noam Brown adds a historical perspective: before reasoning models, agents relied on multiple calls to ordinary models to simulate reasoning. With dedicated reasoning models, heavyweight scaffolding often hurts performance; simply handing the problem to the model can suffice. Scaffolding techniques such as Chain-of-Thought, Tree-of-Thought, and Self-Consistency date from that earlier era and mark what reasoning models have since internalized.
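As a concrete taste of that pre-reasoning-model scaffolding, here is a minimal Self-Consistency sketch: sample several chain-of-thought completions from an ordinary model and keep the majority answer. The model.sample call and final_answer field are illustrative assumptions, not any particular vendor's API.

from collections import Counter

def self_consistency(model, question, n=5):
    # Self-Consistency: sample n chain-of-thought completions and
    # return the most frequent final answer (majority vote).
    answers = []
    for _ in range(n):
        # model.sample() is a hypothetical nondeterministic completion call.
        completion = model.sample(f"{question}\nThink step by step.")
        answers.append(completion.final_answer)
    return Counter(answers).most_common(1)[0][0]

A reasoning model collapses this whole loop into a single call, which is exactly Brown's point.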

Empirical data from Scale AI’s SWE‑Atlas shows that Opus 4.6 scores 2.5 points higher in Claude Code’s proprietary harness than in a generic SWE‑Agent harness, while GPT‑5.2 performs better with a generic harness, indicating that harness impact is modest and model‑specific.

Arguments from the Big Harness Camp

Jerry Liu, founder of LlamaIndex, states that the biggest bottleneck for using AI is the engineer's ability to manage context and workflow, i.e., the harness. He emphasizes that while the model is a fixed quantity from the application team's point of view, the harness can be continuously optimized for stability.

Vercel discovered that cutting the number of tools an agent can use by 80% actually improves overall performance, because fewer tool choices reduce decision complexity and error accumulation.
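A minimal sketch of that idea, assuming a crude keyword-overlap relevance score; the Tool shape and the heuristic are illustrative assumptions, not Vercel's implementation:

from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    description: str

def prune_tools(tools, task, keep_fraction=0.2):
    # Rank tools by how much their description overlaps with the task,
    # then keep only the top slice: fewer choices, less decision complexity.
    task_words = set(task.lower().split())
    def overlap(tool):
        return len(task_words & set(tool.description.lower().split()))
    ranked = sorted(tools, key=overlap, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]

In practice the relevance score would likely use embeddings rather than word overlap, but the shape of the intervention is the same: shrink the menu before the model orders.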

Manus rewrote its harness five times over six months without changing the underlying model, each rewrite noticeably increasing reliability, showing that harness engineering has its own depth.

A blog post titled “Improving 15 LLMs at Coding in One Afternoon – Only the Harness Changed” reports that after harness optimizations, fifteen different LLMs all showed large gains on coding tasks, supporting the view that both model improvements (Y‑axis) and harness improvements (X‑axis) are valuable investment areas.

Why the Debate Has No Clear Winner

Both sides are correct but focus on different timeframes. The Big Model camp looks at a dynamic process where models continuously improve, eventually absorbing functions previously handled by harnesses. The Big Harness camp focuses on the current engineering reality, where the existing model’s performance is heavily dependent on the quality of the surrounding system.

If model capabilities jump every six months, harness work may become obsolete quickly; however, stronger models also enable harder tasks, spawning new harness requirements. Task type matters: coding benchmarks clearly benefit from stronger models, whereas many enterprise agent scenarios are limited by external system variability, API changes, and undocumented data—issues only a robust harness can address.

Stakeholders also have vested interests: model vendors claim the model is the secret sauce, framework vendors champion harnesses, and application teams prefer simplicity. Recognizing these biases helps readers evaluate arguments objectively.

Practical Guidance for Engineers

Instead of choosing a side, engineers should first identify the bottleneck in their project. A quick “bare‑run”—sending the task directly to the model without any harness—can reveal whether the model alone achieves >70% success. If it does, focus on reliability and observability rather than adding complex orchestration.
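A bare-run can be scripted in a few lines. This sketch assumes each task comes with a programmatic checker and that model.complete is a hypothetical single-shot completion call:

def bare_run_success_rate(model, tasks):
    # tasks: list of (prompt, is_correct) pairs, where is_correct is a
    # callable that judges the raw model answer with no harness involved.
    passed = 0
    for prompt, is_correct in tasks:
        answer = model.complete(prompt)  # one call, no tools, no retries
        if is_correct(answer):
            passed += 1
    return passed / len(tasks)

If the measured rate already clears the 70% bar, the remaining work is reliability and observability, not orchestration.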

If the bare‑run fails, diagnose the failure type: model misunderstanding (swap model or adjust prompts), missing information (improve context engineering), or missing tool usage (design appropriate tools). Each category requires a distinct remedy.
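One way to make that triage explicit is a small mapping from failure category to remedy; the category names are ours, the remedies are the ones listed above:

from enum import Enum, auto

class Failure(Enum):
    MISUNDERSTANDING = auto()  # the model misreads the task
    MISSING_INFO = auto()      # the model lacks necessary context
    MISSING_TOOL = auto()      # the model needs a capability it does not have

REMEDY = {
    Failure.MISUNDERSTANDING: "swap the model or adjust the prompt",
    Failure.MISSING_INFO: "improve context engineering",
    Failure.MISSING_TOOL: "design and expose an appropriate tool",
}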

When adding a harness layer, ask why the model cannot handle the problem itself. If the answer is “external system integration,” the layer is justified. If the model forgets context, first try prompt restructuring or context compression before adding external memory. If behavior control is needed, a well-crafted system prompt may be more maintainable than an additional proxy layer.
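For the forgetting case, a compression pass is often enough before reaching for external memory. A minimal sketch, assuming a summarize helper (e.g. one extra model call) and an OpenAI-style message list; both are illustrative assumptions:

def compress_context(messages, summarize, max_messages=20, keep_recent=8):
    # Once the transcript grows past max_messages, fold the oldest turns
    # into a single summary message and keep the recent turns verbatim.
    if len(messages) <= max_messages:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(old)  # hypothetical helper, e.g. one model call
    return [{"role": "system", "content": f"Summary so far: {summary}"}] + recent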

Harness complexity should be proportional to task reliability requirements. Overly complex harnesses increase maintenance cost and become fragile when models are upgraded. Manus’s five rewrites illustrate the hidden cost of frequent harness changes.

Regardless of stance, instrument the system with sufficient logging to locate failures. Without data, the debate remains a matter of belief rather than evidence.
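A minimal instrumentation sketch using Python's standard logging module; it wraps the execute_tool call from the loop above, and the logged fields are illustrative:

import json
import logging
import time

logger = logging.getLogger("agent")

def execute_tool_logged(name, args):
    # Structured log per tool call: name, outcome, latency. With this in
    # place, locating a failure becomes a query rather than a guess.
    start = time.monotonic()
    try:
        result = execute_tool(name, args)  # the tool executor from the loop above
        logger.info(json.dumps({"tool": name, "ok": True,
                                "ms": round((time.monotonic() - start) * 1000)}))
        return result
    except Exception as exc:
        logger.error(json.dumps({"tool": name, "ok": False, "error": str(exc),
                                 "ms": round((time.monotonic() - start) * 1000)}))
        raise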

Sources

Latent Space, “Is Harness Engineering real?” (2026‑03‑05)

Scale AI, SWE‑Atlas Evaluation (2026)

METR, Claude Code and Codex Evaluation (2026)

Noam Brown, Latent Space Podcast, “Self‑Improving AI”

Boris Cherny & Cat Wu, Latent Space Podcast, “Claude Code”

Jerry Liu, Twitter/X (2026‑02)

BAIR Blog, “Compound AI Systems” (2024‑02‑18)

blog.can.ac, “Improving 15 LLMs at Coding in One Afternoon – Only the Harness Changed” (2026‑02‑12)
