OpenAI vs Anthropic: Two Harness Strategies for Code Agent Engineering
OpenAI and Anthropic both published 2026 papers on AI Code Agents, and they propose opposite harness designs: OpenAI relies on strict structural constraints and linter-driven engineering, while Anthropic uses a three-agent evaluation loop. This article compares their mechanisms, trade-offs, costs, and suitable use cases.
In early 2026 OpenAI and Anthropic each released a deeply technical paper with the word "Harness" in its title. Both describe systems that enable AI agents to reliably produce high-quality code over several hours of continuous operation, yet the two papers present fundamentally different engineering approaches.
What the two papers describe
OpenAI: a 7‑person team and a 1‑million‑line code experiment
OpenAI’s article “Harness Engineering: Leveraging Codex in an Agent‑First World” (Lopopolo 2026) documents a five‑month experiment carried out by a team that grew from three to seven engineers. Under a strict “zero human‑written code” constraint, the Codex agent produced roughly one million lines of code and 1,500 merged pull requests. Agents ran for up to six hours after engineers left for the day, operating in a real production monorepo.
The core discovery was that when AI becomes the primary code producer, engineers shift from writing code to designing a robust environment, which the authors call a harness. The harness is a layered domain architecture enforced by a custom linter: Types → Config → Repo → Service → Runtime → UI. Each sub-system has an AGENTS.md file (88 in total) that supplies contextual indexes. Every agent runs in an isolated Git worktree equipped with a full observability stack. A "doom-loop" detector fingerprints tool calls within a sliding window to prevent infinite loops, and an error classifier injects targeted recovery hints for permission, file-not-found, and syntax errors. After the main coding agent finishes, multiple reviewer agents perform code reviews locally and in the cloud, iterating until all reviewers are satisfied. Large refactorings spawn sub-agents that run in parallel.
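The paper is summarized here only at the level of the sliding-window fingerprinting idea, so the following is a minimal sketch of what such a doom-loop detector could look like; the window size, repeat threshold, and fingerprint fields are assumptions for illustration, not values from the article.

```python
import hashlib
from collections import deque

class DoomLoopDetector:
    """Flags an agent that keeps issuing near-identical tool calls.

    Hypothetical sketch: the window size and repeat threshold are
    illustrative, not numbers reported by OpenAI.
    """

    def __init__(self, window: int = 20, max_repeats: int = 4):
        self.window = deque(maxlen=window)   # recent tool-call fingerprints
        self.max_repeats = max_repeats

    def fingerprint(self, tool: str, args: dict) -> str:
        # Hash the tool name plus normalized arguments so the same call
        # with the same inputs always yields the same fingerprint.
        payload = tool + "|" + "|".join(f"{k}={args[k]}" for k in sorted(args))
        return hashlib.sha256(payload.encode()).hexdigest()

    def observe(self, tool: str, args: dict) -> bool:
        """Record one tool call; return True if the agent looks stuck."""
        fp = self.fingerprint(tool, args)
        self.window.append(fp)
        return self.window.count(fp) >= self.max_repeats
```

The error classifier described alongside it would plausibly sit on the same event stream, matching failure categories (permission, file-not-found, syntax) and injecting a short recovery hint into the agent's context instead of letting it retry blindly.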
One-sentence summary: OpenAI built a strictly regulated railway network and let a fleet of trains run on it.
Anthropic: three agents “competing” to build a product from scratch
Anthropic’s article “Harness Design for Long‑Running Application Development” (Rajasekaran 2026) tackles a different problem: having an LLM autonomously create a complete, product‑grade application from nothing, rather than fixing bugs in an existing codebase.
The authors identify two failure modes: “context‑window anxiety,” where the model’s output quality collapses near the token limit, and “self‑evaluation bias,” where an agent’s self‑review is overly optimistic. Their solution is a three‑agent collaboration:
Planner expands high‑level requirements into a list of 200+ detailed features.
Generator implements those features in sprint‑like iterations.
Evaluator uses Playwright to interact with the running app (clicking buttons, dragging components, entering data) and scores it on design, originality, craftsmanship, and functionality. Any dimension below a threshold forces a full sprint redo; a minimal sketch of this gate follows below.
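The article does not spell out how the threshold gate is wired, but the control flow reduces to a simple loop over sprints. The sketch below is one possible outline: the four dimension names come from the paper, while the `generator.implement` and `evaluator.score` interfaces, the threshold value, and the attempt budget are hypothetical.

```python
from dataclasses import dataclass

DIMENSIONS = ("design", "originality", "craftsmanship", "functionality")
PASS_THRESHOLD = 7.0  # hypothetical cut-off; the paper does not publish the value

@dataclass
class SprintResult:
    scores: dict[str, float]

    def passed(self) -> bool:
        # A single weak dimension is enough to fail the whole sprint.
        return all(self.scores[d] >= PASS_THRESHOLD for d in DIMENSIONS)

def run_sprint(features, generator, evaluator, max_attempts: int = 3):
    """Generator implements the sprint; Evaluator drives the running app
    via Playwright and scores it. Any failing dimension forces a full redo."""
    for attempt in range(max_attempts):
        build = generator.implement(features)          # hypothetical interface
        result = SprintResult(evaluator.score(build))  # clicks, drags, types, then scores
        if result.passed():
            return build
    raise RuntimeError("sprint did not converge within the attempt budget")
```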
A full run costs 200 token units (about 20× a baseline without a harness). The paper showcases a digital‑audio‑workstation (DAW) case study: three build‑QA cycles took 2 h 7 min, 1 h 2 min, and finally 10.9 min, clearly demonstrating convergence of the feedback loop.
One‑sentence summary: Anthropic assembled a three‑person team—product manager, developer, QA—and let them collaborate autonomously until delivery.
Core differences: hard constraints vs soft evaluation
Mechanism – OpenAI uses structural constraints (linter + CI + layered architecture); Anthropic relies on behavioral evaluation (independent Evaluator with Playwright).
Agent topology – OpenAI: coding agent + multiple reviewer agents + sub‑agents under a shared rule set; Anthropic: Planner / Generator / Evaluator each with distinct responsibilities.
Quality assurance – OpenAI: pre‑emptive rule enforcement; Anthropic: post‑hoc evaluation.
State management – OpenAI: repository acts as a state machine with per‑worktree isolation; Anthropic: file system + Git + sprint contracts.
Fault tolerance – OpenAI: fast detection + cheap rollback (short‑lived PRs); Anthropic: sprint‑level failure and retry based on hard thresholds.
Run cost – OpenAI: not disclosed, though its review loops incur additional token cost; Anthropic: 200 token units per run, roughly 20× a no-harness baseline.
The contrast is not merely “pre‑commit checks vs post‑commit testing.” It reflects a deeper split: OpenAI’s problem is engineering‑oriented (deterministic, rule‑checkable), while Anthropic’s is design‑oriented (subjective, requiring judgment).
Current ceiling of end‑to‑end Code Agent delivery
Both papers start from minimal input: OpenAI from a single engineer prompt, Anthropic from a 1‑4 sentence product brief. OpenAI’s endpoint is a merged PR in an existing monorepo; Anthropic’s endpoint is a fully runnable application with UI, backend, and database. The former resembles a senior engineer’s workflow, the latter a full‑stack outsourcing team.
What can already be achieved:
With a well‑designed harness, agents can run for hours, understand requirements, generate code, and deliver it autonomously.
OpenAI demonstrated incremental feature development in a million‑line codebase; Anthropic built a multi‑module DAW from scratch.
What remains out of reach:
OpenAI admits agents inherit sub‑optimal patterns from the repository, requiring ~20 % of weekly engineering time for “AI‑mud” cleanup.
Anthropic’s DAW suffered physics engine bugs, nonsensical level design, and an evaluator that cannot judge musical taste (“Claude can’t actually hear”).
Both acknowledge that agents are near‑usable for functional correctness but still lack “taste” and “judgment” at the design level.
The true ceiling is not whether agents can finish a task, but whether the delivered quality meets production standards at an acceptable cost. Anthropic reports a 20× token cost over a no‑harness baseline; OpenAI’s cost is hidden but includes extensive linter infrastructure and 88 AGENTS.md files.
When to choose which approach
If your agents mainly perform “deterministic” work—bug fixes, feature additions, refactoring in a mature codebase—OpenAI’s constraint‑driven harness is preferable. It scales with team size, offers low runtime cost, and does not depend on model cleverness.
If your agents must exercise “judgment”—building new products, designing user interactions, producing creative or aesthetic output—Anthropic’s evaluation‑driven harness is more suitable, albeit with higher token consumption and linear cost growth.
Technical debt also differs: OpenAI’s debt lies in “over‑constrained” rules that become friction as models evolve; Anthropic’s debt is “model‑dependent,” easing as newer models reduce the need for extensive evaluation loops.
In practice, the two can be combined: a strict linter‑based harness for baseline correctness, supplemented by an independent Evaluator agent for design quality.
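As a rough illustration of that combination (everything here is hypothetical: the lint command, the evaluator interface, and the exact ordering are one possible wiring, not something either paper prescribes):

```python
import subprocess

def deliver(change, evaluator, threshold: float = 7.0):
    """Two-stage gate: cheap structural checks first, judgment second."""
    # Stage 1: constraint-driven harness. The linter/CI step must pass
    # before any expensive evaluation is spent on the change.
    lint = subprocess.run(["make", "lint"], capture_output=True, text=True)  # assumed command
    if lint.returncode != 0:
        return {"status": "rejected", "stage": "lint", "detail": lint.stdout}

    # Stage 2: evaluation-driven harness. An independent Evaluator agent
    # exercises the running app and scores design quality.
    score = evaluator.score(change)  # hypothetical interface
    if score < threshold:
        return {"status": "rejected", "stage": "evaluation", "score": score}
    return {"status": "accepted", "score": score}
```

The ordering matters: the cheap deterministic gate filters most failures before the expensive judgment gate is run at all.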
Conclusion: the harness era is just beginning
The two papers send a clear signal: the focus of Code Agent engineering is shifting from “which model is stronger” to “which harness design is better.” OpenAI warns that improving agent performance is rarely about pushing the model harder; it is almost always about redesigning the environment.
Future work must address harness templates for different code‑base types (micro‑services, monorepos, data pipelines), long‑term state management across tasks, and automated testing of harness components themselves. As model capabilities improve, the space of viable harness combinations will not shrink but move, making harness engineering a continuously evolving discipline.