From Claude Code to Codex: Migrating Anthropic’s Harness Design

The author reproduces Anthropic’s long‑running harness architecture on a Codex + GPT stack: planner, generator, and evaluator roles are separated, state is persisted to concrete artifacts, and strict execution constraints are enforced. The experiments show that the approach improves task success despite higher per‑run costs, and the article highlights practical pitfalls and cost‑control strategies.

Machine Learning Algorithms & Natural Language Processing

Anthropic’s engineering article “Harness Design for Long‑Running Apps” shifts the focus of agent coding from model capability to the design of the outer system.

What the Harness Solves

Separate planner, generator, and evaluator roles. A single agent that understands requirements, writes code, and validates itself drifts toward ever‑looser standards; separating these roles mirrors proven software‑engineering practice, where development and testing are done by different people.
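The separation can be sketched as three independent callables wired together by the harness. This is a minimal illustration with hypothetical interfaces, not the project's actual API; the point is that the generator never grades its own work.

```python
# Minimal sketch of the three-role split. Each role is a separate callable
# (in practice, a separate prompt and model), so generation and evaluation
# never collapse into one agent.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Sprint:
    goal: str
    patch: str = ""
    passed: bool = False

def run_sprint(goal: str,
               plan: Callable[[str], str],
               generate: Callable[[str], str],
               evaluate: Callable[[str], bool]) -> Sprint:
    """Planner refines the goal, generator writes the change,
    evaluator judges it -- three distinct calls, never one agent."""
    task = plan(goal)                       # planner: goal -> concrete task
    sprint = Sprint(goal=task)
    sprint.patch = generate(task)           # generator: produce the change
    sprint.passed = evaluate(sprint.patch)  # evaluator: independent verdict
    return sprint
```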

Persist memory to artifacts. Long‑running tasks cannot rely on the model’s fleeting context. All critical state is written to files such as product_spec.md, feature_backlog.json, sprint_contract.json, qa_report.json, progress.jsonl, and the git revision, ensuring traceability without betting on context retention.
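A sketch of what artifact‑based memory looks like for one of these files, progress.jsonl: every state change is appended as one JSON line, so a crashed or restarted run can be reconstructed from disk rather than from model context. The helper names are illustrative.

```python
# Append-only event log as the source of truth, assuming the progress.jsonl
# artifact named in the article. Events are one JSON object per line.
import json
import time
from pathlib import Path

def record_progress(run_dir: Path, event: str, detail: dict) -> None:
    """Append one JSON line per event; the file, not the context, is truth."""
    run_dir.mkdir(parents=True, exist_ok=True)
    entry = {"ts": time.time(), "event": event, **detail}
    with (run_dir / "progress.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

def load_progress(run_dir: Path) -> list[dict]:
    """Rebuild the run history from the artifact on restart."""
    path = run_dir / "progress.jsonl"
    if not path.exists():
        return []
    return [json.loads(line)
            for line in path.read_text(encoding="utf-8").splitlines() if line]
```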

Final judgment must come from executable acceptance tests. Real checks—browser clicks, API status, regression tests—are required; otherwise the sophisticated agent orchestration collapses into self‑illusion.
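The acceptance gate reduces to a simple rule: the verdict comes from a real command's exit code, never from the model's self-report. A minimal sketch (the command is illustrative; the article also scaffolds browser and API checks):

```python
# Executable acceptance: run the test command and trust only its exit code.
import subprocess

def acceptance_passed(cmd: list[str]) -> bool:
    """True iff the real test command exits 0 -- no model self-grading."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0
```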

Cost Wall

Running the harness on Claude Code produced good quality but ran up expensive bills: multi‑round agents, long‑running executions, browser validation, and repeated retries consume a great deal of compute. For independent developers, high per‑run cost forces shortcuts that undermine the harness’s structural guarantees.

Why Move to Codex + GPT

The migration principle is to keep Anthropic’s outer architecture unchanged while replacing the expensive execution plane. Codex handles the generator role (file‑system‑centric code generation), while GPT handles planning, summarisation, and evaluation. This pairing covers all three roles without the prohibitive cost of Claude Code.
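The pairing amounts to a fixed role‑to‑model routing table. A sketch with placeholder model identifiers (the actual model names are not specified in the article):

```python
# Role-to-model routing: each harness role is bound to the cheapest model
# that can do the job, with Codex only on the generation plane.
# Model identifiers below are placeholders, not real model names.
ROLE_MODELS = {
    "planner":   "gpt-planning-model",    # planning and summarisation
    "generator": "codex-model",           # file-system-centric code generation
    "evaluator": "gpt-evaluation-model",  # independent acceptance judgment
}

def model_for(role: str) -> str:
    if role not in ROLE_MODELS:
        raise KeyError(f"unknown harness role: {role}")
    return ROLE_MODELS[role]
```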

What Was Replicated

Three‑role structure. Planner, generator, evaluator remain distinct.

Artifact‑based memory. All state is stored under .harness/runs/<run_id>/:

run_manifest.json
product_spec.md
design_language.md
feature_backlog.json
progress.jsonl
sprint_contract.json
self_test_report.json
qa_report.json
final_report.json
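Scaffolding one run directory with these artifacts might look like the sketch below. Only the manifest is written eagerly; the other files are created by the roles that own them as the run progresses. The function and manifest shape are assumptions, not the project's actual code.

```python
# Create .harness/runs/<run_id>/ and write the run manifest, assuming the
# artifact names listed in the article.
import json
import uuid
from pathlib import Path

ARTIFACTS = [
    "run_manifest.json", "product_spec.md", "design_language.md",
    "feature_backlog.json", "progress.jsonl", "sprint_contract.json",
    "self_test_report.json", "qa_report.json", "final_report.json",
]

def new_run(root: Path) -> Path:
    """Allocate a run id, create its directory, and persist the manifest."""
    run_id = uuid.uuid4().hex[:8]
    run_dir = root / ".harness" / "runs" / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    manifest = {"run_id": run_id, "artifacts": ARTIFACTS}
    (run_dir / "run_manifest.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8")
    return run_dir
```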

Independent acceptance. Evaluator runs local tests and reads real outputs; browser and API checks are scaffolded as in the Anthropic article.

Multi‑round redo loop. A failed sprint writes its QA results back into context and triggers a new sprint, and the loop automatically retries when it detects that a sprint made no effective change.
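The redo loop can be sketched as follows. Content hashing stands in for the real git‑revision check, and the callables are hypothetical stand‑ins for the generator and evaluator: a failed sprint's QA report becomes the next sprint's feedback, and a sprint that changes nothing is treated as a no‑op and retried.

```python
# Redo loop sketch: feed QA results back, detect no-op sprints by comparing
# worktree snapshots, retry up to a sprint budget.
import hashlib
from typing import Callable

def redo_loop(attempt: Callable[[str], str],   # feedback -> worktree snapshot
              qa: Callable[[str], str],        # snapshot -> "" if green, else report
              max_sprints: int = 5) -> bool:
    feedback, last_hash = "", None
    for _ in range(max_sprints):
        snapshot = attempt(feedback)
        digest = hashlib.sha256(snapshot.encode()).hexdigest()
        if digest == last_hash:
            # No effective change since last sprint: retry with sharper feedback.
            feedback = "previous sprint made no effective change; act, don't talk"
            continue
        last_hash = digest
        report = qa(snapshot)                  # independent evaluation
        if not report:
            return True                        # full green
        feedback = report                      # QA result goes back into context
    return False
```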

Pitfalls Encountered

Codex needs stricter constraints. Unlike Claude Code, Codex requires explicit input files, output paths, schemas, and clear task definitions; many implicit behaviours had to be made explicit.
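One way to make those implicit behaviours explicit is to issue every task as a contract that pins input files, output paths, and the expected schema, and to refuse underspecified tasks outright. The field names below loosely follow sprint_contract.json and are illustrative.

```python
# Strict task contract builder: an underspecified task never reaches Codex.
import json

REQUIRED = ("task", "input_files", "output_paths", "output_schema")

def build_contract(**fields) -> str:
    """Serialize a fully specified task; raise if any required field is missing."""
    missing = [k for k in REQUIRED if k not in fields]
    if missing:
        raise ValueError(f"underspecified task, missing: {missing}")
    return json.dumps(fields, indent=2, ensure_ascii=False)
```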

Evaluator can be fooled. A lax generator prompt leads Codex to produce a “talk‑first, act‑later” pattern that appears productive but makes no real changes.

Local engineering details matter. Non‑ASCII Windows paths, WebSocket fallback endpoints, repository path handling, and subprocess encoding each caused hours of debugging.
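The subprocess‑encoding pitfall, for example, has a small fix: on Windows the default console code page garbles non‑ASCII output, so the encoding must be pinned explicitly and decode errors replaced rather than allowed to crash a long run. A sketch:

```python
# Pin subprocess I/O encoding instead of trusting the platform default,
# so non-ASCII paths and output survive on Windows.
import subprocess

def run_tool(cmd: list[str]) -> str:
    result = subprocess.run(
        cmd,
        capture_output=True,
        text=True,
        encoding="utf-8",   # never rely on the platform default code page
        errors="replace",   # a bad byte should not kill a long-running harness
    )
    return result.stdout
```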

Experimental Results

Three small disposable repositories were used to run real experiments. The observed metrics are:

Fresh‑clone task count: 3

Final pass rate: 3 / 3

First‑round pass: 1 / 3

First‑round failures repaired: 2 / 2

Median sprints to first full‑green: 2

These results demonstrate the key behavior described by Anthropic: the generator actually modifies code, the evaluator independently validates it, and the system self‑heals after an initial failure.

Significance

Anthropic’s contribution was to move the community’s focus from model cleverness to outer‑system design. By successfully transplanting the harness to a cheaper model stack, the methodology is shown to be independent of a single vendor and usable by independent developers and small teams.

Current State and Next Steps

The project is not a verbatim copy of Anthropic’s internal harness and has not been benchmarked at scale. However, the migration path is proven: Anthropic’s harness design works without Claude Code, and Codex + GPT can sustain the same architecture with a more affordable cost structure.

Project link: https://github.com/LongWeihan/codex-long-running-harness
Tags: GPT, Anthropic, Codex, Claude Code, Agent Harness, long-running agents
Written by Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.