Migrating from Claude Code to Codex Using Anthropic Harness Principles

The article analyzes Anthropic's harness design for long‑running agent coding, reproduces its three‑role architecture, adapts it to Codex and GPT to cut costs, and presents experimental results that confirm the migrated system remains reliable and self‑healing.

Machine Learning Algorithms & Natural Language Processing

Harness Design Overview

Anthropic’s engineering post Harness Design for Long‑Running Apps defines a harness for agent‑driven long‑running tasks. The core design consists of three rules:

Separate roles: a planner expands requirements, a generator implements the current sprint, and an evaluator independently validates the output. The three components never overlap, mirroring the proven practice that development and testing should be performed by different agents.

Artifact‑based memory: all state is persisted as concrete files rather than relying on volatile model context. Example artifacts include product_spec.md, feature_backlog.json, sprint_contract.json, qa_report.json, progress.jsonl, and the git revision.

Executable acceptance: the final decision comes from an independent evaluator that runs real tests and checks UI interactions, API responses, and regression behavior.
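
The three rules above can be sketched as a toy orchestration loop. All role implementations here are stand-ins (in the real harness the planner and evaluator call GPT and the generator calls Codex); only the control flow reflects the design:

```python
# Toy sketch of the three-role loop. planner/generator/evaluator are
# stand-ins: the real harness delegates these roles to GPT and Codex.

def planner(requirements):
    # Planner: expand a requirement string into a feature backlog.
    return [{"id": i, "task": t.strip()} for i, t in enumerate(requirements.split(";"))]

def generator(contract, state):
    # Stand-in generator: feature 0 succeeds immediately; every other
    # feature succeeds only after a failure has been fed back once.
    state[contract["id"]] = state.get(contract["id"], 0) + 1
    return contract["id"] == 0 or state[contract["id"]] >= 2

def evaluator(implemented):
    # Independent acceptance: judge only the generator's real output.
    return {"passed": implemented, "failures": [] if implemented else ["test_feature failed"]}

def run_harness(requirements, max_sprints=3):
    state, sprints = {}, 0
    for feature in planner(requirements):
        contract = {"id": feature["id"], "task": feature["task"], "failures": []}
        for _ in range(max_sprints):
            sprints += 1
            ok = generator(contract, state)
            report = evaluator(ok)
            if report["passed"]:
                break
            contract["failures"] = report["failures"]  # redo-loop input
        else:
            return False, sprints  # sprint budget exhausted
    return True, sprints

# run_harness("add login; add search") → (True, 3): the first feature
# passes immediately, the second needs one repair sprint.
```

The roles stay strictly separated: the evaluator never trusts the generator's self-report, only its observable result.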

Motivation for Migration

Applying the harness to Claude Code produced high quality results but incurred prohibitive cost, because each sprint required multi‑round execution, browsing, and repeated evaluations. To make the approach affordable for individual developers, the execution plane was swapped: Codex handles the generator role (direct repository and file‑system manipulation), while GPT fulfills the planner and evaluator duties (prompt structuring, summarization, and test validation). This preserves the three‑role architecture while reducing token consumption.

Implementation Details

Three‑role structure retained unchanged.

Artifact‑based memory stored under .harness/runs/<run_id>/ with files such as run_manifest.json, product_spec.md, design_language.md, feature_backlog.json, progress.jsonl, sprint_contract.json, self_test_report.json, qa_report.json, and final_report.json. Each artifact is versioned and traceable without relying on session memory.
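
The artifact layout above can be sketched with a few small helpers. The file names match the article; the helper functions themselves are assumptions, not the harness's published API:

```python
import json
import pathlib
import time
import uuid

# Sketch of artifact-based memory under .harness/runs/<run_id>/.
# File names follow the article; the helpers are illustrative.

def write_artifact(run_dir, name, payload):
    # Persist state as a concrete file instead of relying on model context.
    path = pathlib.Path(run_dir) / name
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path

def new_run(root=".harness/runs"):
    # Each run gets its own directory, seeded with a manifest.
    run_dir = pathlib.Path(root) / uuid.uuid4().hex[:8]
    run_dir.mkdir(parents=True, exist_ok=True)
    write_artifact(run_dir, "run_manifest.json", {"created": time.time()})
    return run_dir

def log_progress(run_dir, event):
    # progress.jsonl is append-only, so every sprint stays traceable.
    with open(pathlib.Path(run_dir) / "progress.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
```

Because every state transition lands on disk, a crashed or restarted session can resume from the artifacts alone.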

Independent evaluator runs local tests, reads actual program output, and can be extended to browser or API checks.

Multi‑round redo loop writes failed test results back into the sprint contract, triggers another generator pass, and repeats until the evaluator reports success.
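
The redo loop can be sketched concretely. `evaluate_with_pytest` runs real tests as a subprocess, and `redo_loop` writes the failure output back into sprint_contract.json before the next generator pass; the function names and the pytest invocation are assumptions about how "run local tests" is implemented, not the harness's published API:

```python
import json
import pathlib
import subprocess
import sys

# Sketch of the multi-round redo loop (illustrative, not the actual harness code).

def evaluate_with_pytest(repo_dir):
    # Independent acceptance: run the repo's real test suite.
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", "-q"],
        cwd=repo_dir, capture_output=True, text=True, encoding="utf-8",
    )
    return {"passed": proc.returncode == 0, "output": proc.stdout + proc.stderr}

def redo_loop(contract_path, run_generator, evaluate, max_rounds=4):
    contract_path = pathlib.Path(contract_path)
    for round_no in range(1, max_rounds + 1):
        run_generator(contract_path)   # generator pass (Codex in this setup)
        report = evaluate()            # independent evaluator pass
        if report["passed"]:
            return round_no
        # Inject the real failure text into the contract so the next pass
        # cannot rely on an optimistic self-report.
        contract = json.loads(contract_path.read_text(encoding="utf-8"))
        contract["last_failures"] = report["output"][-4000:]
        contract_path.write_text(json.dumps(contract, indent=2), encoding="utf-8")
    return None  # sprint budget exhausted
```

Feeding the evaluator's raw output back through the contract file is what makes the loop self-healing: failures become part of the next sprint's input rather than lost context.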

Practical Pitfalls

Codex requires fully specified inputs (file paths, JSON schemas, and explicit actions); otherwise it produces vague, non‑actionable output.
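
A "fully specified" sprint contract might look like the following. The field names are illustrative, not the article's actual schema; the point is that every field pins down a concrete path, command, or action:

```python
import json

# Illustrative sprint_contract.json payload. Field names are assumptions;
# what matters is that nothing is left for the generator to guess.
contract = {
    "repo": "repos/todo-app",
    "files_to_edit": ["src/api/todos.py", "tests/test_todos.py"],
    "action": "implement DELETE /todos/<id> returning 204, or 404 if missing",
    "acceptance": {
        "command": "python -m pytest tests/test_todos.py -q",
        "must_pass": True,
    },
    "last_failures": [],
}
serialized = json.dumps(contract, indent=2)
```

With an explicit acceptance command in the contract, the evaluator's pass/fail criterion is also machine-checkable rather than a matter of interpretation.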

The evaluator can be tricked by optimistic generator responses; explicit injection of real test failures and automatic retry logic were added to harden the loop.

Low‑level engineering details (non‑ASCII Windows paths, WebSocket fallbacks, subprocess encoding, repository path handling) caused hours of debugging and are essential for a robust harness.
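
One of those low-level fixes can be shown concretely: on Windows, text-mode subprocess output decodes with the legacy ANSI code page by default, so logs from repositories with non-ASCII paths get mangled. A minimal illustration (not the harness's actual code) is to force UTF-8 with a lossy fallback:

```python
import subprocess
import sys

# Force UTF-8 decoding of subprocess output so non-ASCII paths and
# messages survive on Windows; errors="replace" avoids decode crashes.
def run_checked(cmd, cwd=None):
    return subprocess.run(
        cmd, cwd=cwd, capture_output=True,
        text=True, encoding="utf-8", errors="replace",
    )

result = run_checked([sys.executable, "-c", "print('héllo')"])
```

Without the explicit `encoding`, the same call can raise `UnicodeDecodeError` or silently corrupt evaluator logs, which then poisons the failure text fed back into the redo loop.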

Experimental Evaluation

Three small disposable repositories were used to benchmark the migrated harness. The results are summarised below:

Metric                          Result
------------------------------  ------
Fresh‑clone tasks               3
Final pass rate                 3 / 3
First‑round pass                1 / 3
First‑round failures repaired   2 / 2
Median sprints to full green    2

Only one of the three tasks passed on the first sprint; the other two failed initially, were repaired in a second sprint, and all three completed successfully. The median of two sprints demonstrates that the harness forces the generator to make real code changes and the evaluator to perform independent acceptance, while automatically recovering from early failures.

Conclusions

The migration shows that the harness design is model‑agnostic: Codex + GPT can replace Claude Code without altering the outer architecture. By moving the costly generator component to a cheaper model, the overall cost structure becomes suitable for frequent trial‑and‑error cycles. The experiment confirms that the three‑role, artifact‑based, executable‑acceptance pattern reliably drives long‑running agent pipelines.

Project repository: https://github.com/LongWeihan/codex-long-running-harness

Tags: AI engineering, GPT, Codex, Claude Code, harness design, agent coding, long-running apps
Written by Machine Learning Algorithms & Natural Language Processing

Focused on frontier AI technologies, empowering AI researchers' progress.
