Long‑Running Agents: From Ralph Loop to Hand‑over‑Ready Harness
The article analyzes the challenges of long‑running AI agents, showing that persistence alone is insufficient and that reliable hand‑over requires explicit specifications, external state files, drift mitigation, sub‑agents, and a verifiable evidence chain to keep the work understandable for the next model or human.
Why simple persistence isn’t enough
OpenAI Codex’s /goal feature keeps an agent running continuously, but the real difficulty of long‑running tasks is ensuring the agent stays on the correct path after many hours, multiple context windows, and sub‑agent hand‑offs.
Key observations
/goal solves continuity, not correctness.
The original Ralph Loop accumulates goal drift, context drift, and quality drift each round.
For long‑running agents, "diligent drift" is a bigger risk than premature termination.
Pre‑spec files (GOAL.md, PLAN.md, STANDARDS.md, PROGRESS.md) act as hand‑over evidence.
Sub‑agents are valuable for isolation, not just role‑play.
Multiple agents are expensive and should be used as a quality‑governance tool when the task is large, the benefit high, and the verification path clear.
The watershed for long‑running agents is moving from "can continue" to "can be handed over, rolled back, and replayed".
Putting several ideas together
The Ralph Loop, introduced by Geoffrey Huntley, avoids piling failures, attempts, and logs into an ever‑growing conversation. Each round starts from a relatively clean context using files, code, tests, and git history.
Block’s Goose tool implements a similar mechanism: a worker does work each round, a reviewer writes a summary, feedback and completion markers into files, and the next round reads them again.
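The round structure described above can be sketched as a loop in which every iteration rebuilds its context from files on disk rather than from earlier conversation. This is a minimal illustration under stated assumptions, not Huntley's or Goose's actual implementation: `run_worker`, `run_reviewer`, the file names `task.md` and `review-feedback.md`, and the `DONE` marker are all hypothetical stand-ins.

```python
from pathlib import Path
from typing import Callable


def ralph_loop(workdir: Path,
               run_worker: Callable[[str], None],
               run_reviewer: Callable[[str], str],
               max_rounds: int = 5) -> int:
    """Run worker/reviewer rounds and return how many rounds were executed."""
    task = (workdir / "task.md").read_text()
    feedback_file = workdir / "review-feedback.md"  # hypothetical file name
    for round_no in range(1, max_rounds + 1):
        # Fresh context each round: the task plus the last review, no chat history.
        feedback = feedback_file.read_text() if feedback_file.exists() else ""
        context = f"TASK:\n{task}\n\nLAST REVIEW:\n{feedback}"
        run_worker(context)               # the worker edits files in workdir
        verdict = run_reviewer(context)   # the reviewer inspects the result
        feedback_file.write_text(verdict)  # persisted for the next round
        if verdict.strip().startswith("DONE"):  # hypothetical completion marker
            return round_no
    return max_rounds
```

The only state that crosses a round boundary is what was written to disk, which is the property the loop is meant to enforce.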
OpenAI extended this idea with /goal, defining a durable objective that persists across multiple rounds and requires a clear purpose, constraints, verification method, and stop condition.
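The four ingredients of a durable objective can be written down as a small record that refuses to start while any of them is blank. The sketch below is an assumption for illustration only; `GoalSpec` and its field names are hypothetical and do not reflect how Codex stores a /goal.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GoalSpec:
    purpose: str
    constraints: str
    verification: str
    stop_condition: str

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are empty or whitespace only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


spec = GoalSpec(
    purpose="Reduce p95 latency of the search endpoint",
    constraints="No public API changes",
    verification="",  # deliberately left empty for the example
    stop_condition="Benchmark target met and all tests green",
)
print(spec.missing_fields())  # ['verification']
```

A goal without a verification method or stop condition is exactly the case where an agent can keep running without anyone being able to say whether it is finished.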
Anthropic’s recent engineering posts on long‑running agents and context engineering describe the same problem: agents need more than a long chat window; they need an external, verifiable work set.
From "can continue" to "can be handed over"
Long‑running agents face three kinds of drift:
Goal drift: the agent forgets the original problem and pursues a locally complete solution, causing output to diverge from real needs.
Context drift: compression, truncation, or summarisation loses key information, so later decisions are based on incomplete facts.
Quality drift: the agent becomes over‑confident that it has finished correctly, leading to missing tests, boundary errors, or architectural decay.
Pre‑specifications cut wrong branches early, preventing token waste on a bad path.
Step 1 – Clarify critical forks before starting
An interview‑style phase asks the agent to enumerate major decisions such as bug‑fix vs. refactor, backward compatibility, performance vs. maintainability, test coverage scope, UI design system, and failure‑handling policy. Leaving these forks undefined lets the agent make ad‑hoc choices that later cause drift.
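The interview phase can be treated as a gate: work does not begin until every fork has an explicit answer. The following is a minimal sketch of that idea; the fork names come from the paragraph above, while `unresolved_forks` and the sample answers are hypothetical.

```python
FORKS = [
    "bug-fix vs. refactor",
    "backward compatibility",
    "performance vs. maintainability",
    "test coverage scope",
    "UI design system",
    "failure-handling policy",
]


def unresolved_forks(answers: dict[str, str]) -> list[str]:
    """Return forks that have no recorded decision yet, in checklist order."""
    return [fork for fork in FORKS if not answers.get(fork, "").strip()]


# Hypothetical answers collected during the interview phase.
answers = {
    "bug-fix vs. refactor": "bug-fix only",
    "backward compatibility": "must be preserved",
}
open_forks = unresolved_forks(answers)
if open_forks:
    print("Ask before starting:", ", ".join(open_forks))
```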
Step 2 – Write memory outside the window
External files become the reliable source of truth. Jarrod Watts’ long‑running‑agent‑skill maintains a set of files:
GOAL.md – the objective and non‑goals
STANDARDS.md – architectural constraints
IMPLEMENT.md – implementation notes
PROGRESS.md – incremental progress and decisions
These files serve as hand‑over evidence for the next executor and as project documentation for humans.
Cautionary example: an agent wrote in PROGRESS.md that a certain optimisation was mathematically impossible. Subsequent agents accepted this as fact and stopped trying, until a human intervened and discovered the conclusion was wrong.
Memory files should be layered into four categories:
Facts: which files changed, which tests passed, safe commit points.
Observations: phenomena seen during attempts, unstable paths.
Hypotheses: current guesses that are not yet verified.
Decisions: committed trade‑offs that must not be overwritten.
Mixing hypotheses with facts leads to a polluted hand‑over.
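One way to keep the four layers apart is to tag every log entry with its kind and to let a hypothesis become a fact only when evidence is attached. This is a sketch under assumptions: the `Entry` type, the `promote` and `render` helpers, and the evidence string are hypothetical and are not part of any tool mentioned here.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class Kind(Enum):
    FACT = "fact"
    OBSERVATION = "observation"
    HYPOTHESIS = "hypothesis"
    DECISION = "decision"


@dataclass(frozen=True)
class Entry:
    kind: Kind
    text: str
    evidence: Optional[str] = None  # e.g. a test name or commit hash


def promote(entry: Entry, evidence: str) -> Entry:
    """Turn a hypothesis into a fact, but only with non-empty evidence."""
    if entry.kind is not Kind.HYPOTHESIS:
        raise ValueError("only hypotheses can be promoted")
    if not evidence.strip():
        raise ValueError("promotion requires evidence")
    return replace(entry, kind=Kind.FACT, evidence=evidence)


def render(entry: Entry) -> str:
    """Format one log line so the next reader sees the kind first."""
    suffix = f" (evidence: {entry.evidence})" if entry.evidence else ""
    return f"[{entry.kind.value.upper()}] {entry.text}{suffix}"


guess = Entry(Kind.HYPOTHESIS, "Optimisation X cannot work")
print(render(guess))  # [HYPOTHESIS] Optimisation X cannot work
```

Had the "mathematically impossible" claim from the cautionary example been logged this way, later agents would have seen it labelled as an unverified guess rather than as an established fact.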
Step 3 – Use independent context for review
Sub‑agents provide isolation. A fresh reviewer agent reads only the goal, diff, standards, test results, and key decisions, then asks simple questions such as:
Does this change satisfy the objective?
Did we introduce unintended behaviour?
Are tests covering only the happy path?
Did we silently break previous behaviour?
Can the next engineer understand the architectural trade‑off?
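Isolation here means that the reviewer's entire context is assembled from the listed artifacts and nothing else. The sketch below shows one possible way to build such a prompt; `build_review_prompt` and the section labels are hypothetical, and the caller is assumed to supply each artifact as plain text.

```python
REVIEW_QUESTIONS = [
    "Does this change satisfy the objective?",
    "Did we introduce unintended behaviour?",
    "Are tests covering only the happy path?",
    "Did we silently break previous behaviour?",
    "Can the next engineer understand the architectural trade-off?",
]


def build_review_prompt(goal: str, diff: str, standards: str,
                        test_results: str, decisions: str) -> str:
    """Assemble the reviewer's whole context from five artifacts only."""
    sections = {
        "GOAL": goal,
        "DIFF": diff,
        "STANDARDS": standards,
        "TEST RESULTS": test_results,
        "KEY DECISIONS": decisions,
    }
    parts = [f"## {name}\n{body.strip()}" for name, body in sections.items()]
    questions = "\n".join(f"{i}. {q}" for i, q in enumerate(REVIEW_QUESTIONS, 1))
    return "\n\n".join(parts) + "\n\n## QUESTIONS\n" + questions
```

Because the function takes no conversation history as input, the worker's earlier reasoning and failed attempts cannot leak into the review.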
Isolation has a price. Anthropic reports that a single agent typically uses about 4× the tokens of a chat interaction and that multi‑agent systems use about 15×, so multiple agents should be reserved for high‑value, parallelisable tasks.
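To make the multipliers quoted above concrete, the arithmetic can be run against a baseline budget. The baseline of 20,000 tokens is a made-up figure for illustration, and `estimated_tokens` is a hypothetical helper.

```python
def estimated_tokens(chat_baseline: int, multiplier: float) -> int:
    """Scale a chat-sized token budget by a usage multiplier."""
    return round(chat_baseline * multiplier)


baseline = 20_000  # hypothetical tokens for a plain chat on the same task
for label, multiplier in [("4x", 4), ("15x", 15)]:
    print(label, estimated_tokens(baseline, multiplier))
# 4x 80000
# 15x 300000
```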
Evidence chain for hand‑over
Combine the previous steps into a three‑layer evidence chain:
Goal layer: What exactly should be built? – Goals, non‑goals, acceptance criteria, pre‑clarifications.
State layer: Where are we now? – Progress log, decision log, git history, milestone state.
Governance layer: Are we doing it right? – Tests, review agent, lint, type‑check, human checkpoint.
If any layer is missing, the long‑running task becomes fragile: without the goal layer, the agent may satisfy tests that do not reflect user intent; without the state layer, the next agent cannot tell where the work stands; without the governance layer, nobody can verify correctness.
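The chain can be checked mechanically by mapping each layer to the artifacts that back it and reporting any layer that has none. This is a minimal sketch; the layer questions are taken from the list above, while `missing_layers` and the sample artifact inventory are hypothetical.

```python
LAYERS = {
    "goal": "What exactly should be built?",
    "state": "Where are we now?",
    "governance": "Are we doing it right?",
}


def missing_layers(artifacts: dict[str, list[str]]) -> list[str]:
    """Return layers that have no artifact backing them."""
    return [layer for layer in LAYERS if not artifacts.get(layer)]


# Hypothetical inventory of what the run has produced so far.
artifacts = {
    "goal": ["GOAL.md"],
    "state": ["PROGRESS.md", "git history"],
    "governance": [],  # no tests or review recorded yet
}
for layer in missing_layers(artifacts):
    print(f"Missing {layer} layer: cannot answer '{LAYERS[layer]}'")
```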
Hand‑over criteria
What is the current objective?
Which facts have been established?
Which items are still hypotheses?
Which decisions must not be altered?
Which tests prove the current state?
Where is the latest safe rollback point?
If these questions cannot be answered, the artifact is effectively a tangled context that no one can safely continue.
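The six questions can be turned into a checklist that a run must complete before it is considered ready for hand‑over. The sketch below is one possible shape, not a format used by any tool named in this article; `HandoverNote`, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class HandoverNote:
    # None means "not answered yet"; an empty list is a valid answer ("none").
    objective: Optional[str] = None
    facts: Optional[list[str]] = None
    hypotheses: Optional[list[str]] = None
    frozen_decisions: Optional[list[str]] = None
    proving_tests: Optional[list[str]] = None
    rollback_commit: Optional[str] = None

    def unanswered(self) -> list[str]:
        """Names of the hand-over questions that still have no answer."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


note = HandoverNote(
    objective="Migrate session store to Redis",
    facts=["tests/test_session.py passes"],
    hypotheses=[],  # explicitly none open
    frozen_decisions=["Keep cookie format unchanged"],
    proving_tests=["tests/test_session.py"],
)
print(note.unanswered())  # ['rollback_commit']
```

Distinguishing "no answer" from "the answer is an empty list" matters here: a run with no open hypotheses is in good shape, whereas a run that never recorded its hypotheses is not.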
Production‑grade standard
The real metric is whether the work left after many hours can be handed over to a human, a new agent, a reviewer, CI, or a future model. Traditional software engineering relies on commit messages, PRs, tests, ADRs, and changelogs to avoid dependence on a single person’s temporary context; the same principle applies to agentic engineering.
Conclusion
Codex /goal, Jarrod Watts’ long‑running‑agent‑skill, and Anthropic’s harness experiments all demonstrate that continuity alone is insufficient. A robust long‑running agent must produce an auditable, verifiable work set that can be handed over, rolled back, and replayed.
References
OpenAI Codex /goal official use case: https://developers.openai.com/codex/use-cases/follow-goals
OpenAI Codex CLI 0.128.0 release: https://github.com/openai/codex/releases/tag/rust-v0.128.0
Jarrod Watts long‑running‑agent‑skill: https://github.com/jarrodwatts/long-running-agent-skill
Anthropic: Effective harnesses for long‑running agents: https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
Anthropic: Effective context engineering for AI agents: https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
Anthropic: How we built our multi‑agent research system: https://www.anthropic.com/engineering/multi-agent-research-system
Block: Ralph Loop implementation (Goose docs): https://block.github.io/goose/docs/tutorials/ralph-loop/
Geoffrey Huntley interview: Inventing the Ralph Wiggum Loop: https://linearb.io/dev-interrupted/podcast/inventing-the-ralph-wiggum-loop
Andrej Karpathy on long‑running orchestrator: https://x.com/karpathy/status/2026731645169185220
Boris Cherny on sub‑agents and independent context windows (Latent Space interview, Claude Code tweet)
Matt Pocock skills repository: https://github.com/mattpocock/skills