Long‑Running Agents: From Ralph Loop to Hand‑over‑Ready Harness

The article analyzes the challenges of long‑running AI agents, showing that persistence alone is insufficient and that reliable hand‑over requires explicit specifications, external state files, drift mitigation, sub‑agents, and a verifiable evidence chain to keep the work understandable for the next model or human.

Architect

May 10, 2026

Why simple persistence isn’t enough

OpenAI Codex’s /goal feature keeps an agent running continuously, but the real difficulty of long‑running tasks is ensuring the agent stays on the correct path after many hours, multiple context windows, and sub‑agent hand‑offs.

Key observations

/goal solves continuity, not correctness.

The original Ralph Loop accumulates goal drift, context drift, and quality drift each round.

For long‑running agents, "diligent drift" (steady work in the wrong direction) is a bigger risk than premature termination.

Pre‑spec files (GOAL.md, PLAN.md, STANDARDS.md, PROGRESS.md) act as hand‑over evidence.

Sub‑agents are valuable for isolation, not just role‑play.

Multiple agents are expensive and should be used as a quality‑governance tool when the task is large, the benefit high, and the verification path clear.

The watershed for long‑running agents is moving from "can continue" to "can be handed over, rolled back, and replayed".

Putting several ideas together

The Ralph Loop, introduced by Geoffrey Huntley, avoids piling failures, attempts, and logs into an ever‑growing conversation. Each round starts from a relatively clean context using files, code, tests, and git history.

Block’s Goose tool implements a similar mechanism: a worker does work each round, a reviewer writes a summary, feedback and completion markers into files, and the next round reads them again.
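The round structure described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Huntley's or Goose's implementation: `run_worker`, `run_reviewer`, and the file name `feedback.md` are hypothetical stand‑ins. The point it shows is that each round rebuilds its context from what is on disk rather than from an accumulated conversation.

```python
from pathlib import Path
from typing import Callable


def ralph_loop(workdir: Path,
               run_worker: Callable[[str], None],
               run_reviewer: Callable[[], tuple[str, bool]],
               max_rounds: int = 10) -> int:
    """Run worker/reviewer rounds until the reviewer marks completion.

    Returns the number of rounds that were executed.
    """
    feedback_file = workdir / "feedback.md"  # hypothetical file name
    for round_no in range(1, max_rounds + 1):
        # Fresh context: the worker sees only what the last round left on disk.
        feedback = feedback_file.read_text() if feedback_file.exists() else ""
        run_worker(feedback)
        # The reviewer returns its notes and a completion flag.
        summary, done = run_reviewer()
        feedback_file.write_text(summary)
        if done:
            return round_no
    return max_rounds
```

The `max_rounds` cap is a design choice of this sketch: a loop with no upper bound has no defence against a reviewer that never marks the work complete.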

OpenAI extended this idea with /goal, defining a durable objective that persists across multiple rounds and requires a clear purpose, constraints, verification method, and stop condition.
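The four required parts of a durable objective can be written down as a small record that refuses to be treated as complete while any part is empty. The sketch below is an assumption for illustration only: `GoalSpec` and its field names are hypothetical and do not reflect how Codex stores or parses a goal.

```python
from dataclasses import dataclass, field


@dataclass
class GoalSpec:
    """Hypothetical record of a durable objective (not Codex's format)."""
    purpose: str
    constraints: list[str] = field(default_factory=list)
    verification: str = ""
    stop_condition: str = ""

    def missing_parts(self) -> list[str]:
        """Return the names of the parts that are still empty."""
        parts = {
            "purpose": self.purpose.strip(),
            "constraints": self.constraints,
            "verification": self.verification.strip(),
            "stop_condition": self.stop_condition.strip(),
        }
        return [name for name, value in parts.items() if not value]
```

A harness could call `missing_parts()` before starting the first round and decline to run until the list is empty.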

Anthropic’s recent engineering posts on long‑running agents and context engineering describe the same problem: agents need more than a long chat window; they need an external, verifiable work set.

From "can continue" to "can be handed over"

Long‑running agents face three kinds of drift:

Goal drift : the agent forgets the original problem and pursues a locally complete solution, causing output to diverge from real needs.

Context drift : compression, truncation, or summarisation loses key information, so later decisions are based on incomplete facts.

Quality drift : the agent becomes over‑confident that it has finished correctly, leading to missing tests, boundary errors, or architectural decay.

Pre‑specifications cut wrong branches early, preventing token waste on a bad path.

Step 1 – Clarify critical forks before starting

An interview‑style phase asks the agent to enumerate major decisions such as bug‑fix vs. refactor, backward compatibility, performance vs. maintainability, test coverage scope, UI design system, and failure‑handling policy. Leaving these forks undefined lets the agent make ad‑hoc choices that later cause drift.
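One way to make the interview phase checkable is to record every fork with its options and the chosen answer, and to block the run while any fork is open. This is a minimal sketch; the `Fork` type and the example questions are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fork:
    question: str
    options: tuple[str, ...]
    choice: Optional[str] = None  # None means not yet decided


def unresolved(forks: list[Fork]) -> list[str]:
    """Return questions whose choice is missing or not one of the options."""
    return [f.question for f in forks if f.choice not in f.options]


# Illustrative forks; a real project would list its own.
forks = [
    Fork("Scope", ("bug-fix", "refactor"), "bug-fix"),
    Fork("Backward compatibility", ("required", "may break")),
]
open_questions = unresolved(forks)  # the second fork is still open
```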

Step 2 – Write memory outside the window

External files become the reliable source of truth. Jarrod Watts’ long‑running‑agent‑skill maintains a set of files:

GOAL.md : the objective and non‑goals

STANDARDS.md : architectural constraints

IMPLEMENT.md : implementation notes

PROGRESS.md : incremental progress and decisions

These files serve as hand‑over evidence for the next executor and as project documentation for humans.
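A harness can make sure these files exist before the first round. The sketch below uses the four file names listed above; the one‑line header it writes and the function name `ensure_memory_files` are assumptions for illustration, not taken from the skill's repository.

```python
from pathlib import Path

# File names and purposes as listed in the article.
MEMORY_FILES = {
    "GOAL.md": "Objective and non-goals",
    "STANDARDS.md": "Architectural constraints",
    "IMPLEMENT.md": "Implementation notes",
    "PROGRESS.md": "Incremental progress and decisions",
}


def ensure_memory_files(root: Path) -> list[str]:
    """Create any missing memory file with a short header.

    Existing files are never overwritten. Returns the names created.
    """
    created = []
    for name, purpose in MEMORY_FILES.items():
        path = root / name
        if not path.exists():
            path.write_text(f"# {name}\n\n{purpose}\n", encoding="utf-8")
            created.append(name)
    return created
```

Never overwriting an existing file matters here: the files are the memory, so a scaffold step that resets them would erase the hand‑over evidence.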

Cautionary example : an agent wrote in PROGRESS.md that a certain optimisation was mathematically impossible. Subsequent agents accepted this as fact and stopped trying, until a human intervened and discovered the conclusion was wrong.

Memory files should be layered into four categories:

Facts : which files changed, which tests passed, safe commit points.

Observations : phenomena seen during attempts, unstable paths.

Hypotheses : current guesses that are not yet verified.

Decisions : committed trade‑offs that must not be overwritten.

Mixing hypotheses with facts leads to a polluted hand‑over.
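The four categories can be enforced at write time instead of by convention. In the sketch below, an entry may only be recorded as a fact if it cites evidence; otherwise it has to be filed as a hypothesis. The `Entry` type, the `evidence` field, and the bracketed output format are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    FACT = "fact"
    OBSERVATION = "observation"
    HYPOTHESIS = "hypothesis"
    DECISION = "decision"


@dataclass(frozen=True)
class Entry:
    kind: Kind
    text: str
    evidence: str = ""  # e.g. a test name or commit hash (hypothetical field)


def add_entry(log: list[Entry], entry: Entry) -> None:
    """Append an entry, rejecting facts that carry no evidence."""
    if entry.kind is Kind.FACT and not entry.evidence.strip():
        raise ValueError("a fact needs evidence; record it as a hypothesis")
    log.append(entry)


def render(log: list[Entry]) -> str:
    """Render one line per entry, prefixed with its category."""
    lines = []
    for e in log:
        suffix = f" (evidence: {e.evidence})" if e.evidence else ""
        lines.append(f"[{e.kind.value}] {e.text}{suffix}")
    return "\n".join(lines)
```

Under this rule, the "mathematically impossible" claim from the cautionary example could not have been written as a fact without a proof or test to point to, so later agents would have seen it labelled as a hypothesis.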

Step 3 – Use independent context for review

Sub‑agents provide isolation. A fresh reviewer agent reads only the goal, diff, standards, test results, and key decisions, then asks simple questions such as:

Does this change satisfy the objective?

Did we introduce unintended behaviour?

Are tests covering only the happy path?

Did we silently break previous behaviour?

Can the next engineer understand the architectural trade‑off?
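The isolation comes from what the reviewer is not given. A sketch of assembling the reviewer's input from exactly the five items named above, and nothing from the worker's conversation, might look like this; `ReviewPacket` and `build_review_prompt` are hypothetical names, and the prompt layout is an assumption.

```python
from dataclasses import dataclass

# The five review questions from the article.
REVIEW_QUESTIONS = [
    "Does this change satisfy the objective?",
    "Did we introduce unintended behaviour?",
    "Are tests covering only the happy path?",
    "Did we silently break previous behaviour?",
    "Can the next engineer understand the architectural trade-off?",
]


@dataclass
class ReviewPacket:
    goal: str
    diff: str
    standards: str
    test_results: str
    key_decisions: str


def build_review_prompt(packet: ReviewPacket) -> str:
    """Build the reviewer's whole input from the packet and the questions."""
    sections = [
        ("Goal", packet.goal),
        ("Diff", packet.diff),
        ("Standards", packet.standards),
        ("Test results", packet.test_results),
        ("Key decisions", packet.key_decisions),
    ]
    body = "\n\n".join(f"{title}:\n{text}" for title, text in sections)
    questions = "\n".join(
        f"{i}. {q}" for i, q in enumerate(REVIEW_QUESTIONS, start=1)
    )
    return f"{body}\n\nAnswer each question:\n{questions}"
```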

Anthropic reports that a single agent typically uses about 4× the tokens of a chat interaction, and a multi‑agent system about 15×, so multi‑agent setups should be reserved for high‑value, parallelisable tasks.
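To make the cost concrete, the arithmetic can be sketched with rough multipliers of 4× for a single agent and 15× for a multi‑agent system, both relative to a chat interaction (the averages given in Anthropic's multi‑agent research post). The 10,000‑token baseline below is a made‑up figure, and `worth_multi_agent` simply encodes the three conditions this article names; neither is a measured threshold.

```python
# Rough token multipliers relative to a plain chat interaction.
MULTIPLIER = {"chat": 1, "single_agent": 4, "multi_agent": 15}


def estimated_tokens(chat_baseline: int, mode: str) -> int:
    """Scale a chat-sized token budget by the multiplier for `mode`."""
    return chat_baseline * MULTIPLIER[mode]


def worth_multi_agent(task_is_large: bool,
                      benefit_is_high: bool,
                      verification_is_clear: bool) -> bool:
    """All three conditions from the article must hold."""
    return task_is_large and benefit_is_high and verification_is_clear


baseline = 10_000  # hypothetical chat-sized budget
single = estimated_tokens(baseline, "single_agent")  # 40_000
multi = estimated_tokens(baseline, "multi_agent")    # 150_000
```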

Evidence chain for hand‑over

Combine the previous steps into a three‑layer evidence chain:

Goal layer : What exactly should be built? – Goal, non‑goal, acceptance criteria, pre‑clarifications.

State layer : Where are we now? – Progress log, decision log, git history, milestone state.

Governance layer : Are we doing it right? – Tests, review agent, lint, type‑check, human checkpoint.

If any layer is missing, the long‑running task becomes fragile: without the state layer the next agent cannot know the current position, without the governance layer it cannot verify correctness, and without the goal layer it may pass tests that do not reflect user intent.
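The three layers can be checked mechanically against the artifacts a project actually has. In the sketch below, the artifact names come from the list above, while two things are assumptions of this example: the rule that one artifact is enough to count a layer as covered, and the pairing of each layer with a specific failure mode.

```python
# Layer name -> (artifacts that cover it, risk if none are present).
LAYERS = {
    "goal": (
        {"goal", "non-goal", "acceptance criteria", "pre-clarifications"},
        "tests may pass without meeting user intent",
    ),
    "state": (
        {"progress log", "decision log", "git history", "milestone state"},
        "the next agent cannot know the current position",
    ),
    "governance": (
        {"tests", "review agent", "lint", "type-check", "human checkpoint"},
        "correctness cannot be verified",
    ),
}


def missing_layers(present: set[str]) -> dict[str, str]:
    """Map each uncovered layer to the risk it leaves open."""
    return {
        layer: risk
        for layer, (artifacts, risk) in LAYERS.items()
        if not (artifacts & present)
    }
```

For example, a project with only tests and git history would report the goal layer as missing. A stricter harness might require every artifact in a layer rather than just one.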

Hand‑over criteria

What is the current objective?

Which facts have been established?

Which items are still hypotheses?

Which decisions must not be altered?

Which tests prove the current state?

Where is the latest safe rollback point?

If these questions cannot be answered, the artifact is effectively a tangled context that no one can safely continue.
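One of these questions, the latest safe rollback point, can be answered mechanically if each checkpoint records whether the tests passed there. The sketch below assumes such a record exists; the `Checkpoint` type and the commit identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Checkpoint:
    commit: str         # commit hash or tag
    tests_passed: bool  # did the test suite pass at this commit?


def latest_safe_rollback(history: list[Checkpoint]) -> Optional[str]:
    """Return the most recent commit whose tests passed, or None.

    `history` is ordered from oldest to newest.
    """
    for checkpoint in reversed(history):
        if checkpoint.tests_passed:
            return checkpoint.commit
    return None


# Illustrative history: the newest commit broke the tests.
history = [
    Checkpoint("a1b2", True),
    Checkpoint("b2c3", True),
    Checkpoint("c3d4", False),
]
safe_point = latest_safe_rollback(history)
```

Returning `None` rather than guessing makes the absence of a safe point explicit, which is itself a hand‑over finding.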

Production‑grade standard

The real metric is whether the work left after many hours can be handed over to a human, a new agent, a reviewer, CI, or a future model. Traditional software engineering relies on commit messages, PRs, tests, ADRs, and changelogs to avoid dependence on a single person’s temporary context; the same principle applies to agentic engineering.

Conclusion

Codex /goal, Jarrod Watts’ long‑running‑agent‑skill, and Anthropic’s harness experiments all demonstrate that continuity alone is insufficient. A robust long‑running agent must produce an auditable, verifiable work set that can be handed over, rolled back, and replayed.

References

OpenAI Codex /goal official use case: https://developers.openai.com/codex/use-cases/follow-goals

OpenAI Codex CLI 0.128.0 release: https://github.com/openai/codex/releases/tag/rust-v0.128.0

Jarrod Watts long‑running‑agent‑skill: https://github.com/jarrodwatts/long-running-agent-skill

Anthropic: Effective harnesses for long‑running agents: https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents

Anthropic: Effective context engineering for AI agents: https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents

Anthropic: How we built our multi‑agent research system: https://www.anthropic.com/engineering/multi-agent-research-system

Block: Ralph Loop implementation (Goose docs): https://block.github.io/goose/docs/tutorials/ralph-loop/

Geoffrey Huntley interview: Inventing the Ralph Wiggum Loop: https://linearb.io/dev-interrupted/podcast/inventing-the-ralph-wiggum-loop

Andrej Karpathy on long‑running orchestrator: https://x.com/karpathy/status/2026731645169185220

Boris Cherny on sub‑agents and independent context windows (Latent Space interview, Claude Code tweet)

Matt Pocock skills repository: https://github.com/mattpocock/skills
