Deep Dive into Loop Engineering: From Prompt Engineering to System Design
Loop Engineering replaces manual prompting with designed loops that let AI agents iterate autonomously. This article covers its definition, origins, five core modules plus memory, a full-stack example, experimental results, limitations, and a comparison between Claude Code and Codex.
1. Core Definition: What Is Loop Engineering?
Loop Engineering, defined by Google Cloud AI Engineering Director Addy Osmani, means using a system you design to prompt an agent instead of prompting the agent yourself. In other words, you write a Loop rather than a Prompt. The model then works continuously, even while you sleep, repeatedly pursuing a defined goal until that goal is satisfied.
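The difference between a Prompt and a Loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how either tool implements it: `run_loop`, `step`, and `is_done` are hypothetical names, and the agent call is replaced by a stub function.

```python
from typing import Callable


def run_loop(step: Callable[[int], str],
             is_done: Callable[[str], bool],
             max_iterations: int = 10) -> tuple[str, int]:
    """Call `step` repeatedly until `is_done` accepts its output.

    `step` stands in for one agent invocation and `is_done` for the goal
    check; both are hypothetical placeholders. The iteration cap keeps an
    unattended loop from running forever.
    """
    for iteration in range(1, max_iterations + 1):
        output = step(iteration)
        if is_done(output):
            return output, iteration
    raise RuntimeError(f"goal not reached after {max_iterations} iterations")


# Stub agent: pretends to fix one failing test per iteration, starting from three.
def fake_agent(iteration: int) -> str:
    remaining = max(0, 3 - iteration)
    return f"{remaining} tests failing"


result, iterations = run_loop(fake_agent, lambda out: out.startswith("0 "))
print(result, "after", iterations, "iterations")
```

The goal check is passed in separately from the step on purpose: it leaves room for the completion judgment to come from something other than the agent that did the work.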
2. Background: Two Cognitive Leaps at Anthropic
According to Boris Cherny, the creator of Claude Code, Anthropic engineers have gone through two major shifts, with a third now beginning:
First leap (≈1.5 years ago): From writing code to writing Prompts, treating the model as a code‑generating assistant.
Second leap (ongoing): From writing Prompts to designing Loops, where a Loop orchestrates the agent instead of direct interaction.
Third leap (in progress): Toward autonomous collaboration among multiple agents, where humans only define business goals.
3. The Five Core Modules (+1 Memory Mechanism)
Addy Osmani identifies five essential components that both OpenAI Codex and Anthropic Claude Code implement, plus a persistent memory layer.
Module 1: Automation
Automation turns a Loop into a true recurring process. In Codex, tasks are created on the “Automations” tab with a project, Prompt, and schedule; results go to a triage inbox or are auto‑archived. Claude Code uses the /loop command with cron‑style intervals or lifecycle hooks, and the key /goal command runs until a user‑defined condition becomes true, delegating completion judgment to a separate small model.
Module 2: Worktree Isolation
Running multiple agents can cause file conflicts. The solution is Git worktrees: each agent works in an isolated directory on its own branch, sharing repository history but preventing cross‑writes. Codex has built‑in worktree support; Claude Code enables it with the --worktree flag and isolation: worktree configuration.
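As a rough sketch of the isolation idea, the snippet below builds one `git worktree add` command per agent. The directory layout (`<repo>-worktrees/<agent>`) and the `agent/<name>` branch naming are illustrative assumptions, not conventions of Codex or Claude Code; only the `git worktree add -b <branch> <path> <commit-ish>` form comes from Git itself.

```python
from pathlib import PurePosixPath


def worktree_command(repo: PurePosixPath, agent: str, base: str = "main") -> list[str]:
    """Build the `git worktree add` invocation for one agent.

    Each agent gets its own directory and its own branch, so two agents
    never write to the same working tree. Paths and branch names here are
    hypothetical.
    """
    path = repo.parent / f"{repo.name}-worktrees" / agent
    branch = f"agent/{agent}"
    # git -C <repo> worktree add -b <new-branch> <path> <commit-ish>
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


commands = [worktree_command(PurePosixPath("/srv/app"), name)
            for name in ("fix-ci", "fix-docs")]
for cmd in commands:
    print(" ".join(cmd))
```

Passing each list to `subprocess.run(cmd, check=True)` would create the worktrees; the sketch only prints the commands so it can run without a repository.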
Module 3: Skill
Skills encapsulate reusable intent and context so the agent does not need a full project briefing on every run. Both tools store each Skill as a folder containing a SKILL.md file with commands and metadata, plus optional scripts and assets. Skills are invoked with $ or /skills. They also solidify intent, turning one-off prompts into cumulative knowledge.
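To make the folder layout concrete, here is a small sketch that writes a Skill folder to a temporary directory and reads its metadata back. It assumes SKILL.md opens with a front matter block of simple `key: value` lines between `---` markers. The skill name, description, and steps are invented for illustration, and the parser is deliberately naive rather than a full YAML reader.

```python
import tempfile
from pathlib import Path

# Hypothetical skill content; only the file name SKILL.md comes from the article.
SKILL_TEXT = """---
name: triage-ci
description: Summarize yesterday's CI failures and rank them by impact.
---
1. Read the failing job logs.
2. Group failures by root cause.
3. Write findings to the progress file.
"""


def read_skill_metadata(skill_dir: Path) -> dict[str, str]:
    """Parse the `key: value` lines in the front matter at the top of SKILL.md."""
    lines = (skill_dir / "SKILL.md").read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md has no front matter block")
    metadata: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":          # closing marker ends the block
            return metadata
        key, _, value = line.partition(":")
        metadata[key.strip()] = value.strip()
    raise ValueError("front matter block is not closed")


with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp) / "skills" / "triage-ci"
    (skill_dir / "scripts").mkdir(parents=True)   # optional helper scripts live here
    (skill_dir / "SKILL.md").write_text(SKILL_TEXT, encoding="utf-8")
    metadata = read_skill_metadata(skill_dir)

print(metadata["name"], "-", metadata["description"])
```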
Module 4: Connectors
Connectors, built on the MCP protocol, let agents interact with issue trackers, databases, staging APIs, or Slack. Both Codex and Claude Code support MCP, so a Connector written for one often works for the other. Plugins bundle Connectors with Skills for easy distribution.
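Under the hood, MCP exchanges JSON-RPC 2.0 messages, and a tool invocation uses the `tools/call` method. The sketch below only builds such a request; it does not open a connection to any server. The tool name `create_issue` and its arguments are hypothetical, since real tool names depend on the server a Connector talks to.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry a unique id per request


def tool_call_request(tool: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request for an MCP server."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical tool exposed by an issue-tracker Connector.
request = tool_call_request("create_issue", {"title": "Flaky test in auth suite"})
print(request)
```

Because the message shape is defined by the protocol rather than by the client, the same server can answer either tool, which is why a Connector written for one often works for the other.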
Module 5: Sub‑Agent
Separating code generation from code review improves reliability. Sub-Agents are defined in files under .codex/agents/ (TOML) or .claude/agents/ (Markdown with YAML front matter). A common pattern uses an Explorer, an Implementer, and a Verifier. Sub-Agents consume more tokens but focus verification where it matters most.
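The Explorer, Implementer, and Verifier pattern can be sketched as three functions chained together. This is an illustration with stubbed roles, not either tool's API: every name below is hypothetical. The point it tries to show is that the Verifier receives only the task and the patch, not the Implementer's notes, so its judgment is formed in a separate context.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    approved: bool
    reason: str


def run_pipeline(task: str,
                 explore: Callable[[str], str],
                 implement: Callable[[str, str], str],
                 verify: Callable[[str, str], Verdict]) -> tuple[str, Verdict]:
    """Chain the three roles; the verifier never sees the explorer's notes."""
    notes = explore(task)            # Explorer: gather context
    patch = implement(task, notes)   # Implementer: draft a change from the notes
    verdict = verify(task, patch)    # Verifier: judges task + patch only
    return patch, verdict


# Stub roles standing in for real Sub-Agent invocations.
def explore(task: str) -> str:
    return "divide() has no check for a zero divisor"


def implement(task: str, notes: str) -> str:
    return "if b == 0: raise ZeroDivisionError('b must be non-zero')"


def verify(task: str, patch: str) -> Verdict:
    ok = "raise" in patch
    return Verdict(ok, "guard present" if ok else "no guard found")


patch, verdict = run_pipeline("Guard divide() against b == 0", explore, implement, verify)
print(verdict)
```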
+1 Memory Mechanism
A persistent markdown file or Linear board records what has been done and what remains, because large language models do not retain anything between runs. The memory file lives on disk, ensuring continuity even when the agent restarts.
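A minimal sketch of the markdown variant follows, assuming the memory file is a plain checklist of `- [ ]` and `- [x]` lines. The file name PROGRESS.md and the task names are made up for the example.

```python
import tempfile
from pathlib import Path


def load_tasks(path: Path) -> dict[str, bool]:
    """Read a markdown checklist into {task name: done?}."""
    tasks: dict[str, bool] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.startswith("- [x] "):
            tasks[line[6:]] = True
        elif line.startswith("- [ ] "):
            tasks[line[6:]] = False
    return tasks


def save_tasks(path: Path, tasks: dict[str, bool]) -> None:
    """Write the checklist back to disk so the next run can pick it up."""
    lines = [f"- [{'x' if done else ' '}] {name}" for name, done in tasks.items()]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    progress = Path(tmp) / "PROGRESS.md"   # hypothetical file name
    save_tasks(progress, {"fix flaky auth test": False, "update changelog": False})

    # First run: finish one task and persist the result.
    tasks = load_tasks(progress)
    tasks["fix flaky auth test"] = True
    save_tasks(progress, tasks)

    # A later run starts with no in-memory state and reads the file instead.
    remaining = [name for name, done in load_tasks(progress).items() if not done]

print(remaining)
```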
4. A Complete Loop in Practice
Every morning an automated task runs in the repository. It invokes a Triage Skill that reads yesterday’s CI failures, open issues, and recent commits, then writes findings to a markdown file or Linear board. For each actionable item, the Loop creates a separate worktree, spawns a Sub‑Agent to draft a fix, and a second Sub‑Agent to verify the draft against project Skills and tests. Connectors automatically open PRs, update tickets, and post status to Slack. Unhandled items go to a human‑review inbox. A state file tracks progress so the next run resumes where the previous one left off.
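The routing in this walkthrough can be sketched as follows. Everything here is a stand-in: `morning_run`, `draft_fix`, and `verify` are hypothetical names, the Sub-Agents are stub functions, and opening a PR is reduced to appending to a list. The sketch only shows how items are skipped, shipped, or handed to a human.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RunReport:
    pull_requests: list[str] = field(default_factory=list)
    human_inbox: list[str] = field(default_factory=list)


def morning_run(items: list[str],
                done: set[str],
                draft_fix: Callable[[str], Optional[str]],
                verify: Callable[[str, str], bool]) -> RunReport:
    """Process triaged items, skipping any the state already marks as done."""
    report = RunReport()
    for item in items:
        if item in done:
            continue                      # resume where the last run stopped
        patch = draft_fix(item)           # in practice: a Sub-Agent in its own worktree
        if patch is not None and verify(item, patch):
            report.pull_requests.append(item)
            done.add(item)                # record progress for the next run
        else:
            report.human_inbox.append(item)
    return report


# Stubs: every item gets a draft, but the verifier rejects flaky-test fixes.
items = ["typo in README", "flaky auth test", "null check in parser"]
done = {"typo in README"}
report = morning_run(items, done,
                     draft_fix=lambda item: f"patch for {item}",
                     verify=lambda item, patch: "flaky" not in item)
print(report)
```

In a real Loop the `done` set would be loaded from and written back to the state file, so that the skip in the first branch survives a restart.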
5. Self‑Correction Experiment with Claude Fable 5
Lance Martin at Anthropic ran a “Parameter Golf” experiment: train a model on eight H100 GPUs in under ten minutes, with the resulting artifact fitting within 16 MB. The Loop edited training code, launched training, polled logs, read scores, and decided the next experiment.
Key finding: having an independent verification Sub‑Agent score the output is far better than the model self‑scoring, because the scoring occurs in a separate context window. The CMA Outcomes feature automatically creates such a scoring Sub‑Agent.
Results: Fable 5 improved the training pipeline roughly six‑fold compared with Opus 4.7. Fable 5 made larger structural bets (e.g., architecture changes) and showed greater resilience, such as surviving a quantization rollback, whereas Opus 4.7 only achieved incremental scalar tweaks.
Memory usage comparison on an SQL sequential‑question task:
Sonnet 4.6 stopped after the first step, storing only failures and guesses, with little reference to prior notes.
Opus 4.7 stopped after the third step, building a partially uncertain reference model with low coverage (7‑33%).
Fable 5 completed the full path, achieving 73% verification coverage and extracting generic rules.
6. Three Things Loops Can’t Do
Verification remains the human’s responsibility: An unsupervised Loop can still produce erroneous code; the “completed” claim is a statement, not proof.
Understanding debt grows: Faster Loop output widens the gap between generated code and the developer’s mental model unless the developer reviews the results.
Inaction is a risk: A Loop that runs without judgment may accept any output, leading to cognitive surrender. Designing Loops with built‑in judgment mitigates this.
7. Claude Code vs. Codex: Tool Comparison
Both tools share the same five core modules, differing only in naming and entry points. (Image omitted for brevity.)
8. Paradigm Shift: From Prompt Engineering to Loop Engineering
The lever moves from Prompt to Loop design. Previously, a well‑crafted Prompt yielded good results; now, the quality of the designed Loop determines output quality. The same Loop can produce vastly different outcomes for different users, depending on whether they use it to deepen understanding or to avoid it.
Designing effective Loops requires deep engineering expertise and sufficient token budget, as the system must be meticulously configured and supervised.
9. Three Engineer Levels
L1: Manually write code line by line.
L2: Write Prompts for an agent to generate code (dialogue‑based output).
L3: Design Loops that let agents iterate automatically (systematic output).
Loop Engineering is the key that moves engineers from L2 to L3.
