
Andrej Karpathy’s Loop Engineering: 9 Golden Rules for Building Multi‑Day Long‑Running Agents

The article distills Andrej Karpathy’s field notes on Loop Engineering, explaining why prompt engineering is fading, how to treat loops as first‑class objects, separate agent roles, persist state to disk, negotiate contracts, and let robust loops expose and resolve bottlenecks for agents that run for days.

TonyBai

Jul 2, 2026


Developers working with large language model agents often spend late-night hours tweaking system prompts, only to watch the agents fall apart when tasks run for hours or days. Karpathy’s notes argue that the real bottleneck is not model intelligence but the design of the surrounding Harness, the scaffolding of loops, state, and checks around the model, and that “Prompt Engineering” is rapidly losing leverage to a more systematic Loop Engineering approach.

A standard loop consists of five steps: Gather, Reason, Act, Verify, Repeat. The rules that follow describe how to keep that loop reliable when it runs for days.
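To make the five steps concrete, here is a minimal Python sketch of the loop skeleton. The callables `gather`, `reason`, `act`, and `verify` and the `max_iterations` budget are illustrative assumptions, not part of Karpathy’s notes; in a real agent each would wrap model calls and tools.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class LoopState:
    iteration: int = 0
    history: list = field(default_factory=list)  # result of each Act step
    done: bool = False


def run_loop(
    gather: Callable[[LoopState], Any],
    reason: Callable[[Any], Any],
    act: Callable[[Any], Any],
    verify: Callable[[Any], bool],
    max_iterations: int = 10,  # hypothetical budget so the loop cannot spin forever
) -> LoopState:
    """Run Gather -> Reason -> Act -> Verify, repeating until Verify passes or the budget runs out."""
    state = LoopState()
    while not state.done and state.iteration < max_iterations:
        context = gather(state)       # Gather: collect what the agent needs to see
        plan = reason(context)        # Reason: decide what to do next
        result = act(plan)            # Act: edit code, call a tool, etc.
        state.done = verify(result)   # Verify: check the result independently of Act
        state.history.append(result)
        state.iteration += 1          # Repeat
    return state
```

Keeping Verify as its own callable, instead of letting Act report its own success, is what makes the role separation in the next section possible.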

Separate the Roles

Karpathy recommends three distinct roles, each with its own system prompt:

Planner: translates vague human instructions into concrete sprint specifications; never touches code.

Generator: writes all code but is forbidden from self‑scoring its output.

Evaluator: reads code diffs, runs Playwright tests, and starts with the assumption that the code contains bugs, seeking evidence to prove it.

Mixing these roles leads to the common failure mode where the model becomes both judge and contestant, producing useless “slop” code.
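One way to enforce the separation mechanically, rather than relying on prompts alone, is to give each role an explicit allow-list of actions and reject everything else. This is a sketch, not Karpathy’s implementation; the prompts and action names (`write_spec`, `edit_code`, `write_score`, and so on) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    name: str
    system_prompt: str
    allowed_actions: frozenset


# Hypothetical prompts and action names, for illustration only.
PLANNER = Role(
    "planner",
    "Turn the human request into a concrete sprint spec. Never edit code.",
    frozenset({"read_request", "write_spec"}),
)
GENERATOR = Role(
    "generator",
    "Implement the spec. Do not grade or score your own work.",
    frozenset({"read_spec", "edit_code", "run_build"}),
)
EVALUATOR = Role(
    "evaluator",
    "Assume the code has bugs. Read the diff, run the Playwright tests, and find evidence.",
    frozenset({"read_diff", "run_tests", "write_score"}),
)


def authorize(role: Role, action: str) -> None:
    """Reject any action outside the role's allow-list, so no agent is both judge and contestant."""
    if action not in role.allowed_actions:
        raise PermissionError(f"{role.name} may not perform {action!r}")
```

With this in place, a Generator that tries to record its own score fails loudly instead of quietly grading itself.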

Negotiate the Contract First

Before the Generator writes its first line, the agents must agree on what “done” means. They negotiate through markdown files on disk until they arrive at a list of testable assertions, which becomes the sole grading rubric. For a small app, a contract of about 27 assertions is reasonable; with fewer than ten, the Evaluator tends to hand out points too easily.
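The contract step can be checked mechanically. This sketch assumes, hypothetically, that contract.md lists each assertion as a markdown task item (`- [ ] ...`). It refuses to start when there are fewer than ten assertions and grades only against the listed ones.

```python
import re

# Assumption: each assertion in contract.md is a markdown task item,
# e.g. "- [ ] Login form rejects empty passwords".
ASSERTION_RE = re.compile(r"^[ \t]*- \[[ xX]\] (.+)$", re.MULTILINE)
MIN_ASSERTIONS = 10  # below this, the rubric is too loose to grade against


def parse_assertions(contract_md: str) -> list[str]:
    """Extract the testable assertions from the negotiated contract."""
    return [m.group(1).strip() for m in ASSERTION_RE.finditer(contract_md)]


def check_contract(assertions: list[str]) -> None:
    """Refuse to start generating until the contract is specific enough."""
    if len(assertions) < MIN_ASSERTIONS:
        raise ValueError(f"contract has {len(assertions)} assertions; need at least {MIN_ASSERTIONS}")


def grade(assertions: list[str], results: dict[str, bool]) -> float:
    """Score only against the contract: unlisted checks don't count, missing results count as failures."""
    passed = sum(1 for a in assertions if results.get(a, False))
    return passed / len(assertions)
```

Because `grade` ignores anything not in the contract, the Evaluator cannot award credit for work nobody agreed to.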

Write State to Disk, Not to Context

Because LLM context windows compress and forget information, Karpathy stores state in persistent files:

feature_list.json: the feature list

progress.md: the progress log

contract.md: the negotiated contract

log.md: the append-only execution log, with entries formatted as ## [YYYY-MM-DD] op | title

The agent should be able to resume after any crash simply by reading these files.
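A minimal sketch of the disk-backed state, assuming a hypothetical schema for feature_list.json (a list of objects with `name` and `passes` fields). The log entry format follows the one given above; the temp-file-and-rename write and the `resume` helper are illustrative choices, not part of the notes.

```python
import json
from datetime import date
from pathlib import Path
from typing import Optional


def append_log(state_dir: Path, op: str, title: str, day: Optional[date] = None) -> None:
    """Append one '## [YYYY-MM-DD] op | title' entry to log.md; existing entries are never rewritten."""
    if day is None:
        day = date.today()
    with open(state_dir / "log.md", "a", encoding="utf-8") as f:
        f.write(f"## [{day.isoformat()}] {op} | {title}\n")


def save_features(state_dir: Path, features: list) -> None:
    """Write to a temp file, then rename over feature_list.json (atomic on POSIX filesystems)."""
    tmp = state_dir / "feature_list.json.tmp"
    tmp.write_text(json.dumps(features, indent=2), encoding="utf-8")
    tmp.replace(state_dir / "feature_list.json")


def resume(state_dir: Path) -> dict:
    """Rebuild the agent's working state from disk alone, e.g. after a crash or with a fresh context."""
    def read(name: str) -> str:
        path = state_dir / name
        return path.read_text(encoding="utf-8") if path.exists() else ""

    raw = read("feature_list.json")
    features = json.loads(raw) if raw else []  # hypothetical schema: [{"name": ..., "passes": ...}]
    return {
        "features": features,
        "pending": [f["name"] for f in features if not f.get("passes", False)],
        "progress": read("progress.md"),
        "contract": read("contract.md"),
        "log": read("log.md").splitlines(),
    }
```

A fresh agent session would call `resume` first and work from `pending`, instead of trusting whatever survived in its context window.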

Let the Loop Restart

Newer models often prefer to discard a failing codebase and start over. When the loop detects a dead end, it should restart automatically; human intervention is needed only if the contract itself is flawed.
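Detecting a “dead end” needs a concrete signal. The sketch below assumes one: the number of passing contract assertions has not improved for `patience` consecutive iterations. `step` and `reset_workspace` are hypothetical hooks into the agent, and the thresholds are placeholders.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DeadEndDetector:
    """Flags a dead end when the passed-assertion count stops improving for `patience` iterations."""
    patience: int = 3  # hypothetical threshold; tune per task
    best: int = -1
    stale: int = 0

    def update(self, passed: int) -> bool:
        if passed > self.best:
            self.best, self.stale = passed, 0
        else:
            self.stale += 1
        return self.stale >= self.patience


def run_with_restarts(
    step: Callable[[int], int],
    reset_workspace: Callable[[], None],
    total_assertions: int,
    max_restarts: int = 3,
    max_iterations: int = 50,
) -> int:
    """Return the attempt index that satisfied the contract; restart from scratch on dead ends."""
    for attempt in range(max_restarts + 1):
        detector = DeadEndDetector()
        for _ in range(max_iterations):
            passed = step(attempt)  # one loop iteration; returns how many assertions pass
            if passed == total_assertions:
                return attempt
            if detector.update(passed):
                break  # dead end: stop patching this codebase
        reset_workspace()  # discard generated code, keep contract.md and the logs
    raise RuntimeError("restarts exhausted: escalate to a human, the contract may be flawed")
```

Escalation happens only after the restarts run out, which fits the rule that a human steps in when the contract, not the code, is the problem.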

Score the Subjective

Subjective qualities such as “taste” can be quantified by weighting four dimensions (Design, Originality, Craft, Functionality) and comparing the agent’s output against two curated sets of references: “good-taste” examples and “slop” examples. The result is a score between 0 and 1 accompanied by an explanatory paragraph.
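A sketch of how the weighting might be computed. The notes name the four dimensions but not their weights, so `WEIGHTS` below is a hypothetical choice. Calibrating each raw grade linearly between a slop reference (mapped to 0) and a good-taste reference (mapped to 1) is likewise one possible reading of the comparison, not Karpathy’s formula.

```python
# Hypothetical weights: the notes name the four dimensions but not their weights.
WEIGHTS = {"design": 0.3, "originality": 0.3, "craft": 0.2, "functionality": 0.2}


def calibrate(raw: float, slop_ref: float, good_ref: float) -> float:
    """Place a raw grade on [0, 1]: the slop reference's grade maps to 0, the good-taste reference's to 1."""
    if good_ref <= slop_ref:
        raise ValueError("the good-taste reference must outscore the slop reference")
    return min(1.0, max(0.0, (raw - slop_ref) / (good_ref - slop_ref)))


def taste_score(raw: dict, slop_refs: dict, good_refs: dict, weights: dict = WEIGHTS) -> tuple:
    """Weighted average of calibrated per-dimension grades, plus a short explanation."""
    per_dim = {d: calibrate(raw[d], slop_refs[d], good_refs[d]) for d in weights}
    score = sum(weights[d] * per_dim[d] for d in weights) / sum(weights.values())
    weakest = min(per_dim, key=per_dim.get)
    # Stub: in the real loop the Evaluator would write this explanatory paragraph.
    explanation = f"Score {score:.2f}. Weakest dimension: {weakest} ({per_dim[weakest]:.2f})."
    return score, explanation
```

Anchoring grades to reference examples keeps the number meaningful across runs: a 0.5 means “halfway between the curated slop and the curated good taste,” not an arbitrary model opinion.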

Read the Traces Like a Stack Trace

Redirect the agent’s full output to a file, then grep for the point where the agent’s judgment diverges from your intent. Adjust the prompt for that specific context and rerun, the same way developers debug by reading a stack trace.
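In Python, the search step might look like the following, roughly what `grep -n -C` does: it returns the 1-based line number of the first match plus the surrounding lines. The patterns in the usage are hypothetical examples of phrases you might look for; in practice you would pick them from your own traces.

```python
import re
from pathlib import Path
from typing import Optional


def first_divergence(trace_path: Path, patterns: list[str], context: int = 3) -> Optional[tuple[int, list[str]]]:
    """Return (1-based line number, surrounding lines) for the first line matching any pattern, or None."""
    lines = Path(trace_path).read_text(encoding="utf-8").splitlines()
    # Wrap each pattern in a non-capturing group so alternation inside one pattern stays local.
    regex = re.compile("|".join(f"(?:{p})" for p in patterns), re.IGNORECASE)
    for i, line in enumerate(lines):
        if regex.search(line):
            lo, hi = max(0, i - context), min(len(lines), i + context + 1)
            return i + 1, lines[lo:hi]
    return None
```

The context lines matter as much as the match itself: they show what the agent saw just before its judgment went wrong, which is the context the prompt fix has to address.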

Delete the Harness

The Harness exists to patch current model shortcomings. As models improve, remove any logic the model can now handle autonomously. A Harness that only grows without shrinking indicates the developer no longer understands it.
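One way to find Harness logic that has outlived its purpose is an ablation run: disable one component at a time, rerun a fixed evaluation set, and flag components whose removal barely changes the pass rate. This is a sketch; the component names, the `evaluate` callback, and the 2-point tolerance are hypothetical.

```python
from typing import Callable, Iterable


def ablate(
    components: Iterable[str],
    evaluate: Callable[[frozenset], float],  # hypothetical: pass rate with the given components disabled
    tolerance: float = 0.02,
) -> tuple:
    """Disable one harness component at a time and report which ones the model no longer needs."""
    baseline = evaluate(frozenset())  # pass rate with the full harness
    removable = []
    for name in components:
        rate = evaluate(frozenset({name}))  # pass rate with only `name` disabled
        if rate >= baseline - tolerance:
            removable.append(name)
    return baseline, removable
```

Pass rates on agent tasks are noisy, so each configuration would need several runs in practice, and one-at-a-time ablation misses components that only matter in combination.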

The Bottleneck Always Moves

When coding ceases to be the bottleneck, planning becomes one; once planning is solved, verification becomes the choke point; after verification is automated, subjective taste becomes the limiting factor. Designing loops whose sole purpose is to surface the next bottleneck keeps the system continuously improving.
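A loop can surface its own bottleneck if each stage reports what it costs. This sketch assumes hypothetical telemetry records of `(stage, seconds, failed)` and ranks stages by failure count first, then by time spent; the stage names and the ranking rule are illustrative, not from the notes.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def current_bottleneck(events: Iterable[Tuple[str, float, bool]]) -> str:
    """Return the stage holding things up; raises ValueError if there are no events."""
    failures: dict = defaultdict(int)
    seconds: dict = defaultdict(float)
    for stage, secs, failed in events:
        seconds[stage] += secs
        failures[stage] += int(failed)
    # Rank by failure count first, then by time spent, so the stage blocking progress surfaces.
    return max(seconds, key=lambda s: (failures[s], seconds[s]))
```

Running this after every batch of iterations turns “the bottleneck always moves” into something the loop reports on its own, rather than something a human notices weeks later.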

Conclusion

Karpathy’s notes mark the transition from “Prompt Engineer” to “Loop Engineer”. By separating roles, persisting state to disk, front‑loading a testable contract, and treating loops as first‑class objects, developers can build agents that run reliably for days and deliver usable products.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

AI agents LLM Prompt Engineering Harness Loop Engineering

Written by

TonyBai

Tony Bai's tech world (tonybai.com). Not satisfied with just "knowing how", we strive for mastery. Focused on Go language internals, high-quality engineering practices, and cloud-native architecture, exploring cutting-edge intersections of Go and AI. Gophers who care about technology are welcome to follow along and evolve with Go.
