Long-Run Verification: Converging AI Agents from Continuous Execution to Engineering
The article analyses experiments with Claude Code dynamic workflows and a 50‑hour timetravel‑agent prototype, exposing how long‑running AI coding tasks drift without proper verification gates and proposing a four‑step gate framework to ensure convergence, evidence collection, and reliable engineering outcomes.
Experiments with /workflows and /goal
Over a weekend the author combined Claude Code /workflows with a /goal‑driven timetravel‑agent prototype that implements a JavaScript time‑travel debugger (instrumentation, runtime recording, trace query, replay, HTTP service, React/Next demo). Dynamic workflows orchestrate parallel sub‑agents through JavaScript scripts executed by a local runtime.
Agents can work for a long time and in parallel, but the system must know when it can truly stop.
While Bun’s Zig‑to‑Rust migration shows agents can participate in large‑scale rewrites, Claude Code’s dynamic workflows embed orchestration logic in JavaScript, allowing agents to keep pushing toward a goal beyond a single conversational turn.
/goal: Long Tasks Need More Than Continuous Hand‑off
The author’s earlier /goal experiment built a timetravel‑agent that mimics Wallaby.js. Although the prototype produced a UI, README, and shallow tests, it failed to converge on the true completion criteria: trace completeness, state replay, branch and exception coverage, and semantic consistency.
The UI now shows compiled artifacts, source‑to‑instrumented mappings, probe counts, source maps, and integrity status. These are useful entry points, but they are still not full proof of correctness.
To address this, the author introduces a gate pipeline:
Goal -> Plan -> Build -> Verify -> Review -> Done
For the timetravel‑agent the gates become:
Trace Gate – checks event completeness.
Replay Gate – verifies state can be restored.
UI Gate – ensures the debugging panel truly aids problem location.
Mismatch Gate – detects semantic differences between generated and source code.
Evidence Gate – requires a bundle of commands, reports, screenshots, and failure logs.
Thus /goal should not merely keep the agent running; the gates turn continuous execution into convergent verification.
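The gate pipeline above can be sketched as an ordered list of checks that stops at the first failure. This is a minimal sketch under stated assumptions, not the author's implementation: the `RunState` fields, the `Gate` type, and the sample values are hypothetical names introduced for illustration. Only three of the five gates are shown, because the UI Gate and Mismatch Gate need human or semantic judgement that a one-line predicate cannot stand in for.

```typescript
// Hypothetical state an agent run might expose to its gates (assumed shape).
interface RunState {
  expectedEvents: number; // events the instrumentation should have emitted
  recordedEvents: number; // events actually present in the trace
  originalState: string;  // serialized state captured during the original run
  replayedState: string;  // serialized state rebuilt by replaying the trace
  evidence: string[];     // commands, reports, screenshots, failure logs
}

interface Gate {
  name: string;
  check: (state: RunState) => { passed: boolean; reason: string };
}

interface GateResult {
  gate: string;
  passed: boolean;
  reason: string;
}

const gates: Gate[] = [
  {
    name: "Trace Gate",
    check: (s) => ({
      passed: s.recordedEvents === s.expectedEvents,
      reason: `${s.recordedEvents}/${s.expectedEvents} events recorded`,
    }),
  },
  {
    name: "Replay Gate",
    check: (s) => ({
      passed: s.replayedState === s.originalState,
      reason: s.replayedState === s.originalState
        ? "replayed state matches the recorded state"
        : "replayed state differs from the recorded state",
    }),
  },
  {
    name: "Evidence Gate",
    check: (s) => ({
      passed: s.evidence.length > 0,
      reason: `${s.evidence.length} evidence item(s) attached`,
    }),
  },
];

// Run gates in order and stop at the first failure,
// so a later stage cannot mask an earlier unmet condition.
function runGates(pipeline: Gate[], state: RunState): GateResult[] {
  const results: GateResult[] = [];
  for (const gate of pipeline) {
    const outcome = gate.check(state);
    results.push({ gate: gate.name, passed: outcome.passed, reason: outcome.reason });
    if (!outcome.passed) break;
  }
  return results;
}

// "Done" means every gate ran and every gate passed.
function isDone(pipeline: Gate[], results: GateResult[]): boolean {
  return results.length === pipeline.length && results.every((r) => r.passed);
}

// Made-up sample run: the trace is complete, but replay does not restore state.
const gateResults = runGates(gates, {
  expectedEvents: 42,
  recordedEvents: 42,
  originalState: '{"count":3}',
  replayedState: '{"count":2}',
  evidence: ["trace-report.json"],
});
console.log(gateResults, isDone(gates, gateResults));
```

In this sample the pipeline halts at the Replay Gate, so the run is not reported as done even though a trace and an evidence file exist.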
/workflows: Parallel Agents Require Result Reduction
Where /goal raises the question of vertical convergence, /workflows exposes the horizontal kind. A dynamic workflow turns a long context into a JavaScript orchestration script that schedules many sub‑agents, which moves task state out of the chat window and into script variables, stage outputs, and agent results.
Typical use cases include codebase‑wide bug sweeps, 500‑file migrations, and cross‑checked research. The pattern resembles MapReduce: fan‑out sub‑agents perform search, audit, migration, and verification; the workflow then aggregates findings into patches, reports, or conclusions.
Cost analysis of a /deep‑research workflow on a Node.js permission‑model study shows a total of 3.31 M tokens, with the Verify stage consuming 1.76 M (about 53 %). The most expensive part is not search or synthesis but verification.
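The share quoted above follows directly from the two reported figures. The snippet below is only a worked check of that arithmetic; the function name `sharePercent` is introduced here for illustration, and no per-stage numbers beyond the two in the text are assumed.

```typescript
// Figures quoted in the article, in millions of tokens.
const totalTokensM = 3.31;
const verifyTokensM = 1.76;

// Share of the total spent in one stage, as a percentage with one decimal place.
function sharePercent(stageTokens: number, totalTokens: number): number {
  return Math.round((stageTokens / totalTokens) * 1000) / 10;
}

const verifyShare = sharePercent(verifyTokensM, totalTokensM);

// Everything that is not verification, rounded to two decimals.
const otherStagesM = Number((totalTokensM - verifyTokensM).toFixed(2));

console.log(`Verify: ${verifyShare}% of tokens; all other stages combined: ${otherStagesM} M`);
// prints "Verify: 53.2% of tokens; all other stages combined: 1.55 M"
```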
Parallelism alone only expands coverage; trustworthy results require claim extraction, evidence tracing, conflict merging, and failure marking before reduction.
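The reduction step described above can be sketched as follows. This is an illustrative sketch, not a Claude Code API: `AgentResult`, `ReducedClaim`, and `reduceResults` are hypothetical names, and the sample data is invented. It shows the four operations in order: failed sub-agents are marked rather than silently dropped, claims are extracted and grouped by statement, evidence is carried along with each claim, and disagreeing verdicts are surfaced as conflicts instead of being averaged away.

```typescript
// Hypothetical shape of one sub-agent's output (assumed for illustration).
interface AgentResult {
  agent: string;
  ok: boolean; // false if the sub-agent crashed or timed out
  claims: { statement: string; verdict: boolean; evidence: string }[];
}

interface ReducedClaim {
  statement: string;
  status: "supported" | "refuted" | "conflict" | "unverified";
  evidence: string[];
}

interface Reduction {
  claims: ReducedClaim[];
  failedAgents: string[];
}

function reduceResults(results: AgentResult[]): Reduction {
  // Failure marking: keep a record of every sub-agent that did not finish.
  const failedAgents = results.filter((r) => !r.ok).map((r) => r.agent);

  // Claim extraction: group verdicts and evidence by statement.
  const byStatement: Record<string, { verdicts: boolean[]; evidence: string[] }> = {};
  for (const result of results) {
    if (!result.ok) continue;
    for (const claim of result.claims) {
      const entry = byStatement[claim.statement] ?? { verdicts: [], evidence: [] };
      entry.verdicts.push(claim.verdict);
      if (claim.evidence !== "") entry.evidence.push(claim.evidence);
      byStatement[claim.statement] = entry;
    }
  }

  // Conflict merging: a claim is settled only if all verdicts agree
  // and at least one piece of evidence backs it.
  const claims: ReducedClaim[] = [];
  for (const statement of Object.keys(byStatement)) {
    const entry = byStatement[statement];
    if (!entry) continue;
    let status: ReducedClaim["status"];
    if (entry.evidence.length === 0) status = "unverified";
    else if (entry.verdicts.every((v) => v)) status = "supported";
    else if (entry.verdicts.every((v) => !v)) status = "refuted";
    else status = "conflict";
    claims.push({ statement, status, evidence: entry.evidence });
  }
  return { claims, failedAgents };
}

// Invented sample: two agents disagree on one claim, a third agent failed.
const reduction = reduceResults([
  { agent: "a1", ok: true, claims: [{ statement: "claim A", verdict: true, evidence: "notes/a1.md" }] },
  { agent: "a2", ok: true, claims: [{ statement: "claim A", verdict: false, evidence: "logs/a2.log" }] },
  { agent: "a3", ok: false, claims: [] },
]);
console.log(JSON.stringify(reduction, null, 2));
```

The output of the reduce step is a list of claims with explicit statuses plus a list of failed agents, which is what a downstream verification stage can act on; a prose summary would hide both.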
Long‑Run Verification: From “Keep Going” to “Keep Proving”
Combining the timetravel‑agent and dynamic workflows reveals two failure modes:
Vertical drift: the original goal (e.g., a time‑travel debugger) shifts to superficial milestones like UI completion or README generation, leaving trace and replay unverified.
Horizontal error expansion: dozens of sub‑agents run in parallel, increasing coverage but also amplifying uncertainty if claims are not extracted, evidence voted on, and conflicts resolved.
The most dangerous point is that agents never “fail”; they simply replace the original goal with the easiest achievable sub‑goal.
To prevent this, the author defines “long‑run verification” with four mandatory elements:
Before starting, specify a checkable done‑condition rather than a vague wish.
During execution, insert checkpoints that justify moving to the next stage.
After parallel execution, apply a reduce mechanism that turns results into evidence, not just a summary.
Before marking Done, produce an evidence bundle containing commands, reports, screenshots, failure records, and unresolved risks.
These seemingly simple steps are essential for engineering‑grade AI coding tasks; without them, longer runtimes only widen the verification gap.
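The first and last elements, a checkable done‑condition and an evidence bundle, can be sketched together. The types and file names below are hypothetical and chosen only to mirror the list above; this is one possible shape under those assumptions, not a prescribed format.

```typescript
// Hypothetical evidence bundle; field names mirror the list above.
interface EvidenceBundle {
  commands: string[];
  reports: string[];
  screenshots: string[];
  failureRecords: string[];  // may legitimately be empty
  unresolvedRisks: string[]; // may legitimately be empty
}

// A done-condition is checkable when it is a predicate over the evidence,
// not a free-text wish.
interface DoneCondition {
  description: string;
  isMet: (bundle: EvidenceBundle) => boolean;
}

// Returns everything that still blocks "Done"; an empty list means the run may stop.
function missingForDone(conditions: DoneCondition[], bundle: EvidenceBundle): string[] {
  const missing: string[] = [];
  if (bundle.commands.length === 0) missing.push("no commands recorded");
  if (bundle.reports.length === 0) missing.push("no reports attached");
  for (const condition of conditions) {
    if (!condition.isMet(bundle)) missing.push(`unmet: ${condition.description}`);
  }
  return missing;
}

// Invented example: the trace report exists, the replay report does not.
const conditions: DoneCondition[] = [
  {
    description: "trace report attached",
    isMet: (b) => b.reports.some((r) => r === "trace-report.json"),
  },
  {
    description: "replay report attached",
    isMet: (b) => b.reports.some((r) => r === "replay-report.json"),
  },
];

const bundle: EvidenceBundle = {
  commands: ["npm test"],
  reports: ["trace-report.json"],
  screenshots: [],
  failureRecords: [],
  unresolvedRisks: ["replay not yet verified"],
};

const missing = missingForDone(conditions, bundle);
console.log(missing.length === 0 ? "Done" : `Not done: ${missing.join("; ")}`);
// prints "Not done: unmet: replay report attached"
```

Returning the list of unmet items, rather than a single boolean, gives the agent a concrete next step and keeps it from substituting an easier sub‑goal.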
Conclusion
Future competition in AI coding will hinge on turning “finished” into a set of provable engineering facts. Small gate mechanisms may be the most critical units inside an agent runtime.
phodal
A prolific open-source contributor who constantly starts new projects. Passionate about sharing software development insights to help developers improve their KPIs. Currently active in IDEs, graphics engines, and compiler technologies.
