From Prompt Chains to Python State Machines: Evolving Production‑Grade AI Orchestration

This article chronicles three generations of production‑grade AI orchestration: fragile Claude Code skill chains, adversarial sub‑agent pipelines with explicit judges, and finally a deterministic Python state machine built on the Claude Agent SDK. It highlights how structured control flow, task splitting, and budget enforcement dramatically improve reliability over raw prompt‑driven workflows.

AI Waka

Background

The author reflects on the unreliability of LLM‑driven agents that behave like highly intelligent but inebriated interns: they start strong but quickly drift into hallucinations. To combat this, a three‑generation research pipeline—dubbed Harness Engineering—was built to turn AI agents into production‑grade software.

First Generation: Claude Code Skill Chain

A simple sequential chain of Claude Code skills reads the output of the previous step and produces the next artifact. It is fast, easy to run manually, and suitable for rapid prototyping, but suffers from several critical flaws:

No real retry mechanism; a single missed error propagates downstream.

No parallelism; large contexts cause "context panic" and vague outputs.

Self‑scoring leads to biased results.

State is implicit and hard to recover.

Cost is invisible until the bill arrives.

Thus, speed is achieved at the expense of correctness.
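To make the failure mode concrete, here is a minimal sketch of such a sequential chain. The skill names and the `run_skill` helper are illustrative assumptions, not the author's actual API; the point is that each step consumes the previous artifact with no validation in between, so one bad output silently corrupts everything downstream.

```python
def run_skill(name: str, input_text: str) -> str:
    # Placeholder for invoking a Claude Code skill; in the real chain this
    # would call the model and return the next artifact.
    return f"[{name} output based on: {input_text[:40]}]"

def run_chain(topic: str) -> str:
    artifact = topic
    for skill in ["research", "outline", "draft", "edit"]:
        # No retry, no validation, no cost tracking: a hallucination at any
        # step propagates unchecked to the final artifact.
        artifact = run_skill(skill, artifact)
    return artifact
```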

Second Generation: Sub‑Agent Orchestration with Judges

To improve quality, the pipeline introduced separate Doer (editor) and Judge (checker) sub‑agents. The judge enforces strict scoring, and the orchestrator retries up to three times based on the judge's feedback. Control flow remains expressed in natural‑language prompts, which introduces drift.

FOR attempt 2 and 3:
  INVOKE editor with:
    "Read your previous output. The checker found issues (JSON below).
    Fix ONLY the flagged issues. [If final attempt: ONLY critical/major]
    Checker feedback: {verdict.issues as JSON}"

CRITICAL: Send ONLY the most recent checker feedback, NOT cumulative.
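The "latest-only" rule above can be expressed in a few lines of code instead of a prompt instruction. This is a hypothetical sketch, assuming a `build_retry_prompt` helper and a verdict shaped as a dict with an `issues` list; neither name is from the article's actual pipeline.

```python
import json

def build_retry_prompt(latest_verdict: dict, final_attempt: bool) -> str:
    # Only the most recent checker verdict is serialized, never the
    # cumulative history, so the context stays small and unambiguous.
    scope = "ONLY critical/major issues" if final_attempt else "ONLY the flagged issues"
    return (
        "Read your previous output. The checker found issues (JSON below).\n"
        f"Fix {scope}.\n"
        f"Checker feedback: {json.dumps(latest_verdict['issues'])}"
    )
```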

Key advantages include clearer separation of production and validation, better accuracy, and parallel execution of independent parts. However, the prompt‑based control flow still has critical weaknesses:

Retry counters and "latest‑only" constraints can drift without explicit logging.

JSON generated by the LLM is fragile and often malformed.

State is stored only in the LLM’s memory, making recovery unreliable.

Costs remain hidden until after execution.
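The fragile-JSON problem in particular forces defensive parsing, because models routinely wrap JSON in markdown fences or surround it with prose. The helper below is an illustrative sketch of such a parser, not code from the article's pipeline.

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Best-effort extraction of a JSON object from LLM output."""
    # Strip markdown code fences if the model added them.
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Fall back to the outermost braces when prose surrounds the object.
        start, end = candidate.find("{"), candidate.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(candidate[start:end + 1])
```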

Third Generation: Claude Agent SDK Python State Machine

The final iteration replaces Markdown prompts with deterministic Python code using the Claude Agent SDK. The orchestrator now enforces retry limits, budget caps, and state persistence directly in code, eliminating the "feel‑based" control flow of earlier generations.

for attempt in range(1, 4):
    result = await editor()
    verdict = await checker(result)
    if verdict.status == "pass":
        break
else:
    # All retries exhausted without a pass: escalate loudly instead of failing silently.
    raise HumanInTheLoop("You sort this out, boss.")

Additional engineering components include:

Editor/Checker Loop that runs deterministic retries and escalates to human‑in‑the‑loop when max retries are exceeded.

PipelineState dataclass for explicit stage tracking, cost accounting, and resumable checkpoints.

Async parallel execution with explicit barriers using asyncio.gather to respect rate limits.

Budget enforcement via a CostAccumulator that aborts the run once a predefined USD limit is hit.

All decision points—whether to retry, which JSON to accept, when a stage is complete, and when the budget is exceeded—are now expressed in code rather than in ambiguous natural‑language instructions.
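The state and budget components can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the author's exact code: the field names, the $5 default cap, and the `BudgetExceeded` exception are all hypothetical.

```python
import json
from dataclasses import dataclass, field

class BudgetExceeded(Exception):
    """Raised to force a clean shutdown once the USD cap is hit."""

@dataclass
class CostAccumulator:
    limit_usd: float
    spent_usd: float = 0.0

    def add(self, cost_usd: float) -> None:
        # Every model call reports its cost here; crossing the cap aborts the run.
        self.spent_usd += cost_usd
        if self.spent_usd > self.limit_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.limit_usd:.2f} budget"
            )

@dataclass
class PipelineState:
    stage: str = "start"
    completed: list = field(default_factory=list)
    costs: CostAccumulator = field(default_factory=lambda: CostAccumulator(limit_usd=5.0))

    def checkpoint(self, path: str) -> None:
        # Persist enough state to resume after a crash instead of restarting.
        with open(path, "w") as f:
            json.dump({"stage": self.stage, "completed": self.completed,
                       "spent_usd": self.costs.spent_usd}, f)
```

Because the state lives in a dataclass rather than in the LLM's context window, a crashed run can be resumed from the last checkpoint, and the budget check is enforced on every call rather than discovered on the invoice.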

Key Lessons

Non‑determinism is the primary cause of AI project failures; control flow must be coded, not suggested.

LLMs should never self‑score; separate executor and validator agents are essential.

Task splitting reduces context size, mitigates "context rot", and improves accuracy.

Failures must be loud and escalated; silent hallucinations erode trust.

Reliability outweighs raw speed—slow but correct results preserve brand credibility.

Explicit budgeting prevents runaway costs and forces clean shutdowns.

Tags: LLM, prompt engineering, Reliability, AI Orchestration, budget enforcement, Claude Agent SDK, Python state machine
Written by

AI Waka

AI changes everything
