How a Planner‑Generator‑Evaluator Trio Enables Claude to Build Full‑Stack Apps Autonomously

The article details a GAN‑inspired three‑agent architecture—planner, generator, and evaluator—that overcomes Claude's self‑evaluation bias and context‑window limits, enabling hours‑long autonomous coding of complete front‑end and full‑stack applications with measurable cost and quality improvements.


1 Why naive approaches fall short

Earlier experiments showed that a simple harness, with an initial agent breaking product specs into tasks and a coding agent handling each task, suffered from two failure modes when tasks grew complex. First, as the context window filled, Claude exhibited "context anxiety" and lost coherence; a full context reset (clearing the window and handing state over via structured artifacts) solved this but added token cost and latency. Second, Claude consistently over-rated its own output, a self-evaluation bias that became evident whenever it was asked to judge its own work.
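The article does not show the handoff format itself; a minimal sketch of what a structured reset artifact might look like, with all field names hypothetical:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class HandoffArtifact:
    """Hypothetical state snapshot written to disk before a full context reset."""
    completed_tasks: list[str]   # tasks finished so far
    current_task: str            # what the fresh context should resume
    open_bugs: list[str]         # known issues carried forward
    decisions: dict[str, str]    # design decisions already locked in

def write_handoff(artifact: HandoffArtifact, path: str = "handoff.json") -> None:
    # Persist the snapshot so a fresh agent context can rehydrate from disk
    # instead of relying on the (now cleared) conversation window.
    with open(path, "w") as f:
        json.dump(asdict(artifact), f, indent=2)
```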

2 Front‑end design: making subjective quality scoreable

To address the evaluation bias, the author introduced a separate evaluator agent and defined four concrete scoring criteria that translate aesthetic judgments into measurable dimensions:

Design quality – cohesion of color, typography, layout, imagery, and overall atmosphere.

Originality – presence of deliberate design decisions versus template‑like defaults.

Craft – technical execution such as font hierarchy, spacing consistency, color harmony, and contrast.

Functionality – usability, discoverability of primary actions, and task completion.

The author weighted design quality and originality above craft and functionality, calibrated the evaluator with few-shot examples, and ran 5-15 iterative cycles: the generator produced HTML/CSS/JS, the evaluator (driven by Playwright MCP) interacted with the live page, scored each criterion, and fed detailed feedback back to the generator. Over multiple runs the scores rose and sometimes plateaued, and the wording of the scoring prompt (e.g., "the best designs are museum quality") directly shaped the visual style.
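The exact weights are not published in the article; a minimal sketch of how the weighted aggregate might be computed, with illustrative values only:

```python
# Illustrative weights: design quality and originality count more than
# craft and functionality, per the article; the exact values are assumptions.
WEIGHTS = {
    "design_quality": 0.35,
    "originality": 0.30,
    "craft": 0.20,
    "functionality": 0.15,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Collapse per-criterion evaluator scores (0-10) into one number."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Example evaluator output for one iteration of the generate/evaluate loop.
print(weighted_score({
    "design_quality": 7.5,
    "originality": 6.0,
    "craft": 8.0,
    "functionality": 9.0,
}))  # -> 7.375
```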

3 Extending to full‑stack development

Architecture design

The three‑agent system consists of:

Planner: expands a 1‑4 sentence prompt into a full product specification, deliberately keeping high‑level design and AI feature suggestions while avoiding premature low‑level detail.

Generator: works in sprints, each sprint taking a spec item and implementing it with React, Vite, FastAPI, and SQLite (later PostgreSQL). It self‑evaluates before handing off to QA and uses git for version control.

Evaluator: uses Playwright to browse the running app, executes UI actions, tests API endpoints, and scores each sprint against the extended front‑end criteria plus product depth, visual design, and code quality. Any score below a hard threshold marks the sprint as failed and returns concrete bug reports (a minimal sketch of such a verification pass follows this list).
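The article drives Playwright through MCP; the sketch below uses the Playwright Python API directly to show the same idea. The ports, button names, and endpoints are hypothetical:

```python
# A minimal evaluator verification pass, assuming the app runs locally on
# the Vite dev server (port 5173) with FastAPI on port 8000.
from playwright.sync_api import sync_playwright

def smoke_check() -> list[str]:
    failures: list[str] = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:5173")
        # Exercise a primary action, not just page load.
        try:
            page.get_by_role("button", name="New Project").click()
            page.get_by_text("Untitled Project").wait_for(timeout=3000)
        except Exception as exc:
            failures.append(f"New Project flow failed: {exc}")
        # Hit the backend directly as well as through the UI.
        api = page.request.get("http://localhost:8000/api/projects")
        if api.status != 200:
            failures.append(f"GET /api/projects returned {api.status}")
        browser.close()
    return failures
```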

Before each sprint, the generator and evaluator negotiate a sprint contract that defines the implementation plan and verification steps, ensuring alignment with the high‑level spec while avoiding over‑specification.
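The contract format is not specified in the article; this is a guess at the minimum a generator/evaluator pair would need to agree on, with all field names and values hypothetical:

```python
from dataclasses import dataclass

@dataclass
class SprintContract:
    spec_item: str                  # the high-level spec item this sprint covers
    implementation_plan: list[str]  # steps the generator commits to
    verification_steps: list[str]   # checks the evaluator will run
    pass_threshold: float = 7.0     # hypothetical hard failure threshold

contract = SprintContract(
    spec_item="Sprite editor with rectangle-fill tool",
    implementation_plan=[
        "Add canvas-based sprite grid component",
        "Implement click-drag rectangle fill on mouseup",
    ],
    verification_steps=[
        "Drag a rectangle and assert every covered tile is filled",
        "Reload the page and assert the sprite persists via the API",
    ],
)
```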

Running the harness

Using Claude Opus 4.5, a single‑agent run took 20 minutes and cost $9, producing a minimal but functional prototype. The full three‑agent harness required about 6 hours and $200, delivering a richly featured application with coherent design, a functional UI, and integrated Claude‑powered AI features. Cost and time were logged as follows:

Single agent: 20 min, $9

Full harness: 6 h, $200

During the full run, the planner generated a 16‑item spec spread over ten sprints, including sprite editors, animation systems, audio, AI‑assisted asset generation, and shareable export. The evaluator caught numerous bugs, such as a rectangle‑fill tool that only placed tiles at the start point, a delete‑key handler that required both selection and selectedEntityId, and a FastAPI routing conflict where /frames/reorder was shadowed by a numeric frame_id route.

Failure – The fillRectangle function was not triggered on mouseUp, leaving the area unfilled.
Failure – Delete key handling required both selection and selectedEntityId, but only the latter was set.
Failure – PUT /frames/reorder matched the numeric frame_id route, returning a 422 error.
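The third failure is a classic FastAPI pitfall: routes match in declaration order, so a parameterized path declared first swallows literal paths. A minimal reproduction and fix, with placeholder endpoint bodies (the route names come from the article; everything else is assumed):

```python
from fastapi import FastAPI

app = FastAPI()

# Buggy ordering: if the path-parameter route is declared first, a request
# to PUT /frames/reorder binds frame_id="reorder", fails int validation,
# and returns 422 before the literal route is ever considered:
#
#   @app.put("/frames/{frame_id}")
#   def update_frame(frame_id: int): ...
#
#   @app.put("/frames/reorder")
#   def reorder_frames(): ...

# Fix: declare the literal path before the parameterized one.
@app.put("/frames/reorder")
def reorder_frames():
    return {"status": "reordered"}   # placeholder body

@app.put("/frames/{frame_id}")
def update_frame(frame_id: int):
    return {"frame_id": frame_id}    # placeholder body
```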

Iterating the harness revealed that the evaluator’s feedback loop dramatically improved visual distinctiveness and functional robustness, though occasional regressions occurred when the generator pursued a new aesthetic direction.

Iterative simplification

After Opus 4.6 reduced the context-anxiety problem, the author removed the sprint structure; rather than gating every sprint, the evaluator performed a single end-of-run review. This simplification cut runtime to 3 h 50 min and cost to $124.70 for a digital audio workstation (DAW) generation task, while still catching critical gaps such as missing audio recording, incomplete clip manipulation, and effect controls exposed only as numeric sliders.

Design fidelity is excellent and AI integration works, but core DAW functions remain stubbed: no audio capture, no clip drag‑and‑drop, and only numeric sliders for effects.
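The simplified orchestration code is not shown in the article; a sketch of the control-flow change, with every agent call stubbed (all names and return values hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Review:
    score: float
    gaps: list[str]

# Stubbed agent calls; in the real harness each would be a Claude invocation.
def plan(prompt: str) -> str:
    return f"spec for: {prompt}"     # planner expands the short prompt

def generate(spec: str) -> str:
    return "./app"                   # generator builds the whole app

def evaluate(app_dir: str, spec: str) -> Review:
    return Review(score=8.2, gaps=["no audio capture"])

def run_simplified_harness(prompt: str) -> Review:
    # No per-sprint evaluation loop anymore: generate everything, review once.
    spec = plan(prompt)
    app_dir = generate(spec)
    return evaluate(app_dir, spec)
```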

Overall, the work demonstrates that separating "doing" and "judging" agents, defining explicit scoring metrics, and iteratively pruning the orchestration framework yields higher‑quality autonomous code generation, especially when the target task sits near the model’s capability boundary.

Key take‑aways: experiment with the specific model, monitor its real‑world behavior, calibrate dedicated evaluators for subjective dimensions, decompose complex tasks into specialized agents, and continuously reassess the harness as models improve.
