Boosting Claude’s Front‑End Development with a GAN‑Inspired Multi‑Agent Harness

The article details how a GAN‑inspired multi‑agent harness—combining a generator, an evaluator, and a planner—overcomes context‑window anxiety and self‑evaluation bias, enabling Claude to produce higher‑quality front‑end designs and full‑stack applications through iterative scoring, sprint contracts, and extensive cost‑benefit experiments.


Context‑window failure and reset

Long‑running agents lose coherence when the context window fills, producing two failure modes: loss of continuity on extended tasks and context anxiety, where the model prematurely wraps up work. A full context reset (clearing the window, starting a fresh agent, and passing a structured state file) eliminates both problems, unlike compaction, which only summarizes history and leaves the anxiety intact. Experiments with Claude Sonnet 4.5 showed that compaction could not sustain long tasks, so reset became a mandatory harness component despite the added token overhead.
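
To make the reset mechanism concrete, here is a minimal sketch of what such a loop could look like. The `runAgent` call, the state-file shape, and all names are illustrative assumptions, not the actual harness:

```typescript
// Hypothetical reset loop: instead of compacting history, each cycle
// starts a fresh agent whose only memory is a structured state file.
import { existsSync, readFileSync, writeFileSync } from "node:fs";

interface HarnessState {
  goal: string;        // the original task, restated verbatim
  completed: string[]; // work finished so far
  nextSteps: string[]; // concrete work remaining
  notes: string[];     // decisions the next agent must honor
}

// Stand-in for whatever SDK call actually drives the agent.
declare function runAgent(prompt: string): Promise<{ state: HarnessState; done: boolean }>;

async function runWithResets(goal: string, stateFile = "state.json"): Promise<void> {
  let state: HarnessState = existsSync(stateFile)
    ? JSON.parse(readFileSync(stateFile, "utf8"))
    : { goal, completed: [], nextSteps: [goal], notes: [] };

  for (;;) {
    // Fresh agent, empty context window: only the state file carries memory,
    // so the model never sees a nearly full window and never gets "anxious".
    const prompt = `Continue this project. Current state:\n${JSON.stringify(state, null, 2)}`;
    const result = await runAgent(prompt);
    state = result.state;
    writeFileSync(stateFile, JSON.stringify(state, null, 2)); // persist before the next reset
    if (result.done) break;
  }
}
```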

Self‑evaluation bias also appears: when the same agent judges its own output, it over‑rates mediocre results, especially for subjective tasks such as design, where binary correctness checks are unavailable. The remedy is to split the generator from a skeptical evaluator, allowing external feedback to drive concrete iteration.
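
In outline, the split is a simple loop; `generate` and `evaluate` below stand in for two separately prompted agents, and the names and scoring scale are assumptions for illustration:

```typescript
// Sketch of a generator/evaluator loop. Keeping the two roles in separate
// agent invocations prevents the generator from grading its own homework.
declare function generate(task: string, feedback: string[]): Promise<string>;
declare function evaluate(task: string, artifact: string): Promise<{ score: number; critique: string[] }>;

async function iterate(task: string, maxRounds = 15, target = 0.9) {
  let feedback: string[] = [];
  let best = { artifact: "", score: 0 };
  for (let round = 0; round < maxRounds; round++) {
    const artifact = await generate(task, feedback);
    const review = await evaluate(task, artifact); // skeptical, external judge
    if (review.score > best.score) best = { artifact, score: review.score };
    if (review.score >= target) break;
    feedback = review.critique; // external critique drives the next iteration
  }
  return best;
}
```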

Front‑end design: quantifying subjective quality

To counter this bias, a front‑end design harness was built. Two insights guided its design:

Although aesthetics cannot be reduced to a single score, a rubric encoding design principles (color harmony, typography hierarchy, etc.) gives the model a concrete evaluation basis.

Separating generation from evaluation creates a feedback loop that pushes the generator toward stronger outputs.

Four scoring criteria were defined (a sketch of the rubric follows the list):

Design Quality: visual cohesion, consistent mood, brand identity.

Originality: evidence of custom decisions versus template defaults.

Craftsmanship: technical execution (typography, spacing, contrast).

Functionality: usability independent of visual flair.
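
One way such a rubric could be encoded for the evaluator; the descriptors mirror the criteria above, but the exact shape and the 1-10 scale are illustrative assumptions:

```typescript
// Illustrative rubric: four criteria the evaluator scores independently.
interface Criterion {
  name: string;
  description: string; // the concrete basis the evaluator judges against
}

const DESIGN_RUBRIC: Criterion[] = [
  { name: "Design Quality", description: "Visual cohesion, consistent mood, brand identity." },
  { name: "Originality", description: "Custom decisions versus template defaults." },
  { name: "Craftsmanship", description: "Technical execution: typography, spacing, contrast." },
  { name: "Functionality", description: "Usability independent of visual flair." },
];

// The evaluator returns per-criterion scores plus a critique the generator
// can act on in the next iteration.
interface RubricReview {
  scores: Record<string, number>; // criterion name -> 1-10
  critique: string[];
}
```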

Each run performed 5‑15 iterations. Scores improved quickly in early iterations and then plateaued, suggesting the harness itself still has room for refinement. A notable example: after Claude was prompted to design a website for a Dutch art museum, iteration 9 produced a clean dark‑theme landing page; iteration 10 completely re‑imagined the site as a 3‑D navigable gallery using CSS perspective, a creative leap not seen in earlier runs.

Extending to full‑stack development

The generator‑evaluator loop was mapped onto the software development lifecycle, adding a third agent, the Planner, to expand a short prompt into a full product specification. The three agents operate as follows:

Planner: receives a 1‑4 sentence prompt, expands it into a high‑level spec, and deliberately avoids over‑specifying low‑level details that could cascade errors downstream.

Generator: implements one feature per sprint using a React + Vite + FastAPI + SQLite stack, commits to git, and self‑evaluates after each sprint.

Evaluator: runs Playwright tests against the live app and checks UI, API, and database state against a detailed sprint contract with hard thresholds; failed thresholds generate explicit bug feedback for the generator (a sketch of such a check follows this list).
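
As a sketch, one contract check might look like this in Playwright; the selectors, ports, and API route are hypothetical, not taken from the article:

```typescript
// Hypothetical evaluator check: the UI, the API, and the database must all
// agree after a user action, as a sprint contract threshold might demand.
import { test, expect } from "@playwright/test";

test("created track appears in the UI and persists via the API", async ({ page, request }) => {
  await page.goto("http://localhost:5173"); // Vite dev server (assumed port)
  await page.getByRole("button", { name: "Add track" }).click();
  await expect(page.getByTestId("track-row")).toHaveCount(1); // hard threshold: exactly one row

  // Cross-check the FastAPI backend (assumed route) against the UI state.
  const res = await request.get("http://localhost:8000/api/tracks");
  expect(res.ok()).toBeTruthy();
  expect(await res.json()).toHaveLength(1); // database state matches the UI
});
```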

The sprint contract forces the generator and evaluator to agree on a concrete definition of “done” before any code is written, bridging the gap between user stories and testable implementations.
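
A sprint contract could be as small as a checklist with hard pass/fail thresholds, agreed before the sprint starts; this shape is an assumption, not the article’s actual format:

```typescript
// Illustrative sprint contract: each acceptance item maps to an evaluator test,
// so "done" has a testable definition before any code is written.
interface SprintContract {
  feature: string;
  acceptance: { check: string; threshold: string }[];
}

const mixerSprint: SprintContract = {
  feature: "Track mixer with per-channel volume",
  acceptance: [
    { check: "mixer renders one fader per track", threshold: "fader count == track count" },
    { check: "moving a fader changes the gain", threshold: "gain within 0.01 of fader value" },
    { check: "mix levels survive a reload", threshold: "GET /api/mix returns the saved values" },
  ],
};
```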

Case study 1: Retro game maker

Prompt:

"Create a 2D retro game maker with a level editor, sprite editor, entity behavior, and a playable test mode."

Results:

Solo agent: 20 minutes, $9 – a partially functional game with layout waste and a critical bug (entities appeared but never responded to input).

Full harness: 6 hours, $200 – a polished interface, correct entity‑input wiring, and a playable demo.

The solo run suffered from a rigid workflow and missing input handling; the harness, guided by the evaluator, delivered a cohesive UI, a consistent visual identity, and functional gameplay.

Case study 2: Browser‑based digital audio workstation (DAW)

Prompt:

"Build a fully functional digital audio workstation (DAW) in the browser using the Web Audio API."

Stage‑by‑stage breakdown (time and cost):

Planner: 4.7 minutes, $0.46

Build (1st round): 2 h 7 min, $71.08

QA (1st round): 8.8 min, $3.24

Build (2nd round): 1 h 2 min, $36.89

QA (2nd round): 6.8 min, $3.09

Build (3rd round): 10.9 min, $5.88

QA (3rd round): 9.6 min, $4.06

Total: 3 h 50 min, $124.70

First‑round QA flagged missing core DAW interactions (drag‑dropping clips, instrument panels, visual editors), reducing the app to a showcase. Subsequent rounds added a work‑arrangement view, mixer, and transport controls, enabling the author to script a short song (tempo, key, melody, drum track, mix levels, reverb). The AI still cannot hear audio, so musical‑taste feedback remains limited.
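
For context, the Web Audio primitives such a DAW is built from are small; here is a minimal sketch of one scheduled note through a gain stage (standard browser API, heavily simplified relative to a real DAW):

```typescript
// Minimal Web Audio chain: oscillator -> gain -> speakers. A browser DAW
// composes many such nodes plus clip scheduling, mixing, and effects.
const ctx = new AudioContext();

function playNote(freq: number, startAt: number, duration: number, level = 0.5): void {
  const osc = ctx.createOscillator();
  const gain = ctx.createGain();
  osc.type = "sawtooth";
  osc.frequency.value = freq; // pitch in Hz
  gain.gain.value = level;    // mix level, like a channel fader
  osc.connect(gain).connect(ctx.destination);
  osc.start(startAt);         // sample-accurate scheduling on the audio clock
  osc.stop(startAt + duration);
}

// Schedule a short arpeggio relative to the audio clock (not setTimeout).
const t0 = ctx.currentTime + 0.1;
[261.63, 329.63, 392.0].forEach((f, i) => playNote(f, t0 + i * 0.25, 0.2));
```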

Iterating the harness: removing sprint structure

After upgrading to Opus 4.6, the sprint decomposition was dropped: the model’s native capabilities made fine‑grained sprint contracts unnecessary. The planner and evaluator remained because each still added clear value:

Without the planner, the generator under‑estimates scope and builds fewer features.

Without the evaluator, the generator omits details and leaves stubbed functionality.

The evaluator was moved to a single post‑run assessment, reducing token overhead while preserving quality checks; it is now invoked only when a task exceeds what the model can reliably handle on its own.
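
A sketch of that conditional invocation; the complexity heuristic is invented purely for illustration:

```typescript
// Run the evaluator only when the task likely exceeds what the model
// handles reliably on its own; simple tasks skip the QA pass entirely.
declare function estimateComplexity(spec: string): number; // hypothetical heuristic, 0-1
declare function runEvaluator(spec: string): Promise<string[]>; // returns bug feedback

async function maybeEvaluate(spec: string, threshold = 0.6): Promise<string[]> {
  if (estimateComplexity(spec) < threshold) return []; // trust the generator
  return runEvaluator(spec); // single post-run assessment, not per-sprint QA
}
```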

Future outlook

As models improve, they will handle longer, more complex tasks, potentially reducing the need for heavy harness scaffolding. Conversely, stronger models expand the design space for new harness combinations that exceed baseline capabilities.

Key take‑aways:

Continuously experiment with the model: trace its behavior on real problems and tune prompts to achieve the desired outcomes.

Decompose complex tasks and assign specialized agents to each sub‑problem for additional performance headroom.

Re‑evaluate harnesses when new model versions appear: strip away components that no longer add value and add new ones to unlock previously impossible capabilities.

Written by Qborfy AI

A knowledge base that logs daily experiences and learning journeys, sharing them with you to grow together.