Why Three AI Agents Beat One: Planner‑Generator‑Evaluator Architecture Explained
The article analyzes why a single AI struggles to self‑evaluate and presents Anthropic's three‑agent (Planner, Generator, Evaluator) architecture, with concrete DAW‑building examples, sprint contracts, cost‑benefit tables, and step‑by‑step processes showing how each role solves a specific problem and improves overall quality.
Problem with Single‑Agent Self‑Evaluation
Anthropic engineers discovered that when an AI both generates output and evaluates its own work, it tends to over‑rate the result because the evaluation uses the same internal expectations that guided generation, similar to an author proofreading their own text.
"When asked to evaluate its own output, the AI confidently praises it—even when human observers see clearly mediocre quality."
This issue is especially severe for subjective tasks like design or writing, where there is no objective correctness.
Three‑Agent Architecture
The solution is a triangular architecture consisting of:
Planner: expands a vague user request into a detailed product specification.
Generator: implements the specification in code, working in short, well‑defined sprints.
Evaluator: independently tests the implementation against explicit acceptance criteria.
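To make the hand‑offs concrete, here is a minimal orchestration sketch. The agent objects and their plan/implement/test/fix methods are hypothetical stand‑ins for LLM‑backed roles, not code from the article:

def run_project(user_request, planner, generator, evaluator):
    """Illustrative control flow only; the three roles are assumed interfaces."""
    spec = planner.plan(user_request)            # vague request -> detailed spec
    reports = []
    for sprint in spec.sprints:                  # short, dependency-ordered sprints
        build = generator.implement(sprint)
        report = evaluator.test(build, sprint.acceptance_criteria)
        while not report.passed:                 # feedback loop until the sprint passes
            build = generator.fix(build, report.issues)
            report = evaluator.test(build, sprint.acceptance_criteria)
        reports.append(report)
    return reports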
Planner Details
The Planner takes a short description such as "make a DAW" and performs four steps:
Break the concept into 16 concrete features (timeline, track management, AI‑assisted composition, etc.).
Group the features into 10 sprints ordered by dependencies.
Proactively add AI‑enhanced features that the user did not request.
Specify *what* to build without dictating *how* (e.g., "support drag‑and‑drop" instead of "use React DnD").
The output is a specification document containing a full feature list, sprint plan, and acceptance criteria.
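The article does not reproduce the document itself, but its shape might resemble the following; every field name here is an assumption:

daw_spec = {
    "product": "Browser-based DAW",
    "features": [
        {"id": 1, "name": "Timeline",
         "acceptance_criteria": ["Clips can be dragged along the timeline"]},
        {"id": 2, "name": "Track management",
         "acceptance_criteria": ["Tracks can be added, muted, and reordered"]},
        # ... 14 more features, including the unrequested AI-enhanced ones
    ],
    "sprints": [
        {"sprint_id": 1, "features": [1, 2], "depends_on": []},
        # ... 9 more sprints, ordered by dependencies
    ],
}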
Generator Workflow
For each sprint the Generator follows a four‑step loop:
Sprint Contract: negotiate with the Evaluator what will be built and how success is defined (a sample contract follows this list).
Implementation: write the required files (e.g., Timeline.tsx, TimelineClip.tsx, useTimelineState.ts) and unit tests.
Self‑Check: run basic tests to ensure the code compiles and core functionality works.
Feedback Loop: the Evaluator runs real‑world tests; the Generator fixes the reported issues and repeats until the sprint passes.
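Expressed with the SprintContract dataclass from the implementation section below, the contract for such a sprint might read as follows (all values are illustrative):

sprint_1 = SprintContract(
    sprint_id=1,
    feature_name="Timeline with draggable clips",
    description="Render a multi-track timeline where clips can be dragged.",
    acceptance_criteria=[{"id": "AC-1", "check": "clip follows the pointer during drag"}],
    test_scenarios=[{"id": "TS-1", "steps": ["open app", "drag first clip 200px right"]}],
    files_to_modify=["Timeline.tsx", "TimelineClip.tsx", "useTimelineState.ts"],
    dependencies=[],
    constraints=["no backend changes in this sprint"],
    estimated_complexity="medium",
)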
All DAW sprints were completed within two feedback cycles each.
Evaluator Mechanics
The Evaluator is equipped with Playwright MCP, allowing it to control a browser exactly like a human user. Its workflow:
Open the application in a browser.
Execute the test scenarios defined in the sprint contract (click, drag, switch tracks, etc.).
Capture screenshots and compare results against the acceptance criteria.
Report pass/fail with concrete evidence.
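The article's Evaluator drives the browser through Playwright MCP; as a rough stand‑in, a plain Playwright (Python) check for the clip‑drag criterion could look like this. The URL and selectors are assumptions:

from playwright.sync_api import sync_playwright

def check_clip_drag(app_url="http://localhost:3000"):
    """Sketch of one acceptance check; returns a pass/fail record with evidence."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(app_url)
        clip = page.locator(".timeline-clip").first   # hypothetical selector
        before = clip.bounding_box()
        # Drive the UI like a human user: drag the first clip onto the ruler area
        page.drag_and_drop(".timeline-clip", ".timeline-ruler")
        after = clip.bounding_box()
        page.screenshot(path="evidence_clip_drag.png")  # concrete evidence
        browser.close()
    moved = bool(before and after and after["x"] != before["x"])
    return {"criterion": "clips can be dragged", "passed": moved,
            "evidence": "evidence_clip_drag.png"}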
Example feedback from the first sprint highlighted three critical failures:
"Clips cannot be dragged on the timeline, there is no instrument UI panel, and the EQ/compressor editor is missing. These are core interactions, not edge cases."
Because the feedback was specific (feature, observed behavior, and why it matters), the Generator could make targeted fixes.
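As a data record, that three‑part structure (feature, observed behavior, why it matters) might be represented like this; the field names are assumptions:

failure_report = {
    "feature": "timeline clip drag",
    "observed": "clips cannot be dragged on the timeline",
    "why_it_matters": "this is a core interaction, not an edge case",
    "severity": "critical",
}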
Cost‑Benefit Decision Framework
Introducing multiple agents multiplies API call costs (3‑8×) and token usage (4‑10×). Anthropic provides a framework to decide when the quality gains justify the extra cost.
API Calls: single‑agent = 1×, three‑agent = 3‑8× (including feedback loops).
Token Consumption: single‑agent = baseline, three‑agent = 4‑10×.
Execution Time: single‑agent = baseline, three‑agent = 2‑5×.
Quality: for complex, subjective tasks the three‑agent approach yields high quality; for simple, deterministic tasks the gain is marginal.
Concrete numbers (using a GPT‑5.2‑Codex model) illustrate the trade‑offs:
Write a function: $0.05 vs $0.40 (≈8× cost) → negligible quality difference → not worth it.
Build a full API endpoint: $0.20 vs $1.50 (+15% quality) → worth it for important features.
Construct a complete module (e.g., DAW): $30 vs $124 (+80% usability, 60% fewer bugs) → clearly justified.
Rule of thumb: if the expected value of the output exceeds roughly ten times the additional multi‑agent cost, use the three‑agent setup.
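That rule of thumb can be written down directly. The function below and the example expected values are illustrative; only the dollar figures come from the table above:

def should_use_three_agents(expected_value, single_agent_cost, cost_multiplier=4.0):
    """Apply the ~10x rule: use three agents when the output's value
    exceeds roughly ten times the *extra* cost of the multi-agent run."""
    extra_cost = single_agent_cost * (cost_multiplier - 1)
    return expected_value > 10 * extra_cost

# $0.05 function at 8x cost: worth it only if the function is worth > $3.50
should_use_three_agents(expected_value=2.0, single_agent_cost=0.05, cost_multiplier=8)        # False
# $30 DAW module at ~4x cost: worth it if the module is worth > ~$940
should_use_three_agents(expected_value=5000.0, single_agent_cost=30.0, cost_multiplier=124/30)  # True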
When to Use Fewer Agents
Full three‑agent coordination is not always necessary; the architecture scales to match the task:
P+G (Planner → Generator): suitable when verification can be fully automated.
G+E (Generator ↔ Evaluator): suitable when the user already provides a clear specification.
P→G→E (full): for complex, zero‑to‑one projects.
G only: single‑agent for trivial, well‑defined tasks.
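A small helper can encode this decision table; the trait flags are an assumed simplification of the guidance above:

def pick_configuration(complex_task, has_clear_spec, verification_automatable):
    """Map task traits to an agent configuration (simplified heuristic)."""
    if not complex_task:
        return "G"         # trivial, well-defined task
    if has_clear_spec:
        return "G+E"       # user supplies the spec; add independent verification
    if verification_automatable:
        return "P+G"       # planning needed; automated checks replace the Evaluator
    return "P->G->E"       # complex zero-to-one project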
Implementation Example: Communication Protocol
"""
Multi‑agent communication protocol implementation
Communication method: file‑based asynchronous messaging
Data format: structured JSON
"""
import json, os, time
from pathlib import Path
from dataclasses import dataclass, asdict
from typing import Optional, List, Dict, Any
from enum import Enum
class AgentRole(Enum):
PLANNER = "planner"
GENERATOR = "generator"
EVALUATOR = "evaluator"
class MessagePriority(Enum):
NORMAL = "normal"
URGENT = "urgent"
INFO = "info"
@dataclass
class AgentMessage:
"""Message format between agents"""
msg_id: str
from_role: AgentRole
to_role: AgentRole
message_type: str
content: Dict[str, Any]
priority: MessagePriority
timestamp: float
references: List[str] = None
metadata: Dict = None
@dataclass
class SprintContract:
"""Sprint contract – agreement before work starts"""
sprint_id: int
feature_name: str
description: str
acceptance_criteria: List[Dict]
test_scenarios: List[Dict]
files_to_modify: List[str]
dependencies: List[str]
constraints: List[str]
estimated_complexity: str
planned_by: str = ""
agreed_by_generator: str = ""
agreed_by_evaluator: str = ""
status: str = "pending"
result_summary: str = ""
evaluation_report: Dict = None
iteration_count: int = 0
class AgentMessageBus:
"""File‑system based message bus for agents"""
def __init__(self, workspace_root: str):
self.workspace = Path(workspace_root)
self.dirs = {
'inbox': self.workspace / '_agent_comm' / 'inbox',
'outbox': self.workspace / '_agent_comm' / 'outbox',
'shared': self.workspace / '_agent_comm' / 'shared',
'contracts': self.workspace / '_agent_comm' / 'contracts',
'reports': self.workspace / '_agent_comm' / 'reports',
}
for d in self.dirs.values():
d.mkdir(parents=True, exist_ok=True)
def send(self, message: AgentMessage) -> str:
"""Send a message to the target role's inbox"""
target_inbox = self.dirs['inbox'] / message.to_role.value
filename = f"{message.msg_id}.json"
filepath = target_inbox / filename
with open(filepath, 'w', encoding='utf-8') as f:
json.dump(asdict(message), f, ensure_ascii=False, indent=2)
return str(filepath)
def receive(self, role: AgentRole, message_type: str = None) -> List[AgentMessage]:
"""Receive all unread messages for a role"""
inbox = self.dirs['inbox'] / role.value
messages = []
if not inbox.exists():
return messages
for filepath in sorted(inbox.glob('*.json')):
with open(filepath, 'r', encoding='utf-8') as f:
data = json.load(f)
if message_type and data.get('message_type') != message_type:
continue
msg = AgentMessage(**data)
messages.append(msg)
# move to read folder
read_dir = inbox / '_read'
read_dir.mkdir(exist_ok=True)
filepath.rename(read_dir / filepath.name)
return messages
def save_contract(self, contract: SprintContract) -> str:
"""Persist a sprint contract"""
filepath = self.dirs['contracts'] / f"sprint_{contract.sprint_id:03d}_contract.json"
with open(filepath, 'w', encoding='utf-8') as f:
json.dump(asdict(contract), f, ensure_ascii=False, indent=2)
return str(filepath)
def save_evaluation(self, report: Dict) -> str:
"""Persist an evaluation report"""
filepath = self.dirs['reports'] / f"eval_{report['sprint_id']:03d}_{report['evaluator_id']}.json"
with open(filepath, 'w', encoding='utf-8') as f:
json.dump(report, f, ensure_ascii=False, indent=2)
return str(filepath)Practical Steps to Adopt the Architecture
Define clear responsibilities for Planner, Generator, and Evaluator.
Design concrete, measurable acceptance criteria for each feature.
Equip the Evaluator with real execution capability (e.g., Playwright).
Calibrate the Evaluator using few‑shot examples that illustrate "fail" cases.
Establish a file‑based communication protocol and sprint contract template.
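Tying the checklist back to the protocol code above, a first hand‑off between roles might look like this; the IDs, paths, and message types are illustrative, and sprint_1 is the contract sketched in the Generator section:

import time

bus = AgentMessageBus("./workspace")
bus.save_contract(sprint_1)  # persisted as sprint_001_contract.json

# Planner tells the Generator that sprint 1 is ready to start
bus.send(AgentMessage(
    msg_id="msg-0001",
    from_role=AgentRole.PLANNER,
    to_role=AgentRole.GENERATOR,
    message_type="sprint_ready",
    content={"sprint_id": 1, "contract": "sprint_001_contract.json"},
    priority=MessagePriority.NORMAL,
    timestamp=time.time(),
))

# Generator polls its inbox for work
for msg in bus.receive(AgentRole.GENERATOR, message_type="sprint_ready"):
    print(f"Starting sprint {msg.content['sprint_id']} ({msg.priority.value})")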
Conclusion
The three‑agent architecture solves the core limitation of a single AI by delegating "what to do" to the Planner, "how to do it" to the Generator, and "how well it was done" to the Evaluator. This division of labor yields higher quality, especially for complex, subjective tasks, and the provided cost‑benefit framework helps teams decide when the extra expense is justified.
Next article: using trace analysis to drive continuous improvement of the Harness system.
