Why Three AI Agents Beat One: Planner‑Generator‑Evaluator Architecture Explained

The article explains why a single AI struggles to evaluate its own output, then presents Anthropic’s three‑agent (Planner, Generator, Evaluator) architecture with concrete DAW‑building examples, sprint contracts, cost‑benefit tables, and step‑by‑step processes showing how each role solves a specific problem and raises overall quality.

Qborfy AI

Problem with Single‑Agent Self‑Evaluation

Anthropic engineers discovered that when an AI both generates output and evaluates its own work, it tends to over‑rate the result because the evaluation uses the same internal expectations that guided generation, similar to an author proofreading their own text.

"When asked to evaluate its own output, the AI confidently praises it—even when human observers see clearly mediocre quality."

This issue is especially severe for subjective tasks like design or writing, where there is no objective correctness.

Three‑Agent Architecture

The solution is a triangular architecture consisting of:

Planner: expands a vague user request into a detailed product specification.

Generator: implements the specification in code, working in short, well‑defined sprints.

Evaluator: independently tests the implementation against explicit acceptance criteria.
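
The triangular flow above can be sketched as a plain control loop. This is a minimal sketch: the `plan`, `generate`, and `evaluate` callables are hypothetical stand-ins for the actual model-backed agents, not part of Anthropic's published code.

```python
from typing import Callable, Dict, List

def run_pipeline(request: str,
                 plan: Callable[[str], List[Dict]],
                 generate: Callable[[Dict, str], str],
                 evaluate: Callable[[Dict, str], Dict],
                 max_iterations: int = 3) -> List[str]:
    """Planner -> Generator <-> Evaluator loop over the planned sprints."""
    sprints = plan(request)          # Planner: vague request -> sprint specs
    results = []
    for sprint in sprints:
        feedback = ""
        for _ in range(max_iterations):
            code = generate(sprint, feedback)   # Generator implements (or fixes)
            report = evaluate(sprint, code)     # Evaluator tests independently
            if report["passed"]:
                break
            feedback = report["details"]        # precise issues feed the next cycle
        results.append(code)
    return results
```

The key property is that `evaluate` never sees the Generator's reasoning, only the artifact, which is what breaks the self-evaluation bias described above.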

Planner Details

The Planner takes a short description such as "make a DAW" and performs four steps:

Break the concept into 16 concrete features (timeline, track management, AI‑assisted composition, etc.).

Group the features into 10 sprints ordered by dependencies.

Proactively add AI‑enhanced features that the user did not request.

Specify *what* to build without dictating *how* (e.g., "support drag‑and‑drop" instead of "use React DnD").

The output is a specification document containing a full feature list, sprint plan, and acceptance criteria.
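
The resulting document might look like the following. This is an illustrative shape only; the field names and JSON layout are assumptions, not Anthropic's actual format.

```python
import json

# Hypothetical Planner output for the "make a DAW" request.
spec = {
    "product": "DAW",
    "features": [
        {"id": 1, "name": "timeline", "ai_enhanced": False},
        {"id": 2, "name": "track management", "ai_enhanced": False},
        {"id": 3, "name": "AI-assisted composition", "ai_enhanced": True},
        # ... up to 16 features in total
    ],
    "sprints": [
        {"sprint_id": 1, "features": [1], "depends_on": []},
        {"sprint_id": 2, "features": [2], "depends_on": [1]},
        # ... 10 sprints, ordered by dependencies
    ],
    # "What", not "how": no framework or library choices appear here.
    "acceptance_criteria": {
        "timeline": ["clips support drag-and-drop"],
    },
}

serialized = json.dumps(spec, indent=2)
```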

Generator Workflow

For each sprint the Generator follows a four‑step loop:

Sprint Contract: negotiate with the Evaluator what will be built and how success is defined.

Implementation: write the required files (e.g., Timeline.tsx, TimelineClip.tsx, useTimelineState.ts) and unit tests.

Self‑Check: run quick local tests to confirm the code compiles and the core functionality works.

Feedback Loop: the Evaluator runs real‑world tests; the Generator fixes the precise issues reported and repeats until the sprint passes.

All sprints for the DAW were completed after two feedback cycles per sprint.
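
The per-sprint loop can be sketched as follows; the `implement`, `self_check`, and `external_eval` callables are hypothetical placeholders for the Generator's and Evaluator's real work.

```python
from typing import Callable, Dict

def run_sprint(contract: Dict,
               implement: Callable[[Dict, str], Dict[str, str]],
               self_check: Callable[[Dict[str, str]], bool],
               external_eval: Callable[[Dict[str, str]], Dict],
               max_cycles: int = 5) -> Dict:
    """Steps 2-4 of the Generator workflow; the contract (step 1) comes in as data."""
    feedback = ""
    files: Dict[str, str] = {}
    for cycle in range(1, max_cycles + 1):
        files = implement(contract, feedback)   # step 2: write files + unit tests
        if not self_check(files):               # step 3: compiles, basics work?
            feedback = "self-check failed"
            continue
        report = external_eval(files)           # step 4: Evaluator's real-world tests
        if report["passed"]:
            return {"status": "passed", "cycles": cycle, "files": files}
        feedback = report["details"]            # precise issues drive the next cycle
    return {"status": "failed", "cycles": max_cycles, "files": files}
```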

Evaluator Mechanics

The Evaluator is equipped with Playwright MCP, allowing it to control a browser exactly like a human user. Its workflow:

Open the application in a browser.

Execute the test scenarios defined in the sprint contract (click, drag, switch tracks, etc.).

Capture screenshots and compare results against the acceptance criteria.

Report pass/fail with concrete evidence.

Example feedback from the first sprint highlighted three critical failures:

“Clips cannot be dragged on the timeline, there is no instrument UI panel, and the EQ/compressor editor is missing. These are core interactions, not edge cases.”

Because the feedback was specific (feature, observed behavior, and why it matters), the Generator could make targeted fixes.
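
That feedback format (feature, observed behavior, why it matters) is easy to make mechanical. Below is a sketch of how an Evaluator might assemble such a report; the field names are assumptions for illustration.

```python
from typing import Dict, List

def build_report(sprint_id: int, criteria: List[Dict],
                 observations: Dict[str, bool]) -> Dict:
    """Compare observed behaviour against acceptance criteria and emit
    specific pass/fail feedback (hypothetical report format)."""
    failures = []
    for c in criteria:
        if not observations.get(c["feature"], False):
            failures.append({
                "feature": c["feature"],
                "observed": c["on_failure"],      # what actually happened
                "why_it_matters": c["severity"],  # core interaction vs edge case
            })
    return {"sprint_id": sprint_id, "passed": not failures, "failures": failures}

criteria = [
    {"feature": "clip drag", "on_failure": "clips cannot be dragged on the timeline",
     "severity": "core interaction"},
    {"feature": "instrument panel", "on_failure": "no instrument UI panel",
     "severity": "core interaction"},
]
report = build_report(1, criteria, {"clip drag": False, "instrument panel": True})
```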

Cost‑Benefit Decision Framework

Introducing multiple agents multiplies API call costs (3‑8×) and token usage (4‑10×). Anthropic provides a framework to decide when the quality gains justify the extra cost.

API Calls: single‑agent = 1×; three‑agent = 3‑8× (including feedback loops).

Token Consumption: single‑agent = baseline; three‑agent = 4‑10×.

Execution Time: single‑agent = baseline; three‑agent = 2‑5×.

Quality: for complex, subjective tasks the three‑agent approach yields substantially higher quality; for simple, deterministic tasks the gain is marginal.

Concrete numbers (using a GPT‑5.2‑Codex model) illustrate the trade‑offs:

Write a function: $0.05 vs $0.40 (≈8× cost) → negligible quality difference → not worth it.

Build a full API endpoint: $0.20 vs $1.50 (+15% quality) → worth it for important features.

Construct a complete module (e.g., DAW): $30 vs $124 (+80% usability, 60% fewer bugs) → clearly justified.

Rule of thumb: if the expected value of the output exceeds roughly ten times the additional multi‑agent cost, use the three‑agent setup.
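
The rule of thumb reduces to a single comparison. The output-value figures below are illustrative assumptions, since the article prices only the runs, not the outputs.

```python
def multi_agent_worth_it(single_cost: float, multi_cost: float,
                         output_value: float) -> bool:
    """Rule of thumb from the article: use three agents when the expected
    value of the output exceeds ~10x the *additional* multi-agent cost."""
    extra_cost = multi_cost - single_cost
    return output_value > 10 * extra_cost

# Applying it to the article's price points with assumed output values:
multi_agent_worth_it(0.05, 0.40, 2.0)    # single function: likely not worth it
multi_agent_worth_it(30.0, 124.0, 5000)  # full DAW module: clearly justified
```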

When to Use Fewer Agents

If full three‑agent coordination is overkill, lighter configurations are possible:

P+G (Planner → Generator): suitable when verification can be fully automated.

G+E (Generator ↔ Evaluator): suitable when the user already provides a clear specification.

P→G→E (full): for complex, zero‑to‑one projects.

G only: a single agent for trivial, well‑defined tasks.
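
One way to encode that decision table; the predicate names and the precedence order are one reasonable reading of the list above, not a quoted rule:

```python
def choose_configuration(spec_provided: bool, verification_automatable: bool,
                         complex_task: bool) -> str:
    """Map task properties to an agent configuration (illustrative heuristic)."""
    if not complex_task:
        return "G"        # trivial, well-defined task: one agent is enough
    if spec_provided:
        return "G+E"      # the user already wrote the spec; no Planner needed
    if verification_automatable:
        return "P+G"      # automated checks stand in for the Evaluator
    return "P->G->E"      # complex zero-to-one project: full triangle
```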

Implementation Example: Communication Protocol

"""
Multi‑agent communication protocol implementation

Communication method: file‑based asynchronous messaging
Data format: structured JSON
"""
import json
from pathlib import Path
from dataclasses import dataclass, asdict
from typing import Optional, List, Dict, Any
from enum import Enum

class AgentRole(Enum):
    PLANNER = "planner"
    GENERATOR = "generator"
    EVALUATOR = "evaluator"

class MessagePriority(Enum):
    NORMAL = "normal"
    URGENT = "urgent"
    INFO = "info"

@dataclass
class AgentMessage:
    """Message format between agents"""
    msg_id: str
    from_role: AgentRole
    to_role: AgentRole
    message_type: str
    content: Dict[str, Any]
    priority: MessagePriority
    timestamp: float
    references: Optional[List[str]] = None
    metadata: Optional[Dict[str, Any]] = None

@dataclass
class SprintContract:
    """Sprint contract – agreement before work starts"""
    sprint_id: int
    feature_name: str
    description: str
    acceptance_criteria: List[Dict]
    test_scenarios: List[Dict]
    files_to_modify: List[str]
    dependencies: List[str]
    constraints: List[str]
    estimated_complexity: str
    planned_by: str = ""
    agreed_by_generator: str = ""
    agreed_by_evaluator: str = ""
    status: str = "pending"
    result_summary: str = ""
    evaluation_report: Optional[Dict] = None
    iteration_count: int = 0

class AgentMessageBus:
    """File‑system based message bus for agents"""
    def __init__(self, workspace_root: str):
        self.workspace = Path(workspace_root)
        self.dirs = {
            'inbox': self.workspace / '_agent_comm' / 'inbox',
            'outbox': self.workspace / '_agent_comm' / 'outbox',
            'shared': self.workspace / '_agent_comm' / 'shared',
            'contracts': self.workspace / '_agent_comm' / 'contracts',
            'reports': self.workspace / '_agent_comm' / 'reports',
        }
        for d in self.dirs.values():
            d.mkdir(parents=True, exist_ok=True)

    def send(self, message: AgentMessage) -> str:
        """Send a message to the target role's inbox"""
        target_inbox = self.dirs['inbox'] / message.to_role.value
        target_inbox.mkdir(parents=True, exist_ok=True)  # per-role inbox may not exist yet
        filepath = target_inbox / f"{message.msg_id}.json"
        # asdict() keeps Enum members, which json cannot serialize directly,
        # so fall back to their string values.
        with open(filepath, 'w', encoding='utf-8') as f:
            json.dump(asdict(message), f, ensure_ascii=False, indent=2,
                      default=lambda o: o.value)
        return str(filepath)

    def receive(self, role: AgentRole, message_type: str = None) -> List[AgentMessage]:
        """Receive all unread messages for a role"""
        inbox = self.dirs['inbox'] / role.value
        messages = []
        if not inbox.exists():
            return messages
        for filepath in sorted(inbox.glob('*.json')):
            with open(filepath, 'r', encoding='utf-8') as f:
                data = json.load(f)
            if message_type and data.get('message_type') != message_type:
                continue
            # JSON stores the enum fields as strings; convert them back.
            data['from_role'] = AgentRole(data['from_role'])
            data['to_role'] = AgentRole(data['to_role'])
            data['priority'] = MessagePriority(data['priority'])
            messages.append(AgentMessage(**data))
            # move to a read folder so the message is not delivered twice
            read_dir = inbox / '_read'
            read_dir.mkdir(exist_ok=True)
            filepath.rename(read_dir / filepath.name)
        return messages

    def save_contract(self, contract: SprintContract) -> str:
        """Persist a sprint contract"""
        filepath = self.dirs['contracts'] / f"sprint_{contract.sprint_id:03d}_contract.json"
        with open(filepath, 'w', encoding='utf-8') as f:
            json.dump(asdict(contract), f, ensure_ascii=False, indent=2)
        return str(filepath)

    def save_evaluation(self, report: Dict) -> str:
        """Persist an evaluation report"""
        filepath = self.dirs['reports'] / f"eval_{report['sprint_id']:03d}_{report['evaluator_id']}.json"
        with open(filepath, 'w', encoding='utf-8') as f:
            json.dump(report, f, ensure_ascii=False, indent=2)
        return str(filepath)
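
One subtlety in this protocol is the JSON round trip: `asdict` preserves `Enum` members, which `json` cannot serialize directly, and a loaded message holds plain strings until they are converted back into enums. A self-contained check of that round trip, with `Role` and `Msg` as stand-ins for the classes above:

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum

class Role(Enum):           # stand-in for AgentRole above
    PLANNER = "planner"
    GENERATOR = "generator"

@dataclass
class Msg:                  # stand-in for AgentMessage above
    msg_id: str
    from_role: Role

# Serialize: map enum members to their string values for JSON.
raw = json.dumps(asdict(Msg("m1", Role.PLANNER)), default=lambda o: o.value)

# Deserialize: convert the string back into the enum member.
data = json.loads(raw)
data["from_role"] = Role(data["from_role"])
restored = Msg(**data)
```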

Practical Steps to Adopt the Architecture

Define clear responsibilities for Planner, Generator, and Evaluator.

Design concrete, measurable acceptance criteria for each feature.

Equip the Evaluator with real execution capability (e.g., Playwright).

Calibrate the Evaluator using few‑shot examples that illustrate "fail" cases.

Establish a file‑based communication protocol and sprint contract template.
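
A sprint-contract template might be instantiated like this (the field names mirror the `SprintContract` dataclass above; the concrete values are illustrative assumptions):

```python
import json

contract = {
    "sprint_id": 1,
    "feature_name": "timeline",
    "description": "Interactive timeline with draggable clips",
    "acceptance_criteria": [
        {"id": "AC1", "criterion": "clips can be dragged on the timeline"},
    ],
    "test_scenarios": [
        {"id": "TS1", "steps": ["open app", "drag clip from bar 1 to bar 4"],
         "expected": "clip moves and snaps to bar 4"},
    ],
    "files_to_modify": ["Timeline.tsx", "TimelineClip.tsx", "useTimelineState.ts"],
    "dependencies": [],
    "constraints": ["no backend changes"],
    "estimated_complexity": "medium",
    "status": "pending",
}

serialized = json.dumps(contract, indent=2)
```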

Conclusion

The three‑agent architecture solves the core limitation of a single AI by delegating "what to do" to the Planner, "how to do it" to the Generator, and "how well it was done" to the Evaluator. This division of labor yields higher quality, especially for complex, subjective tasks, and the provided cost‑benefit framework helps teams decide when the extra expense is justified.

Next article: using trace analysis to drive continuous improvement of the Harness system.

Tags: Multi-agent, AI Architecture, cost analysis, generator, planner, sprint contract, evaluator