Choosing the Right Multi‑Agent Collaboration Pattern: Supervisor, Swarm, Mesh, or Pipeline

A single LLM agent cannot juggle research, writing, and fact‑checking at once. This article breaks down four multi‑agent collaboration patterns—Supervisor, Swarm, Pipeline, and Mesh—detailing their architectures, code examples, trade‑offs, suitable scenarios, and common pitfalls to help you pick the best fit.


Hello, I’m James. In the previous post we combined RAG, Memory, and MCP into a single LangGraph to build a production‑grade AI assistant skeleton. Several readers noted that a single agent struggles once tasks get complex; that is exactly the problem this post addresses.

Running a single agent works at first, but as requirements grow—research, writing, fact‑checking all on one agent—the prompt inflates to 800 tokens, the tool list expands to 15 items, the agent picks the wrong tool, forgets its own rules, and a single mistake cascades. This is not a prompt‑writing issue; it’s the inherent conflict of assigning multiple roles to one agent.

The solution is to let multiple agents cooperate, each handling a single responsibility. However, cooperation must have a clear structure; random coupling leads to “Multi‑Chaos” rather than a true multi‑agent system.

01 Why a Single Agent Hits a Wall

Three typical failure modes of a single agent:

Context window pollution: An agent with 12 tools consumes hundreds of tokens for tool descriptions. By step 7, key information from step 2 is pushed out of the context, causing the agent to “forget”.

Role confusion: The agent is asked to "research + write code + write summary". The system prompt contains three competing instruction sets, so research is incomplete before writing starts, and the writing tone leaks into the code.

Fault propagation: An error at step 3 contaminates steps 4‑10 because there is no isolation or independent validation; the whole chain must be rerun.

Multi‑agent design solves this by ensuring each agent does one thing and communicates through a clear interface; an error only affects a single node.

02 Supervisor Mode: One Controller, Many Specialists

Supervisor architecture: central routing and specialist division

Core idea: A central Supervisor agent receives the user request, decides which Worker (Researcher, Writer, Fact‑Checker) to invoke, gathers the result, and repeats until the task is complete. Data flow: User → Supervisor → Workers → Supervisor → Final Answer. Workers never talk to each other; all information passes through the Supervisor, which acts as both router and aggregator.

import { ChatOpenAI } from "@langchain/openai";
import { createReactAgent } from "@langchain/langgraph/prebuilt";
import { createSupervisor } from "@langchain/langgraph-supervisor";
import { tool } from "@langchain/core/tools";
import { z } from "zod";

const model = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });

const searchWeb = tool(
  async ({ query }: { query: string }) => `Search results for: ${query}`,
  { name: "search_web", description: "Search the web for information", schema: z.object({ query: z.string() }) }
);

const writeReport = tool(
  async ({ content }: { content: string }) => `# Report

${content}`,
  { name: "write_report", description: "Format findings into a structured report", schema: z.object({ content: z.string() }) }
);

const researchAgent = createReactAgent({ llm: model, tools: [searchWeb], name: "researcher", prompt: "You are a research specialist. Search for information and return factual summaries." });
const writerAgent   = createReactAgent({ llm: model, tools: [writeReport], name: "writer",   prompt: "You are a writing specialist. Format research findings into clear reports." });

const supervisorGraph = await createSupervisor({ llm: model, agents: [researchAgent, writerAgent], prompt: "You are a team supervisor. Delegate research to researcher first, then writing to writer." }).compile();

const result = await supervisorGraph.invoke({ messages: [{ role: "user", content: "Research LangGraph and write a report." }] });
console.log(result.messages.at(-1)?.content);
Supervisor execution sequence: from request dispatch to result aggregation

Trade‑offs: Strong controllability, since routing logic is centralized and a failing worker is easy to trace. The downside is that the Supervisor can become a bottleneck; if it decomposes the task incorrectly, every downstream worker receives bad instructions.

Suitable scenarios: Customer‑service ticket routing, content‑generation pipelines, code‑review workflows—where sub‑task boundaries are clear and central coordination is needed.

03 Swarm Mode: No Central Controller, Agents Pass the Ball

Swarm topology: decentralized handoff

Core idea: There is no central Supervisor. After completing its part, Agent A decides which Agent B should handle the next step and hands over both control and context. The workflow emerges from these handoffs (e.g., User → Triage → Billing or Tech → Final Answer) rather than being pre‑planned.

import { createSwarm, createHandoffTool } from "@langchain/langgraph-swarm";
import { createReactAgent } from "@langchain/langgraph/prebuilt";

const triageAgent = createReactAgent({
  llm: model,
  tools: [
    createHandoffTool({ agentName: "billing_agent", description: "Transfer to billing for payment issues" }),
    createHandoffTool({ agentName: "tech_agent",   description: "Transfer to tech support for technical issues" })
  ],
  name: "triage_agent",
  prompt: `You are a triage agent.
- Billing/payment issues → handoff to billing_agent
- Technical problems → handoff to tech_agent
- General questions → handle directly`
});

const billingAgent = createReactAgent({ llm: model, tools: [createHandoffTool({ agentName: "triage_agent", description: "Return to triage" })], name: "billing_agent", prompt: "You are a billing specialist. Return to triage if off‑topic." });
const techAgent    = createReactAgent({ llm: model, tools: [createHandoffTool({ agentName: "triage_agent", description: "Return to triage" })], name: "tech_agent",    prompt: "You are a technical support specialist. Return to triage if off‑topic." });

const swarm = createSwarm({ agents: [triageAgent, billingAgent, techAgent], defaultActiveAgent: "triage_agent" }).compile();

// recursionLimit is a per-run config option; it caps total steps and guards against handoff loops
const result = await swarm.invoke(
  { messages: [{ role: "user", content: "My payment failed but the system shows it succeeded." }] },
  { recursionLimit: 25 }
);
Swarm handoff mechanism: control and context transfer path

The handoff is performed via createHandoffTool, which transfers both control and the accumulated conversation context to the target agent in a single step, so the user experiences no interruption.
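Under the hood, a handoff tool is conceptually tiny. Here is a simplified sketch of the idea (illustrative only, not the library's actual implementation; the real createHandoffTool also appends a ToolMessage and does extra state bookkeeping, and the activeAgent key below is an assumption for the sketch):

import { tool } from "@langchain/core/tools";
import { Command } from "@langchain/langgraph";
import { z } from "zod";

// Sketch of a handoff tool: return a Command that jumps to the target agent
// node in the parent (swarm-level) graph and records the new active agent.
function simpleHandoffTool(agentName: string, description: string) {
  return tool(
    async () =>
      new Command({
        goto: agentName,        // route to the target agent...
        graph: Command.PARENT,  // ...in the parent graph, not the current subgraph
        update: { activeAgent: agentName }, // illustrative state key
      }),
    { name: `transfer_to_${agentName}`, description, schema: z.object({}) }
  );
}

Because the Command carries a state update along with the jump, the target agent resumes with the full conversation so far.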

Biggest pitfall: Infinite loops. Agent A may hand off to B, and B may hand straight back to A, creating a cycle. Passing recursionLimit: 25 in the run config, as in the example above, is a production safeguard.

Suitable scenarios: Conversational routing where the next user intent is unpredictable, such as dynamic customer‑service flows.

04 Pipeline Mode: Linear Assembly Line, Each Step Feeds the Next

Pipeline mode: serial processing and validation gates

Pipeline is the simplest and most predictable of the four. Agents form a chain: Raw Data → Extractor → Enricher → Analyzer → Reporter → Output. Each node only sees the upstream output; there is no backward communication or skipping steps.

import { StateGraph, START, END } from "@langchain/langgraph";

interface PipelineState {
  rawText: string;
  extractedEntities: string[];
  enrichedData: Record<string, unknown>[];
  analysis: string;
  finalReport: string;
}

async function extractorNode(state: PipelineState): Promise<Partial<PipelineState>> {
  const response = await model.invoke([
    { role: "system", content: "Extract key entities. Return as JSON array." },
    { role: "user",   content: state.rawText }
  ]);
  const entities = JSON.parse(response.content as string);
  return { extractedEntities: entities };
}

async function enricherNode(state: PipelineState): Promise<Partial<PipelineState>> {
  if (!state.extractedEntities.length) {
    throw new Error("Extraction failed: no entities. Pipeline stopped.");
  }
  const enriched = state.extractedEntities.map(entity => ({ entity, context: `Enriched data for: ${entity}` }));
  return { enrichedData: enriched };
}

async function analyzerNode(state: PipelineState): Promise<Partial<PipelineState>> {
  const response = await model.invoke([
    { role: "system", content: "Analyze patterns and identify key insights." },
    { role: "user",   content: JSON.stringify(state.enrichedData) }
  ]);
  return { analysis: response.content as string };
}

async function reporterNode(state: PipelineState): Promise<Partial<PipelineState>> {
  const response = await model.invoke([
    { role: "system", content: "Generate a human‑readable report." },
    { role: "user",   content: state.analysis }
  ]);
  return { finalReport: response.content as string };
}

const pipeline = new StateGraph<PipelineState>({
  channels: {
    // each null channel simply keeps the last value written to it
    rawText: null,
    extractedEntities: null,
    enrichedData: null,
    analysis: null,
    finalReport: null,
  },
})
  .addNode("extractor", extractorNode)
  .addNode("enricher",  enricherNode)
  .addNode("analyzer",  analyzerNode)
  .addNode("reporter",  reporterNode)
  .addEdge(START, "extractor")
  .addEdge("extractor", "enricher")
  .addEdge("enricher", "analyzer")
  .addEdge("analyzer", "reporter")
  .addEdge("reporter", END)
  .compile();
Pipeline error propagation: cascade contamination without validation gates

Biggest issue: Error propagation. If the first step fails, all downstream steps receive polluted data. Therefore each node must perform integrity checks, as shown in the enricherNode example, which throws an error when extraction yields no entities.

Suitable scenarios: ETL data processing, batch document handling, any deterministic workflow where step N’s output is step N+1’s input.

05 Mesh Mode: Fully Connected, Most Flexible but Hardest to Tame

Mesh topology: fully connected agents with dynamic routing

Mesh offers the greatest flexibility: any agent can call any other agent at any time, without a preset order—e.g., Research ↔ Writer ↔ Fact‑Checker ↔ Summarizer. Dynamic routing is implemented in LangGraph via the Command object.

import { Command, StateGraph, MessagesAnnotation, START, END } from "@langchain/langgraph";

async function researchNode(state: typeof MessagesAnnotation.State): Promise<Command> {
  const response = await model.invoke([
    { role: "system", content: `You are a research agent.
After research, decide who handles next.
Reply with: NEXT: writer_agent | NEXT: fact_checker | NEXT: END` },
    ...state.messages
  ]);
  const content = response.content as string;
  const match = content.match(/NEXT:\s*(\w+)/);
  const goto = match?.[1] ?? END;
  return new Command({ update: { messages: [response] }, goto: goto === "END" ? END : goto });
}

async function writerNode(state: typeof MessagesAnnotation.State): Promise<Command> {
  const response = await model.invoke([
    { role: "system", content: `You are a writer. Write based on research.
After writing, route as needed: NEXT: fact_checker | NEXT: summarizer | NEXT: END` },
    ...state.messages
  ]);
  const content = response.content as string;
  const match = content.match(/NEXT:\s*(\w+)/);
  const goto = match?.[1] ?? END;
  return new Command({ update: { messages: [response] }, goto: goto === "END" ? END : goto });
}

const meshGraph = new StateGraph(MessagesAnnotation)
  .addNode("research_agent", researchNode, { ends: ["writer_agent", "fact_checker", END] })
  .addNode("writer_agent",   writerNode,   { ends: ["fact_checker", "summarizer", END] })
  // fact_checker and summarizer nodes are wired up the same way (omitted for brevity)
  .addEdge(START, "research_agent")
  .compile();

// recursionLimit is passed per run: await meshGraph.invoke({ messages }, { recursionLimit: 50 });
Mesh full‑trace: debugging requires complete tracing infrastructure

Real advantage: Handles tasks that need dynamic decisions mid‑workflow—e.g., discovering a need for additional research after writing a draft.

Cost: Debugging is extremely painful; execution paths are unpredictable, making it hard to locate bugs among many LLM calls. In production, LangSmith tracing must be treated as non‑optional infrastructure.
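The good news is that enabling tracing is mostly configuration. A minimal setup looks like this (the environment variable names are LangSmith’s standard ones; the project name is illustrative):

// Set these before constructing any graph; every LLM call and node
// transition in the mesh then appears as one correlated trace in LangSmith.
process.env.LANGCHAIN_TRACING_V2 = "true";
process.env.LANGCHAIN_API_KEY = "<your-langsmith-api-key>";
process.env.LANGCHAIN_PROJECT = "mesh-collab"; // optional: groups runs by project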

Suitable scenarios: Complex creative collaborations, research tasks where the next step is only known during execution.

06 Horizontal Comparison of the Four Modes

Supervisor – Central routing; ★★★★★ predictability, ★★☆☆☆ debugging difficulty; best for clear‑boundary tasks needing coordination.

Swarm – Peer‑to‑peer handoff; ★★★☆☆ predictability, ★★★☆☆ debugging difficulty; fits unpredictable conversational routing.

Pipeline – Linear serial; ★★★★★ predictability, ★☆☆☆☆ debugging difficulty; ideal for deterministic ETL‑style pipelines.

Mesh – Fully connected, dynamic; ★★☆☆☆ predictability, ★★★★★ debugging difficulty; suited to highly dynamic, creative workflows.

07 Common Pitfalls

Pitfall 1: Supervisor becomes a new monolithic agent – If the Supervisor prompt exceeds ~300 words, it is taking on business logic. Fix: Supervisor should only make routing decisions; keep its prompt short.

Pitfall 2: Swarm handoff loops – Agents may hand off back and forth endlessly. Fix: Pass recursionLimit: 25 in the run config and give each agent explicit "I do not handle X" instructions.

Pitfall 3: Pipeline "partial success" illusion – Downstream agents may generate plausible output from incomplete upstream data, causing hidden failures. Fix: Each node validates not just format but also presence of required fields.
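A sketch of such a validation gate using zod (the field names match the Pipeline example above; this is one possible implementation, not the only one):

import { z } from "zod";

// Validate substance, not just shape: every record must carry non-empty
// required fields before the pipeline is allowed to continue.
const EnrichedRecord = z.object({
  entity: z.string().min(1),
  context: z.string().min(1),
});
const EnrichedBatch = z.array(EnrichedRecord).nonempty();

function validationGate(data: unknown) {
  const parsed = EnrichedBatch.safeParse(data);
  if (!parsed.success) {
    throw new Error(`Validation gate failed: ${parsed.error.message}`);
  }
  return parsed.data; // typed, verified records
}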

Pitfall 4: Cost explosion unnoticed – A single Supervisor call plus three Worker calls equals at least four LLM requests; retries can push it to ten. Fix: Use a strong model for the Supervisor and cheaper models for Workers; tag each agent to track token usage and set budget alerts.
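In code, the split can be as simple as binding a different model per role and tagging each one so traces and token dashboards can attribute spend (the model choices and tag names below are illustrative):

import { ChatOpenAI } from "@langchain/openai";

// Strong model for routing decisions, cheap model for the workers; the tags
// let LangSmith or any callback-based meter attribute token usage per agent.
const supervisorModel = new ChatOpenAI({ model: "gpt-4o", temperature: 0 })
  .withConfig({ tags: ["agent:supervisor"] });
const workerModel = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 })
  .withConfig({ tags: ["agent:worker"] });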

Pitfall 5: Mesh without tracing – Multiple agents make 12 LLM calls without correlated IDs, making debugging impossible. Fix: Integrate LangSmith Tracing before production.

Conclusion

Supervisor is the default starting point: centralized control makes routing clear and troubleshooting easy, so it should be considered first.

Swarm fits scenarios where dialogue flow is unpredictable; handoff is key but requires a recursion limit.

Pipeline is the simplest choice for linear tasks; keep each node’s integrity checks to prevent error propagation.

Mesh offers maximal flexibility for dynamic collaborations, but the debugging cost is high; proper tracing is mandatory before production.

Cost grows with the complexity of the chosen pattern: use strong models for Supervisors, weaker models for Workers, and perform load testing before deployment.

Next, the second article in the Multi‑Agent series will dive into LangGraph Swarm, explaining how createHandoffTool works under the hood and sharing production‑grade pitfalls.
