How to Test Multi‑Agent Systems? Mock LLM and Graph Replay Explained
The article explains why Multi‑Agent systems are hard to test—LLM output randomness, cross‑node state propagation, and tool side‑effects—and presents a systematic solution built on mock LLMs, MemorySaver checkpoints with graph replay, tool stubs, and a three‑layer testing pyramid, closing with common pitfalls and best practices.
Why Multi‑Agent testing is hard: three fundamental obstacles
Multi‑Agent systems face three core testing challenges that differ from ordinary business logic: (1) LLM output randomness—identical prompts to GPT‑4 produce different responses, breaking the assumption of deterministic unit tests; (2) cross‑node state propagation—agents share a mutable State that must be constructed for each node test; (3) tool call side‑effects—real tools (search APIs, database writes, email sends) can pollute production data if not isolated.
Resolving these obstacles requires three techniques: mock LLMs to make output predictable, fixed State construction to test nodes independently, and tool stubs to isolate side‑effects.
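Before reaching for any framework, the scripted-playback idea behind a mock LLM can be shown in plain TypeScript. A minimal sketch (`ScriptedModel` and its `invoke` method are illustrative names for this article, not LangChain APIs):

```typescript
// Minimal illustration of a scripted "mock LLM": it returns canned responses
// in order, making every test run fully deterministic.
// ScriptedModel/invoke are illustrative names, not LangChain APIs.
class ScriptedModel {
  private queue: string[];

  constructor(responses: string[]) {
    this.queue = [...responses]; // copy so the caller's array is untouched
  }

  async invoke(_prompt: string): Promise<string> {
    const next = this.queue.shift();
    if (next === undefined) {
      throw new Error("ScriptedModel: no scripted responses left");
    }
    return next;
  }
}

// Identical prompts now yield a predictable, scripted sequence.
const model = new ScriptedModel([
  JSON.stringify({ next: "ResearchAgent" }),
  "Final Answer: use hybrid retrieval."
]);
```

LangChain's fake models described below follow the same pattern, but implement the real chat-model interface so they can be dropped into an agent unchanged.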
Mock LLM with LangChain
LangChain provides FakeListChatModel and FakeChatModel, which act like scripted “playback machines”. They implement the same interface as a real LLM but return a predefined sequence of responses, enabling 100% deterministic tests.
import { FakeListChatModel, FakeChatModel } from "@langchain/core/utils/testing";
import { AIMessage } from "@langchain/core/messages";

// Scenario A: text-only responses (Supervisor → ResearchAgent → WriterAgent)
const fakeLLM = new FakeListChatModel({
  responses: [
    JSON.stringify({ next: "ResearchAgent" }),
    "Research complete: the RAG optimization plan includes Rerank, hybrid retrieval, and chunking strategies...",
    "Final Answer: based on the research, the recommendation is the hybrid retrieval approach."
  ]
});

// Scenario B: tool-call support
const fakeLLMWithToolCall = new FakeChatModel({
  responses: [
    new AIMessage({ content: "", tool_calls: [{ name: "searchWeb", args: { query: "RAG optimization" }, id: "call_001", type: "tool_call" }] }),
    new AIMessage({ content: "According to the search, the answer is RAG + Rerank." })
  ]
});

// Replace the real LLM in the agent
const testAgent = createReactAgent({ llm: fakeLLM, tools: [mockSearchTool] });
const result = await testAgent.invoke({ messages: [{ role: "user", content: "How do I optimize RAG?" }] });
expect(result.messages.at(-1)?.content).toContain("Rerank");

Choosing the right mock depends on the test scenario: use FakeListChatModel for pure text output (simpler code) and FakeChatModel when you need to verify tool‑call flows.
MemorySaver and graph replay for multi‑turn testing
MemorySaver stores checkpoints of the graph’s State in process memory and is automatically cleaned up after the test, eliminating external dependencies.
Graph replay allows a bug that occurs at node 4 of a 5‑node graph to be re‑executed without rerunning the entire graph: restore the checkpoint from node 3 and run only nodes 4 and 5.
import { MemorySaver } from "@langchain/langgraph";
import { createSupervisorGraph } from "../src/supervisor-graph";

// Multi-turn conversation test
const checkpointer = new MemorySaver();
const graph = createSupervisorGraph({ llm: buildFakeLLM(), checkpointer });
const config = { configurable: { thread_id: `test-${Date.now()}` } };

await graph.invoke({ messages: [{ role: "user", content: "Research RAG optimization" }] }, config);
const state = await graph.getState(config);
expect(state.values.nextAgent).toBe("ResearchAgent"); // first-turn routing correct

// Replay from a failing node: getStateHistory returns an async iterator,
// so collect the snapshots before searching them
const history = [];
for await (const snapshot of graph.getStateHistory({ configurable: { thread_id: "prod-crashed-abc" } })) {
  history.push(snapshot);
}
const lastGoodCheckpoint = history.find(h => h.next.includes("writerAgent"));

// Replay on the original thread from the chosen checkpoint (forks a new branch)
const replayResult = await graph.invoke(null, {
  configurable: {
    thread_id: "prod-crashed-abc",
    checkpoint_id: lastGoodCheckpoint?.config?.configurable?.checkpoint_id
  }
});
expect(replayResult.messages.at(-1)?.content).not.toContain("Error");

Tool stubs and node unit tests
A tool stub replaces a real tool with a mock object that has the same name but deterministic behavior and records call counts for assertions.
Node unit tests invoke a single agent node directly with a fixed State, bypassing the full graph, and verify both state changes and tool usage.
import { tool } from "@langchain/core/tools";
import { z } from "zod";
import { researchAgentNode } from "../src/nodes/research-agent";

let searchCallCount = 0;
const stubSearchTool = tool(async ({ query }) => {
  searchCallCount++;
  return `Fixed result for "${query}": RAG optimization has three strategies...`;
}, {
  name: "searchWeb",
  description: "Search the internet",
  schema: z.object({ query: z.string() })
});

describe("ResearchAgent node", () => {
  it("writes search result to state and routes to WriterAgent", async () => {
    const outputState = await researchAgentNode(
      { messages: [], task: "research RAG", research: null, nextAgent: "ResearchAgent" },
      { llm: fakeLLM, tools: [stubSearchTool] }
    );
    expect(outputState.research).toContain("RAG");
    expect(outputState.nextAgent).toBe("WriterAgent");
    expect(searchCallCount).toBe(1);
  });

  it("records error and returns to Supervisor when tool fails", async () => {
    const failingTool = tool(async () => { throw new Error("API timeout"); }, {
      name: "searchWeb",
      description: "...",
      schema: z.object({ query: z.string() })
    });
    const outputState = await researchAgentNode(
      { messages: [], task: "...", research: null, nextAgent: "ResearchAgent" },
      { llm: fakeLLM, tools: [failingTool] }
    );
    expect(outputState.errors).toContain("API timeout");
    expect(outputState.nextAgent).toBe("Supervisor");
  });
});

Three‑layer testing pyramid
Combining the above tools yields a three‑layer testing strategy:
Node unit tests – >100 tests, < 100 ms each, run on every commit.
Graph integration tests – 20‑30 tests, run per PR, verify routing logic.
E2E tests – 3‑5 tests, run daily, use real LLM and real tools for smoke testing.
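One way to realize this split is to gate the slower layers behind environment flags so only fast unit tests run by default. A sketch under assumed conventions (the `RUN_INTEGRATION` and `RUN_E2E` flags are this article's invention, not anything LangGraph prescribes):

```typescript
// Decide which test layers run in the current environment.
// The RUN_INTEGRATION / RUN_E2E flag names are an assumed convention.
type Layer = "unit" | "integration" | "e2e";

function enabledLayers(env: Record<string, string | undefined>): Layer[] {
  const layers: Layer[] = ["unit"]; // fast node unit tests always run
  if (env.RUN_INTEGRATION === "1" || env.CI === "true") {
    layers.push("integration"); // per-PR graph routing tests
  }
  if (env.RUN_E2E === "1") {
    layers.push("e2e"); // real LLM + real tools: explicit opt-in only
  }
  return layers;
}
```

A runner can then use this to skip suites, e.g. mapping each layer to a separate test script or a `describe.skip` guard.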
// Integration test example – verify routing to ResearchAgent
const supervisorLLM = new FakeListChatModel({
  responses: [JSON.stringify({ next: "ResearchAgent" })]
});
// supervisorGraph is assumed to be built with supervisorLLM as its routing model
const config = { configurable: { thread_id: `routing-${Date.now()}` } };
await supervisorGraph.invoke({ messages: [{ role: "user", content: "Help me search the LangGraph docs" }] }, config);
const state = await supervisorGraph.getState(config);
expect(state.values.lastExecutedAgent).toBe("ResearchAgent");
expect(state.values.writerDraft).toBeUndefined(); // WriterAgent should not fire

Common pitfalls
Five frequent mistakes and their remedies:
Responses exhausted – FakeListChatModel throws when its responses array is empty. Print the execution trace to count LLM calls and provide enough scripted responses.
thread_id leakage – sharing the same thread_id across tests leaks state. Generate a unique ID each run, e.g., test-${Date.now()}-${Math.random()}.
Stub tool name mismatch – the stub’s name must exactly match the real tool’s name; otherwise the agent reports “tool not found”.
Only asserting final output – use graph.getState() to check intermediate state and pinpoint the failing node.
Testing only the happy path – include scenarios for tool timeouts, malformed LLM responses, and node exceptions to reflect production failures.
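The first pitfall (responses exhausted) can be caught before the graph even runs by validating the script length up front. A sketch with a hypothetical helper (`buildResponses` is not a LangChain API):

```typescript
// Hypothetical helper: fails fast, with an actionable message, when the
// scripted response list is shorter than the number of LLM calls the
// graph is expected to make.
function buildResponses(expectedCalls: number, responses: string[]): string[] {
  if (responses.length < expectedCalls) {
    throw new Error(
      `Scripted ${responses.length} responses but the graph will make ` +
      `${expectedCalls} LLM calls; add ${expectedCalls - responses.length} more.`
    );
  }
  return responses;
}
```

Counting expected calls from a printed execution trace, then asserting it here, turns a confusing mid-test crash into an immediate, self-explanatory failure.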
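For the thread_id leakage pitfall, a tiny factory makes the unique-ID convention impossible to forget. A sketch (the helper name is ours):

```typescript
// Generate a collision-resistant thread_id per test run so checkpoint state
// from one test can never leak into another via a shared thread.
function uniqueThreadId(prefix = "test"): string {
  return `${prefix}-${Date.now()}-${Math.random().toString(36).slice(2, 10)}`;
}

// Usage: const config = { configurable: { thread_id: uniqueThreadId() } };
```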
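The malformed-LLM-response scenario from the last pitfall is cheap to make testable if routing decisions go through a defensive parser. A sketch assuming the `{ next: string }` routing payload used earlier in this article (`parseRouting` is an illustrative helper, not a library function):

```typescript
// Defensive parse of the supervisor's routing decision: a malformed LLM
// response falls back to a safe default route instead of crashing the node.
// Assumes the { next: "AgentName" } payload shape used in the examples above.
function parseRouting(raw: string, fallback = "Supervisor"): string {
  try {
    const parsed = JSON.parse(raw);
    if (typeof parsed?.next === "string" && parsed.next.length > 0) {
      return parsed.next;
    }
  } catch {
    // malformed JSON: fall through to the fallback route
  }
  return fallback;
}
```

Unit tests can then feed the parser broken JSON, empty strings, and payloads missing `next`, covering the failure modes the happy path never exercises.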
Conclusion
Mock LLMs (FakeListChatModel / FakeChatModel) are the cornerstone of stable Multi‑Agent testing, turning nondeterministic LLM output into a predictable script.
Graph replay with MemorySaver drastically cuts bug‑reproduction cost by restoring a checkpoint and re‑executing only the failing segment.
The three‑layer pyramid distributes testing effort: fast node unit tests, medium‑speed graph integration tests, and slower E2E smoke tests, preventing over‑reliance on any single layer.
Unique thread_id generation and exact tool‑name matching are the two details that cause the most failures; handling them early saves extensive debugging time.
Future work will explore Agentic RAG, where agents decide whether and how often to retrieve information, turning RAG from passive querying into proactive reasoning.
James' Growth Diary
I am James, focusing on AI Agent learning and growth. I continuously update two series: “AI Agent Mastery Path,” which systematically outlines the core theories and practices of agents, and “Claude Code Design Philosophy,” which analyzes the design thinking behind top AI tools, helping you build a solid foundation in the AI era.