Artificial Intelligence 26 min read

How to Build Multi‑Agent Collaboration Systems with AutoGen, CrewAI, and a Custom Orchestration Framework

This article walks through the design, pitfalls, and best‑practice architecture of multi‑agent LLM workflows, comparing AutoGen, CrewAI, and a home‑grown orchestration stack, and provides concrete code, evaluation metrics, and selection guidance for production use.

MaGe Linux Operations

Jun 21, 2026

How to Build Multi‑Agent Collaboration Systems with AutoGen, CrewAI, and a Custom Orchestration Framework

Problem observed in a naïve three‑agent research assistant

A bug caused the workflow to run 47 rounds, consuming >8000 tokens and costing $14 before being stopped. The failure modes were:

Infinite loop between Reviewer and Researcher.

Hallucinated citation that the Reviewer accepted.

Tool‑call timeout without retry or fallback.

Context explosion: each round appended the full history, reaching >8000 tokens by round 30.

No mechanism for human intervention.

Why a multi‑agent system?

Tasks that require distinct roles, tool integration, iterative refinement, parallel sub‑tasks, robust failure handling, and optional human‑in‑the‑loop cannot be solved reliably with a single LLM call.

Single‑agent vs multi‑agent example

Single‑agent prompt (fails):

你是一个 AI 研究员，请联网搜索 Qwen 和 Llama 3 的最新评测，然后对比它们的中文能力，输出 Markdown 格式的报告。

Issues:

Model knowledge cutoff; only one search per call.

No role separation; all work is done sequentially.

Cannot run searches in parallel.

Multi‑agent solution :

Planner splits the request into three parallel searches (Qwen, Llama 3, benchmark list).

Two Researcher agents execute the searches concurrently.

Writer merges the three results.

Reviewer validates facts and asks for revisions.

Result: higher quality, lower latency, and controlled token usage—provided the orchestration is correctly designed.

Five‑layer orchestration model

Interface Layer : Web UI / API / Slack / CLI – initiates tasks, shows progress, collects feedback.

Orchestration Layer : State machine, task queue, progress tracking; decides the next agent, handles termination and human‑in‑the‑loop.

Message Bus : Redis Streams, Kafka, or in‑memory queue – publishes task, tool‑call, and state‑change events.

Agent Pool : Each agent = LLM + system prompt + tools + memory (Researcher, Writer, Reviewer, …).

Tool & Data Layer : Search APIs, databases, code execution, file system; registered in a ToolRegistry with timeout, retries, and fallback logic.

Full task flow example (research‑assistant)

[User] "调研国内 LLM 推理框架"
↓
[Orchestrator] 创建 Task (state=initialized)
↓
[Orchestrator] 调用 Planner → 生成子任务: 搜索 vLLM、搜索 TGI、搜索 TensorRT‑LLM
↓
[Orchestrator] 将子任务放入 Message Bus (并行)
↓
[Researcher #1] 执行 vLLM 搜索 → vllm_research.md
[Researcher #2] 执行 TGI 搜索 → tgi_research.md
[Researcher #3] 执行 TensorRT‑LLM 搜索 → trtllm_research.md
↓ (等待全部完成)
[Orchestrator] 触发 Writer → 合并结果 → draft_v1.md
↓
[Orchestrator] 触发 Reviewer → 验证 draft (最多 2 次修订)
↓
[Orchestrator] 完成，返回最终报告给用户

Key code & configuration

AutoGen (code‑generation focus)

# autogen_code_review.py
import autogen

config_list = [{"model": "gpt-4o", "api_key": "..."}]

# Developer agent
developer = autogen.AssistantAgent(
    name="Developer",
    llm_config={"config_list": config_list},
    system_message="""你是 Python 开发者。接到需求后写代码、跑测试。"""
)

# Reviewer agent
reviewer = autogen.AssistantAgent(
    name="Reviewer",
    llm_config={"config_list": config_list},
    system_message="""你是代码审查员。审 Developer 的代码，指出 bug、性能问题、安全问题。
通过标准：所有测试通过 + 无明显安全问题 + 无 O(n³) 以上复杂度。"""
)

# UserProxy enables optional human input
user_proxy = autogen.UserProxyAgent(
    name="UserProxy",
    human_input_mode="TERMINATE",
    code_execution_config={"work_dir": "code_workspace"},
    max_consecutive_auto_reply=5
)

# Prevent infinite loops
groupchat = autogen.GroupChat(
    agents=[user_proxy, developer, reviewer],
    messages=[],
    max_round=8,
    speaker_selection_method="round_robin"
)

manager = autogen.GroupChatManager(groupchat=groupchat, llm_config={"config_list": config_list})

user_proxy.initiate_chat(manager, message="实现一个函数：输入 JSON，输出每个 key 的深度")

AutoGen key settings max_round – hard limit to avoid endless loops. speaker_selection_method="round_robin" – more deterministic than "auto". max_consecutive_auto_reply – caps how many times a single agent can reply without human input. human_input_mode="TERMINATE" or "ALWAYS" – enables human‑in‑the‑loop.

CrewAI (report‑generation focus)

# crewai_research.py
from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool, FileReadTool

search_tool = SerperDevTool(api_key="...")
file_tool = FileReadTool()

researcher = Agent(
    role="高级研究员",
    goal="找到关于 {topic} 的最新、最准确的信息",
    backstory="你是一名资深技术分析师，擅长从海量资料中提炼关键信息。",
    tools=[search_tool, file_tool],
    verbose=True,
    allow_delegation=False
)

writer = Agent(
    role="技术作家",
    goal="把研究成果写成清晰、结构化的 Markdown 报告",
    backstory="你是一名技术作家，擅长把复杂概念讲清楚。",
    verbose=True
)

reviewer = Agent(
    role="主编",
    goal="审稿、纠错、确保报告质量",
    backstory="你是一名严谨的主编，会从事实准确性、逻辑性、可读性三个维度审稿。",
    verbose=True
)

research_task = Task(
    description="调研 {topic} 的最新进展，输出结构化笔记",
    expected_output="Markdown 格式的研究笔记，包含 5 个关键发现",
    agent=researcher
)

write_task = Task(
    description="基于研究笔记撰写完整报告",
    expected_output="3000 字以内的 Markdown 报告",
    agent=writer,
    context=[research_task]
)

review_task = Task(
    description="审稿，输出修改意见或 PASS",
    expected_output="JSON: {passed: bool, issues: [...]}",
    agent=reviewer,
    context=[write_task]
)

crew = Crew(
    agents=[researcher, writer, reviewer],
    tasks=[research_task, write_task, review_task],
    process=Process.sequential,
    max_iterations=10,
    verbose=2
)

result = crew.kickoff(inputs={"topic": "国内 LLM 推理框架"})

CrewAI key settings process=Process.sequential – avoids uncontrolled hierarchical loops. allow_delegation=False – prevents agents from endlessly delegating to each other. max_iterations – hard cap on total rounds.

Explicit context=[prev_task] declares dependencies.

Custom orchestration vs frameworks (comparison)

Onboarding speed : AutoGen ★★★★, CrewAI ★★★★★, Custom ★★ (requires code).

Process controllability : AutoGen ★★★, CrewAI ★★★, Custom ★★★★★.

Tool calling : AutoGen ★★★★★, CrewAI ★★★★, Custom ★★★★★.

State management : AutoGen ★★ (relies on GroupChat), CrewAI ★★★, Custom ★★★★★.

Cost control : AutoGen ★★ (token blow‑up), CrewAI ★★, Custom ★★★★★.

Human‑AI collaboration : AutoGen ★★★★★, CrewAI ★★★, Custom ★★★★★.

Large‑scale parallelism : AutoGen ★★, CrewAI ★★, Custom ★★★★★.

Maintenance cost : AutoGen ★★★, CrewAI ★★★, Custom ★★ (self‑maintained).

Selection advice

1‑3 agents, fixed flow, low token cost → CrewAI .

Strong tool integration, code execution, auto‑testing → AutoGen .

Massive parallelism, complex state, strict budget, human‑in‑the‑loop → Custom orchestration .

Post‑launch evaluation

Effectiveness metrics

Task completion rate – source: Orchestrator logs – healthy ≥ 85 %.

Average rounds per task – source: Orchestrator logs – healthy ≤ 8 rounds.

Human‑intervention rate – source: User behavior logs – healthy ≤ 20 %.

Final draft human score – source: Reviewer + business side – healthy ≥ 4.0/5.0.

Hallucination rate – source: Spot checks + LLM‑as‑judge – healthy ≤ 5 %.

Cost metrics

Average cost per task – source: LLM API usage – healthy ≤ $0.50.

Monthly total cost – source: PromptLayer / custom metrics – must stay within budget.

Token utilization (output/input) – source: same as above – healthy ≥ 0.3.

Cache hit rate – source: Redis – healthy ≥ 30 %.

Stability metrics

Average task latency – source: Orchestrator – healthy ≤ 60 s.

P95 latency – source: Orchestrator – healthy ≤ 120 s.

Tool failure rate – source: ToolRegistry – healthy ≤ 2 %.

Loop/timeout rate (max_round hit) – source: Orchestrator – healthy ≤ 1 %.

Illegal state transitions – source: Orchestrator – must be 0.

Launch checklist

All state‑machine transitions have unit tests.

Each tool implements timeout, retry, and fallback.

Cost circuit‑breaker placed on the main orchestrator loop. max_round, max_iterations, and max_consecutive_auto_reply are configured.

Human‑in‑the‑loop notifications (Slack, WebSocket, email) are verified.

Golden tasks (5 normal + 5 adversarial) run end‑to‑end.

Monitoring alerts for cost spikes, loops, and tool failures are set.

Rollback procedure to previous prompt version is ready.

Common pitfalls & solutions

Pitfall 1 – Infinite "kick‑the‑ball" loops

Symptom : 47 rounds, token burn.

Root cause : AutoGen speaker_selection_method="auto" lets the LLM choose the next speaker, causing polite back‑and‑forth.

Fixes :

Set a hard max_round / max_iterations.

Use speaker_selection_method="round_robin" or a custom selector.

Add an explicit state‑machine transition to failed after N rounds.

Pitfall 2 – Context explosion

Symptom : 8000+ tokens on round 30.

Root cause : Full history appended to every prompt.

Solution : Implement a sliding‑window + summarization manager that keeps only recent turns and a compact summary of older turns.

class ContextManager:
    def __init__(self, max_recent_turns=5, summary_max_tokens=500):
        self.max_recent = max_recent_turns
        self.summary_tokens = summary_max_tokens
        self.history = []
        self.summary = ""

    def add(self, role, content):
        self.history.append({"role": role, "content": content})
        if len(self.history) > self.max_recent * 2:
            old = self.history[:len(self.history) - self.max_recent * 2]
            self.summary = self._summarize(old + [{"summary": self.summary}])
            self.history = self.history[-self.max_recent * 2:]

    def get_context(self):
        return f"Conversation summary:
{self.summary}

Recent turns:
" + "
".join(
            f"{m['role']}: {m['content']}" for m in self.history)

Pitfall 3 – Tool hallucination & no fallback

Symptom : Search tool returns fabricated data.

Fix : Ensure the tool returns an empty string on failure and let the agent explicitly acknowledge missing information. Add a fallback implementation (e.g., local knowledge base).

class Tool:
    def run(self, **kwargs):
        result = self._safe_call(**kwargs)
        if not result or "no results" in result.lower():
            return "[搜索无结果] 请换个关键词或承认信息缺失。"
        return result

Also add to the agent prompt: "If the tool returns nothing, admit you don't know instead of fabricating."

Pitfall 4 – Reviewer always passes

Symptom : Drafts never get rejected.

Root cause : LLM bias toward agreement.

Fix : Provide a strict system prompt that forces JSON output with explicit failure reasons.

reviewer_system_prompt = """你是主编。审稿必须严格遵守以下标准：
1. 所有事实必须有 source 引用；
2. 数据点必须有具体数字；
3. 逻辑推理链完整。
输出 JSON：{passed: bool, issues: [{type: string, location: string}]}.
如果有任何问题，必须 passed=false。"""

Pitfall 5 – Cost runaway

Symptom : Monthly bill jumps from $500 to $8000.

Root cause : Repeated searches, rewrites, and unchecked context growth.

Solution : Introduce a cost circuit‑breaker that checks estimated cost before each step and records actual spend.

class CostCircuitBreaker:
    def __init__(self, budget_usd: float):
        self.budget = budget_usd
        self.spent = 0.0

    def check(self, estimated_cost: float):
        if self.spent + estimated_cost > self.budget:
            raise BudgetExceededError(
                f"将超出预算 ${self.budget:.2f}，当前已花 ${self.spent:.2f}"
            )

    def record(self, actual_cost: float):
        self.spent += actual_cost
        if self.spent > self.budget * 0.8:
            logger.warning(f"⚠️ 成本已达预算 {self.spent/self.budget:.0%}")

Optimization directions

Current : Serial agents, manual state machine, human spot‑check only.

Mid‑term : Explicit parallelism via message bus + worker pool; visual state‑machine platform; LLM‑as‑judge + spot‑check; prompt versioning; model routing (small model for simple steps).

Long‑term : DAG‑based orchestration with automatic scheduling; self‑healing routing on failures; fully automated A/B testing & prompt optimization; agent distillation (small agents mimic large ones).

Cheat sheet (quick reference)

┌──────────────────────────────────────────────────────────┐
│  多 Agent 系统速查                                         │
├──────────────────────────────────────────────────────────┤
│ 选型   3 Agent 报告类 → CrewAI                           │
│         强工具代码类 → AutoGen                            │
│         大规模/严格控成本 → 自研编排                     │
│ 上限   max_round / max_iterations 必须设                     │
│ 工具   Tool: 超时 + 重试 + 降级                           │
│ 上下文 滑动窗口 + 摘要，避免 token 爆炸                     │
│ 状态机 transitions 库 / 自研                           │
│ 成本   CostCircuitBreaker + 单任务预算                     │
│ 评测   完成率 + 轮次 + 人工评分 + LLM‑as‑judge            │
│ 监控   循环率、token 单价、工具失败率、幻觉率           │
└──────────────────────────────────────────────────────────┘

Final takeaway

The difficulty of a multi‑agent system lies not in the LLM calls themselves but in the orchestration – a solid state machine, message bus, tool registry, and cost‑circuit‑breaker let you scale from three agents to dozens without the system collapsing.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

LLM Multi-Agent Systems Cost Control Orchestration AutoGen CrewAI

Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

Problem observed in a naïve three‑agent research assistant

Why a multi‑agent system?

Single‑agent vs multi‑agent example

Five‑layer orchestration model

Full task flow example (research‑assistant)

Key code & configuration

AutoGen (code‑generation focus)

CrewAI (report‑generation focus)

Custom orchestration vs frameworks (comparison)

Post‑launch evaluation

Effectiveness metrics

Cost metrics

Stability metrics

Launch checklist

Common pitfalls & solutions

Pitfall 1 – Infinite "kick‑the‑ball" loops

Pitfall 2 – Context explosion

Pitfall 3 – Tool hallucination & no fallback

Pitfall 4 – Reviewer always passes

Pitfall 5 – Cost runaway

Optimization directions

Cheat sheet (quick reference)

Final takeaway

MaGe Linux Operations

How this landed with the community

Was this worth your time?

0 Comments

Pitfall 1 – Infinite "kick‑the‑ball" loops

Pitfall 2 – Context explosion

Pitfall 3 – Tool hallucination & no fallback

Pitfall 4 – Reviewer always passes

Pitfall 5 – Cost runaway