A Comprehensive Study of Failure Modes in Large‑Language‑Model Based Multi‑Agent Systems
This paper presents a systematic investigation of failure patterns in LLM‑driven multi‑agent systems, introducing a 14‑type taxonomy (MASFT) derived from over 150 annotated dialogues, evaluating it with an LLM‑as‑a‑judge pipeline, and exploring modest intervention strategies while releasing all data and tools for future research.
Although large‑language‑model (LLM) based agentic systems have attracted significant attention, their performance gains over single‑agent baselines on common benchmarks remain limited, highlighting the need to understand the challenges that hinder multi‑agent system (MAS) effectiveness.
We conducted the first comprehensive study of MAS failures, collecting more than 150 execution dialogues from five popular open‑source MAS frameworks and engaging six expert annotators. Through grounded‑theory analysis we identified 14 distinct failure modes, which we organized into three high‑level categories, forming the MAS Failure Taxonomy (MASFT). Inter‑annotator agreement reached a Cohen’s Kappa of 0.88, confirming the reliability of the taxonomy.
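As an aside, the agreement statistic reported here is straightforward to compute. The sketch below implements Cohen's Kappa for two annotators over the same set of items; the failure-mode labels in the example are illustrative placeholders, not the paper's actual annotation data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Illustrative labels for four annotated dialogue segments.
ann1 = ["misalignment", "misalignment", "verification", "verification"]
ann2 = ["misalignment", "verification", "verification", "verification"]
print(round(cohens_kappa(ann1, ann2), 2))  # → 0.5
```

A Kappa of 0.88, as reported above, sits well into the range conventionally read as "almost perfect" agreement, which is why it supports the taxonomy's reliability claim.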
To enable scalable evaluation, we combined MASFT with an “LLM‑as‑a‑judge” approach using OpenAI’s o1 model. On a set of 10 dialogues, the automated judgments achieved a Cohen’s Kappa of 0.77 against expert annotations, and few‑shot prompting improved accuracy to 94%.
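The shape of such a judge pipeline can be sketched as follows. This is not the paper's actual prompt or label set; the failure-mode names, prompt wording, and response format below are hypothetical, and the call to the judge model is omitted (only a mocked response is parsed):

```python
import re

# Hypothetical subset of MASFT failure modes used as the judge's label space.
FAILURE_MODES = ["disobey_task_spec", "inter_agent_misalignment", "premature_termination"]

def build_judge_prompt(dialogue, few_shot_examples):
    """Assemble a few-shot prompt asking a judge model to tag failure modes."""
    parts = [
        "You are an expert annotator. Label the dialogue with the MASFT "
        f"failure modes it exhibits, chosen from: {', '.join(FAILURE_MODES)}. "
        "Answer with one 'MODE: yes/no' line per mode."
    ]
    for example_dialogue, example_labels in few_shot_examples:
        parts.append(f"Dialogue:\n{example_dialogue}\nLabels:\n{example_labels}")
    parts.append(f"Dialogue:\n{dialogue}\nLabels:")
    return "\n\n".join(parts)

def parse_judge_response(response):
    """Parse 'MODE: yes/no' lines from the judge model's raw text output."""
    labels = {}
    for mode in FAILURE_MODES:
        match = re.search(rf"{mode}\s*:\s*(yes|no)", response, re.IGNORECASE)
        labels[mode] = bool(match) and match.group(1).lower() == "yes"
    return labels

# In a real pipeline the prompt would be sent to the judge model;
# here we only parse a mocked response.
mock_response = ("disobey_task_spec: no\n"
                 "inter_agent_misalignment: yes\n"
                 "premature_termination: yes")
print(parse_judge_response(mock_response))
```

Anchoring the judge to a fixed, enumerated label space and a rigid answer format is what makes its output comparable against expert annotations, which is how agreement figures like the 0.77 Kappa above can be measured.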
We examined two simple mitigation strategies—enhancing specification prompts and redesigning agent orchestration—which yielded modest improvements (e.g., a 14% accuracy gain for ChatDev) but were insufficient to resolve all failure cases, indicating that deeper structural changes are required.
The analysis revealed that failures arise from (1) specification and system‑design errors, (2) inter‑agent misalignment, and (3) inadequate verification and termination mechanisms. Failure distributions differ across systems, and many issues mirror those observed in high‑reliability organizations.
Our main contributions are: (1) the MASFT taxonomy for diagnosing MAS failures, (2) a scalable LLM‑as‑a‑judge evaluation pipeline, (3) empirical evaluation of prompt‑based and architectural interventions, and (4) the open release of all annotated dialogues, annotation pipelines, and expert‑annotated samples to foster further research.
Overall, building robust and reliable multi‑agent LLM systems demands attention not only to model capabilities but also to role definition, communication protocols, verification processes, and system design principles.