
How a Multi‑Agent AI System Revolutionizes AIOps Root‑Cause Analysis

This article details a multi‑agent AIOps solution built on the Dify platform that automates fault detection, root‑cause analysis, and incident reporting by integrating metrics, logs, and trace data, dramatically reducing mean time to detect and resolve complex cloud‑native service failures.

Alibaba Cloud Developer

Project Background

As enterprises move to microservice, containerized, cloud-native architectures, their systems generate massive volumes of metrics, logs and traces. Manual root-cause analysis (RCA) cannot keep up with that volume.

Objectives

Build a multi‑agent AI system that automatically ingests observability data, performs intelligent RCA and delivers concise interactive results to engineers via enterprise messaging, reducing mean time to detection (MTTD) and mean time to recovery (MTTR).

Architecture

Implemented on the Dify workflow engine, the solution is organized into three logical layers:

Task Planning Layer: The “Operations Expert” agent creates a step-by-step investigation plan and dispatches sub-tasks to specialized agents.

Perception Layer: Data agents retrieve real-time metrics, logs, traces and change events from back-ends (Prometheus, Zabbix, ARMS, SLS, CMDB) through standardized MCP services.

Analysis Decision Layer: The “Duty Officer” agent aggregates evidence, performs structured reasoning and decides whether to continue or terminate the investigation.

Agents involved: Task Planning, Metric Analysis, Log Analysis, Topology Awareness, Analysis Decision, Final Output.
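The division of labour between the three layers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `SubTask`, `Observation`, `plan`, `decide` and `run_once` are hypothetical, the plan is hard-coded instead of being produced by an LLM, and the perception agents are stubs rather than real MCP calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubTask:
    agent: str  # which perception agent should handle the task
    query: str  # what that agent should look up


@dataclass
class Observation:
    agent: str
    summary: str


def plan(incident: str) -> List[SubTask]:
    """Task Planning Layer: turn an incident into ordered sub-tasks (hard-coded here)."""
    return [
        SubTask("metric", f"CPU and memory for services involved in: {incident}"),
        SubTask("log", f"error logs related to: {incident}"),
        SubTask("topology", f"upstream and downstream dependencies for: {incident}"),
    ]


# Perception Layer: each agent would query a real back-end; stubs stand in here.
PERCEPTION_AGENTS: Dict[str, Callable[[str], Observation]] = {
    "metric": lambda q: Observation("metric", f"[stub metrics] {q}"),
    "log": lambda q: Observation("log", f"[stub logs] {q}"),
    "topology": lambda q: Observation("topology", f"[stub topology] {q}"),
}


def decide(observations: List[Observation]) -> dict:
    """Analysis Decision Layer: aggregate evidence and decide whether to stop."""
    covered = {o.agent for o in observations}
    # Assumed stopping rule for the sketch: metric and log evidence are both present.
    return {
        "done": covered >= {"metric", "log"},
        "evidence": [o.summary for o in observations],
    }


def run_once(incident: str) -> dict:
    """Run one planning, perception and decision pass."""
    observations = [PERCEPTION_AGENTS[t.agent](t.query) for t in plan(incident)]
    return decide(observations)
```

Keeping the perception agents behind a plain name-to-callable registry means the planner only needs to know agent names, not which monitoring back-end each one talks to.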

[Figure: System Architecture Diagram]

Key Optimizations

Latency reduction: Removed unnecessary loop iterations, added pre-processing prompts and intermediate output modules.

Context compression: Summarized historical observations and cited only essential evidence.

Dynamic loop termination: Replaced fixed iteration limits with evidence-sufficiency checks.

Stable final output: A mandatory summarization step produces a structured RCA report.
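Two of these optimizations, context compression and dynamic loop termination, lend themselves to a short sketch. Everything below is an assumption made for illustration: the `Evidence` fields, the 0.8 confidence threshold, the required source set and the `safety_cap` guard are hypothetical, and simple truncation stands in for what would be an LLM summarization step in the real workflow.

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, Iterable, List, Set


def compress_history(observations: List[str], keep_last: int = 2,
                     max_chars: int = 80) -> List[str]:
    """Keep the newest observations verbatim and shorten the older ones.

    Truncation is a stand-in for a real summarization call.
    """
    cut = max(len(observations) - keep_last, 0)
    older, recent = observations[:cut], observations[cut:]
    digests = [o if len(o) <= max_chars else o[:max_chars].rstrip() + "..."
               for o in older]
    return digests + recent


@dataclass
class Evidence:
    sources: Set[str] = field(default_factory=set)  # agent types that returned data
    confidence: float = 0.0                         # hypothetical score from the decision agent


def is_sufficient(ev: Evidence,
                  required: FrozenSet[str] = frozenset({"metric", "log"}),
                  min_confidence: float = 0.8) -> bool:
    """Evidence-sufficiency check that replaces a fixed iteration count."""
    return required <= ev.sources and ev.confidence >= min_confidence


def investigate(steps: Iterable[Callable[[Evidence], None]],
                safety_cap: int = 20) -> int:
    """Run steps until the evidence is sufficient; return the number of rounds used.

    safety_cap is an extra guard added for this sketch, not something the article describes.
    """
    ev = Evidence()
    rounds = 0
    for step in steps:
        if rounds >= safety_cap:
            break
        step(ev)
        rounds += 1
        if is_sufficient(ev):
            break
    return rounds
```

With this shape, an investigation that confirms its hypothesis after two rounds stops after two rounds, while a harder incident keeps going until the evidence check passes.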

ReAct Workflow

The system follows the ReAct (Reasoning + Acting) pattern. Each iteration consists of a Thought (reasoning) and an Action (tool call). Example: the Task Planning agent hypothesizes that a latency spike is caused by resource saturation, triggers the Metric agent to fetch CPU/Memory data, then the Log agent for error stacks, and finally the Trace agent for the call chain. The Duty Officer evaluates the collected evidence, decides whether the hypothesis is confirmed, and either terminates the loop or generates a new plan.
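The Thought and Action cycle described above can be sketched as a small loop. This is a minimal illustration, not the Dify workflow itself: `react_loop`, `scripted_reason` and the stub tools are hypothetical names, and a fixed script replaces the LLM that would normally produce each thought.

```python
from typing import Callable, Dict, List, Optional, Tuple

# One reasoning step: (thought, tool name or None to stop, tool input)
Step = Tuple[str, Optional[str], str]


def react_loop(reason: Callable[[List[str]], Step],
               tools: Dict[str, Callable[[str], str]],
               max_steps: int = 10) -> List[str]:
    """Alternate Thought and Action until the reasoner returns no tool."""
    transcript: List[str] = []
    for _ in range(max_steps):
        thought, tool, tool_input = reason(transcript)
        transcript.append(f"Thought: {thought}")
        if tool is None:  # the reasoner judged the evidence sufficient
            break
        transcript.append(f"Action: {tool}({tool_input})")
        transcript.append(f"Observation: {tools[tool](tool_input)}")
    return transcript


def scripted_reason(transcript: List[str]) -> Step:
    """Stand-in for the LLM: replays the latency-spike example from the text."""
    seen = sum(1 for line in transcript if line.startswith("Observation:"))
    script: List[Step] = [
        ("Latency spike may be resource saturation; check CPU and memory.",
         "metric", "cpu,memory"),
        ("Need application-level evidence; look for error stacks.",
         "log", "level=ERROR"),
        ("Follow the request path; inspect the call chain.",
         "trace", "slow spans"),
        ("Evidence is sufficient; stop and report.", None, ""),
    ]
    return script[seen]


# Stub tools; real ones would call the metric, log and trace agents.
TOOLS: Dict[str, Callable[[str], str]] = {
    "metric": lambda q: f"stub metric data for {q}",
    "log": lambda q: f"stub log lines for {q}",
    "trace": lambda q: f"stub call chain for {q}",
}
```

Calling `react_loop(scripted_reason, TOOLS)` yields a transcript of three Thought, Action, Observation triples followed by a closing Thought, which is the kind of record a decision agent could evaluate.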

Agent Output Schema (Metric Example)

{
  "agent_type": "metric",
  "status": "success",
  "summary": "Fetched CPU utilization for ECS instance i-abc123; usage is 63.7%, below saturation, suggesting the issue lies elsewhere.",
  "data": {
    "metrics": [{
      "namespace": "acs_ecs_dashboard",
      "metricName": "AliyunEcs_CPUUtilization",
      "unit": "%",
      "tags": {"instanceId": "i-abc123", "regionId": "cn-shanghai"},
      "values": [{"timestamp": 1758023820, "value": 63.7}]
    }]
  },
  "error_message": null
}
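A consumer of this schema would typically validate the envelope before passing it to the decision agent. The sketch below shows one way to do that. `AgentResult`, `parse_agent_result` and `latest_value` are hypothetical helper names, and the rule that a non-success result must carry an `error_message` is an assumption, since the article shows only the success case.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Optional

REQUIRED_KEYS = {"agent_type", "status", "summary", "data", "error_message"}


@dataclass
class AgentResult:
    agent_type: str
    status: str
    summary: str
    data: Dict[str, Any]
    error_message: Optional[str]


def parse_agent_result(raw: str) -> AgentResult:
    """Parse an agent's JSON output and check the top-level envelope."""
    payload = json.loads(raw)
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    # Assumed rule: anything other than "success" should explain itself.
    if payload["status"] != "success" and not payload["error_message"]:
        raise ValueError("non-success result must carry an error_message")
    return AgentResult(**{k: payload[k] for k in REQUIRED_KEYS})


def latest_value(result: AgentResult, metric_name: str) -> Optional[float]:
    """Return the most recent data point for a metric, or None if absent."""
    for metric in result.data.get("metrics", []):
        if metric.get("metricName") == metric_name and metric.get("values"):
            return max(metric["values"], key=lambda v: v["timestamp"])["value"]
    return None


SAMPLE = """
{
  "agent_type": "metric",
  "status": "success",
  "summary": "Fetched CPU utilization for ECS instance i-abc123.",
  "data": {"metrics": [{
    "namespace": "acs_ecs_dashboard",
    "metricName": "AliyunEcs_CPUUtilization",
    "unit": "%",
    "tags": {"instanceId": "i-abc123", "regionId": "cn-shanghai"},
    "values": [{"timestamp": 1758023820, "value": 63.7}]
  }]},
  "error_message": null
}
"""

result = parse_agent_result(SAMPLE)
print(latest_value(result, "AliyunEcs_CPUUtilization"))  # 63.7
```

Because every agent shares the same envelope, the same parser can serve the metric, log and topology agents, with only the contents of `data` differing by type.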

Tool Interfaces

All observability data is accessed through Alibaba Cloud OpenAPI MCP services (https://api.aliyun.com/mcp) and the open-source observability MCP server (https://github.com/aliyun/alibabacloud-observability-mcp-server). These services provide unified endpoints for metric queries, log searches, trace retrieval and CMDB lookups, decoupling the large language model from the underlying monitoring systems.
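To make the decoupling concrete, here is a sketch of the request an agent would send to invoke a tool. MCP is built on JSON-RPC 2.0 and invokes tools through the `tools/call` method. The helper `build_tool_call`, the tool name `query_metric` and its arguments are hypothetical placeholders, not names taken from the Alibaba Cloud servers. The real tool names should be discovered from the server's tool listing.

```python
import json
from itertools import count
from typing import Any, Dict

_request_ids = count(1)  # simple monotonically increasing JSON-RPC ids


def build_tool_call(tool_name: str, arguments: Dict[str, Any]) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# Hypothetical tool name and arguments, for illustration only.
message = build_tool_call(
    "query_metric",
    {"metricName": "AliyunEcs_CPUUtilization", "instanceId": "i-abc123"},
)
print(message)
```

The agent only needs to know a tool's name and argument schema. Whether the data ultimately comes from Prometheus, Zabbix, ARMS or SLS is hidden behind the MCP server.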

Performance Results

After one month of deployment, the system raised the root-cause identification success rate from approximately 20% to 70%. Ongoing work focuses on further reducing workflow latency, integrating a richer knowledge base, and model-level enhancements to improve reasoning accuracy.

Conclusion

The multi‑agent, ReAct‑driven AIOps platform demonstrates that orchestrated LLM reasoning combined with real‑time observability APIs can automate complex RCA tasks, provide actionable insights to engineers, and significantly improve operational efficiency in cloud‑native environments.
