How Multi-Dimensional Root Cause Analysis Boosts Monitoring Efficiency with AI
This article outlines the challenges of multi‑dimensional monitoring and the limits of traditional alerting, then presents the MDRCA algorithm, which builds on K‑means clustering and adds Explanatory Power and Surprise metrics to pinpoint root causes efficiently. It closes with practical lessons on integrating AI into a large‑scale monitoring platform.
Introduction
The monitoring team at Tencent SNG built a multi‑dimensional monitoring platform and, after a two‑month research phase, developed a new Multi‑Dimensional Root Cause Analysis (MDRCA) algorithm that incorporates AI techniques to improve anomaly detection and analysis.
Characteristics of Multi‑Dimensional Monitoring Data
Traditional monitoring focuses on single entities (servers, routers) with metrics such as CPU usage or network traffic. Modern services require monitoring of virtual business modules, where a single alarm may involve many related objects, leading to high alert volume and long analysis times. To address this, objects and metrics are modeled in multiple dimensions (time, business attributes, and metric values) to form a multi‑dimensional dataset.
Typical dimensions include:
Time dimension (usually 1‑minute granularity)
Business attribute dimension (e.g., module, app version, device type, carrier, region)
Metric dimension (e.g., success rate, latency, request count)
Mapping physical machines onto business modules reduces the number of monitored objects, and the hierarchical relationships between dimensions make root‑cause tracing faster.
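As a minimal sketch of this data model, the Python snippet below represents each data point as a record carrying a timestamp, a few business attributes, and raw counts, then aggregates the success rate along any one attribute. The `Record` class, its field names, and the sample values are illustrative assumptions, not the platform's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One data point of the multi-dimensional dataset (hypothetical schema)."""
    minute: str      # time dimension, 1-minute granularity
    module: str      # business attribute
    region: str      # business attribute
    requests: int    # metric: request count
    successes: int   # metric: successful requests


def success_rate_by(records, dimension):
    """Aggregate success rate for every value of one business attribute."""
    totals = defaultdict(lambda: [0, 0])  # value -> [successes, requests]
    for r in records:
        key = getattr(r, dimension)
        totals[key][0] += r.successes
        totals[key][1] += r.requests
    return {k: s / n for k, (s, n) in totals.items() if n > 0}


# Sample data, invented for illustration only.
records = [
    Record("2024-01-01T10:00", "A", "south", 1000, 990),
    Record("2024-01-01T10:00", "B", "south", 1000, 950),
    Record("2024-01-01T10:00", "B", "north", 1000, 950),
]

rates_by_module = success_rate_by(records, "module")
rates_by_region = success_rate_by(records, "region")
print(rates_by_module)
print(rates_by_region)
```

Storing raw counts rather than precomputed rates keeps the data additive, so the same records can be rolled up along any dimension without re-weighting.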
K‑means‑Based Multi‑Dimensional Root Cause Analysis
To avoid manual inspection of each dimension, K‑means clustering is applied to success‑rate metrics. The process groups similar success‑rate patterns, highlights anomalous dimensions, and guides a second‑level analysis.
Example: among three modules A, B, and C, each serving several command codes, module B has a significantly lower success rate (≈95 %) than A and C, which marks B as a suspicious dimension value. Drilling into B's command codes then shows that b1 is the primary cause of the degradation.
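The two-level drill-down can be sketched in plain Python as follows. The tiny one-dimensional K-means, the `min_gap` guard, and every success rate other than module B's ≈95 % are assumptions made for illustration; the article does not describe the production implementation.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Very small 1-D K-means (k >= 2) with deterministic initialisation."""
    lo, hi = min(values), max(values)
    # Spread the initial centroids evenly between the smallest and largest value.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids


def suspicious_values(rates, min_gap=0.01):
    """Return the dimension values that fall into the low success-rate cluster.

    min_gap is a hypothetical guard: if the two cluster centres are closer
    than this, nothing is flagged, because K-means always splits the data
    even when every value looks healthy.
    """
    names = list(rates)
    labels, centroids = kmeans_1d([rates[n] for n in names], k=2)
    if abs(centroids[1] - centroids[0]) < min_gap:
        return []
    low = min(range(2), key=lambda c: centroids[c])
    return [n for n, label in zip(names, labels) if label == low]


# First level: success rate per module (A and C values are invented).
module_rates = {"A": 0.995, "B": 0.95, "C": 0.993}
suspect_modules = suspicious_values(module_rates)

# Second level: success rate per command code inside module B (invented).
command_rates_b = {"b1": 0.80, "b2": 0.99, "b3": 0.992}
suspect_commands = suspicious_values(command_rates_b)

print(suspect_modules, suspect_commands)
```

Clustering replaces a fixed threshold: the low cluster is defined relative to the other values of the same dimension, so the same code can be reused at each level of the drill-down.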
MDRCA Algorithm
The MDRCA algorithm extends the K‑means approach to address two shortcomings: the K‑means approach works only on success rate, which is a composite metric, and it does not weight results by total request volume, so a low‑traffic dimension value counts as much as a high‑traffic one. MDRCA introduces two metrics:
Explanatory Power (EP): measures how much of the observed anomaly is contributed by a specific value j of dimension i. It is calculated as the ratio of that value's deviation from its predicted baseline to the overall deviation.
Surprise: quantifies how unexpected the change is, using the Jensen‑Shannon divergence between the value's share of the total in the predicted baseline and its share in the observed data.
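One common way to write these two metrics, assumed here because the article gives no formulas, is EP_ij = (A_ij - F_ij) / (A - F) and Surprise_ij = 0.5 * (p * log(2p / (p + q)) + q * log(2q / (p + q))), where F_ij and A_ij are the predicted and observed values for value j of dimension i, F and A are the overall totals, p = F_ij / F, and q = A_ij / A. The sketch below applies that formulation to failure counts per module; the function names and the numbers are hypothetical.

```python
import math


def explanatory_power(forecast_ij, actual_ij, forecast_total, actual_total):
    """Share of the overall deviation explained by one dimension value."""
    overall = actual_total - forecast_total
    if overall == 0:
        return 0.0  # no anomaly to explain
    return (actual_ij - forecast_ij) / overall


def _js_term(x, m):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x / m)


def surprise(forecast_ij, actual_ij, forecast_total, actual_total):
    """Jensen-Shannon term between the predicted and observed proportions."""
    p = forecast_ij / forecast_total   # share in the predicted baseline
    q = actual_ij / actual_total       # share in the observed data
    m = (p + q) / 2
    if m == 0:
        return 0.0
    return 0.5 * (_js_term(p, m) + _js_term(q, m))


# Failure counts per module: an additive metric, values invented.
forecast = {"A": 50, "B": 50, "C": 50}
actual = {"A": 52, "B": 500, "C": 48}
F, A = sum(forecast.values()), sum(actual.values())

ep = {m: explanatory_power(forecast[m], actual[m], F, A) for m in forecast}
sp = {m: surprise(forecast[m], actual[m], F, A) for m in forecast}

for m in forecast:
    print(m, round(ep[m], 4), round(sp[m], 4))
```

For an additive metric the EP values of one dimension sum to 1, which makes them comparable across dimensions. Surprise is non-zero for A and C as well, because their share of all failures shrinks when B's grows; this is one reason neither metric is used alone.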
EP and Surprise are plotted against each other on a four‑quadrant chart; dimension values with both high EP and high Surprise are selected as candidate root causes. The algorithm then aggregates the candidates into sets, computes the average Surprise of each set, and prioritizes the dimensions with the largest contribution.
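A rough sketch of this selection step is shown below. It assumes that candidates are grouped per dimension, that "largest contribution" means the summed EP of a group, and that average Surprise breaks ties; the two thresholds and the scores are placeholders, not values from the platform.

```python
def rank_dimensions(scores, ep_threshold=0.1, surprise_threshold=0.01):
    """Pick high-EP, high-Surprise values and rank the dimensions they belong to.

    scores maps dimension -> {value: (ep, surprise)}.
    Returns a list of (dimension, candidate_values, total_ep, avg_surprise).
    """
    ranked = []
    for dimension, values in scores.items():
        # Upper-right quadrant: both metrics above their thresholds.
        candidates = {
            value: (ep, s)
            for value, (ep, s) in values.items()
            if ep >= ep_threshold and s >= surprise_threshold
        }
        if not candidates:
            continue
        total_ep = sum(ep for ep, _ in candidates.values())
        avg_surprise = sum(s for _, s in candidates.values()) / len(candidates)
        ranked.append((dimension, sorted(candidates), total_ep, avg_surprise))
    # Largest contribution first; average Surprise breaks ties (assumption).
    ranked.sort(key=lambda row: (row[2], row[3]), reverse=True)
    return ranked


# Invented scores: the anomaly is concentrated in module B but spread
# evenly over regions, so the region dimension carries no Surprise.
scores = {
    "module": {"A": (0.004, 0.039), "B": (1.0, 0.055), "C": (-0.004, 0.040)},
    "region": {"south": (0.5, 0.0), "north": (0.5, 0.0)},
}

ranking = rank_dimensions(scores)
print(ranking)
```

In this example the region values explain the anomaly in equal halves, yet their proportions did not change, so they stay out of the candidate set and only module B is reported.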
The overall MDRCA workflow is illustrated in the diagram below.
AI Application Experience
Applying AI to monitoring requires clear role definitions:
Domain Expert: provides business pain points and context.
AI Expert: designs algorithms based on the domain insights.
Algorithm Engineering Expert: implements and optimizes the algorithms for production.
Application Development Expert: integrates AI results into user‑facing tools and collects feedback.
The development process includes thorough research, reading relevant papers, and iterative communication among roles to avoid missteps and accelerate adoption.
Conclusion
The MDRCA algorithm demonstrates how multi‑dimensional data modeling, K‑means clustering, and AI‑driven metrics (EP and Surprise) can significantly reduce alert volume and analysis time in large‑scale monitoring systems. Ongoing work aims to refine the algorithm and publish further case studies.