
CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms

This article presents the design, implementation, and evaluation of CloudRCA, an intelligent root cause analysis framework for Alibaba Cloud's big-data computing services. It covers the main challenges (heterogeneous data, sample imbalance, and real-time constraints), the multi-stage data processing and hierarchical Bayesian modeling used to address them, and deployment results showing an MTTR reduction of about 20%.

DataFunSummit

Introduction

This article describes how Alibaba Cloud's intelligent operations team applies root cause analysis (RCA) to big-data platforms, where the team's work centers on stability, cost, and efficiency. It covers five topics: goals and challenges, heterogeneous data processing, the RCA model, the CloudRCA framework, and deployment experience.

Root Cause Analysis Goals and Challenges

The goal is to reduce MTTR (mean time to resolve) by speeding up the RCA step, which is often the most time-consuming part of handling an incident. The challenges are multi-source heterogeneous data, a large volume of interfering signals, sample imbalance, reuse across platforms, and the need for a short model runtime.

Heterogeneous Data Processing

Four typical data types are identified: alarm events, metrics, log data, and topology data. Metrics are converted into binary time series with the help of a multi-period detection algorithm (RobustPeriod). Log data are compressed and clustered using FT-Tree and deep-learning-based feature extraction, and each resulting log category is then converted into a binary sequence. These processed streams are the inputs to the downstream models.
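
To make the "binary sequence" idea concrete, here is a minimal Python sketch of both conversions. It is an illustration under stated assumptions, not the CloudRCA implementation: the period is passed in by hand rather than detected by RobustPeriod, a seasonal-median baseline with a MAD threshold stands in for the anomaly detector, and a regex mask stands in for FT-Tree and the deep-learning features. The names `binarize_metric`, `log_template`, and `logs_to_binary`, and the threshold of 3.5, are hypothetical.

```python
import re
from collections import defaultdict
from statistics import median


def binarize_metric(values, period, threshold=3.5):
    """Return a 0/1 list marking points that deviate from their seasonal baseline.

    Assumes `period` is known and len(values) >= period.
    """
    # Seasonal baseline: median of all points that share the same phase.
    baseline = [median(values[phase::period]) for phase in range(period)]
    residuals = [v - baseline[i % period] for i, v in enumerate(values)]
    # Robust scale: median absolute deviation (MAD) of the residuals.
    center = median(residuals)
    mad = median(abs(r - center) for r in residuals) or 1e-9
    # 0.6745 rescales MAD so the score is comparable to a z-score.
    return [int(abs(0.6745 * (r - center) / mad) > threshold) for r in residuals]


def log_template(line):
    """Mask hex values and numbers so similar lines share one template."""
    return re.sub(r"0x[0-9a-fA-F]+|\b\d+(?:\.\d+)*\b", "<*>", line)


def logs_to_binary(events, n_windows):
    """events: iterable of (window_index, raw_line).

    Returns {template: [0/1 per window]}, where 1 means the template occurred.
    """
    series = defaultdict(lambda: [0] * n_windows)
    for window, line in events:
        series[log_template(line)][window] = 1
    return dict(series)


if __name__ == "__main__":
    cpu = [10, 20, 30, 40] * 6
    cpu[9] = 90  # inject one spike
    print(binarize_metric(cpu, period=4))
    print(logs_to_binary([(0, "Task 1042 failed on 10.0.0.7"),
                          (2, "Task 7 failed on 10.0.0.9")], n_windows=3))
```

After this step, metrics and log categories share one representation (a 0/1 value per time window), which is what lets a single downstream model consume both.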

Root Cause Analysis Model Construction

Three problem formulations are discussed: drill-down analysis, root-cause classification, and causal inference. The authors combine classification and causal inference in a Knowledge-based Hierarchical Bayesian Network (KHBN) that learns causal relationships among events, metrics, and log categories. The PC algorithm is used to construct the causal graph, and the root cause with the highest conditional probability given the observed evidence is selected.
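
The selection step can be illustrated with a much smaller model than KHBN. The sketch below assumes a two-layer structure in which each binary feature depends only on the root cause, so it does not run the PC algorithm or learn a graph; it only shows how conditional probabilities estimated from labeled samples (for example, fault-injection runs) rank candidate root causes. The function names, feature names, and the Laplace smoothing constant `alpha` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit(samples, alpha=1.0):
    """samples: list of (root_cause, {feature_name: 0 or 1})."""
    cause_counts = Counter(cause for cause, _ in samples)
    on_counts = defaultdict(Counter)  # on_counts[cause][feature] = times feature was 1
    features = set()
    for cause, observed in samples:
        for name, value in observed.items():
            features.add(name)
            on_counts[cause][name] += value
    return {"causes": cause_counts, "on": on_counts,
            "features": sorted(features), "alpha": alpha}


def posterior(model, evidence):
    """Return P(root_cause | evidence); features missing from evidence count as 0."""
    alpha = model["alpha"]
    total = sum(model["causes"].values())
    log_scores = {}
    for cause, n in model["causes"].items():
        score = math.log(n / total)  # prior P(cause)
        for name in model["features"]:
            # Smoothed estimate of P(feature = 1 | cause).
            p_on = (model["on"][cause][name] + alpha) / (n + 2 * alpha)
            score += math.log(p_on if evidence.get(name, 0) else 1.0 - p_on)
        log_scores[cause] = score
    # Normalize in log space to avoid underflow with many features.
    peak = max(log_scores.values())
    weights = {c: math.exp(s - peak) for c, s in log_scores.items()}
    norm = sum(weights.values())
    return {c: w / norm for c, w in weights.items()}


if __name__ == "__main__":
    training = (
        [("disk_full", {"disk_alarm": 1, "io_wait_high": 1, "oom_log": 0})] * 3
        + [("memory_leak", {"disk_alarm": 0, "io_wait_high": 0, "oom_log": 1})] * 3
    )
    model = fit(training)
    print(posterior(model, {"disk_alarm": 1, "io_wait_high": 1}))
```

In a learned causal graph the factors would follow the graph's parent sets rather than a single shared parent, but the final step is the same: normalize the scores and pick the candidate with the largest posterior.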

CloudRCA Framework

The CloudRCA pipeline integrates event streams, metric anomaly detection, and log clustering, and trains the KHBN using fault-injection data and CMDB entity relationships. At inference time it selects the root-cause type with the highest conditional probability and falls back to module-level inference when confidence is low.
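
The fallback rule can be sketched as follows. This is a hedged illustration only: the confidence threshold of 0.6, the cause-to-module mapping, and the choice to score a module by summing the probabilities of its root causes are all assumptions made for the example, not details given in the talk.

```python
from collections import defaultdict


def diagnose(posterior, cause_to_module, min_confidence=0.6):
    """Return (level, name, probability), with level 'root_cause' or 'module'."""
    best_cause = max(posterior, key=posterior.get)
    if posterior[best_cause] >= min_confidence:
        return ("root_cause", best_cause, posterior[best_cause])
    # Low confidence: aggregate probability mass per module and report the
    # most likely module instead of a specific root-cause type.
    module_mass = defaultdict(float)
    for cause, probability in posterior.items():
        module_mass[cause_to_module[cause]] += probability
    best_module = max(module_mass, key=module_mass.get)
    return ("module", best_module, module_mass[best_module])


if __name__ == "__main__":
    modules = {"disk_full": "storage", "disk_slow": "storage", "net_loss": "network"}
    print(diagnose({"disk_full": 0.9, "disk_slow": 0.05, "net_loss": 0.05}, modules))
    print(diagnose({"disk_full": 0.35, "disk_slow": 0.30, "net_loss": 0.35}, modules))
```

The second call shows why a coarser answer can still be useful: no single root cause is convincing, yet most of the probability mass points at one module, which narrows where an on-call engineer should look.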

Deployment Experience and Evaluation

Ablation studies show that removing either feature-engineering component (metric anomaly detection or log clustering) degrades performance. Experiments across MaxCompute, Flink, and Hologres demonstrate the importance of expert knowledge for newer platforms and the benefit of transfer learning for modules shared between platforms. In deployment, CloudRCA reduces the time to recover from anomalies by about 20%.

Future Work

Planned improvements include extracting causal paths for better interpretability, using GAN-based active learning to reduce fault-injection costs, and integrating knowledge-graph techniques to combine expert knowledge with system architecture.

References

[1] RobustPeriod: Robust Time-Frequency Mining for Multiple Periodicity Detection (SIGMOD '21).
[2] Prefix: Switch failure prediction in datacenter networks (2018).
[3] Squeeze algorithm for multi-dimensional root cause localization (ISSRE 2019).
[4] iSQUAD for intermittent slow query diagnosis (VLDB 2020).
[5] CloudRanger for cloud-native root cause identification (CCGRID 2018).
[6] PC algorithm for causal discovery (MIT Press 2000).
[7] CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms (CIKM 2021).
