
How Time-Series Decomposition Boosts Microservice Root Cause Localization to 84% Accuracy

This paper presents StudRank, a microservice root‑cause localization method that decomposes call‑chain traces into time‑series, detects anomalies, builds an abnormal propagation graph, and applies a personalized PageRank random‑walk algorithm, achieving 84% top‑1 accuracy and a 97.6% improvement over MicroRCA on public AIOps data.

AsiaInfo Technology: New Tech Exploration

Mar 10, 2023

Abstract

StudRank is a microservice root‑cause localization method that converts call‑chain logs into per‑service latency time‑series, detects anomalous metrics with kernel density estimation and interpolation, builds an anomaly sub‑graph, and ranks suspect nodes using a personalized PageRank random‑walk on a probability transition matrix. On the 2020 AIOps Challenge dataset it achieves 84 % top‑1 accuracy, a 97.6 % relative improvement over MicroRCA.

Method Overview

Data preprocessing: Extract the timestamp, service name, latency, request ID, and upstream ID from each trace; construct parent‑child relationships; generate per‑service latency time‑series and aggregate them by minute.

Metric anomaly detection: For each time‑series, estimate a probability density function from a historical normal window using KDE; flag values outside a confidence interval. Independently compute deviation via linear interpolation; flag large deviations. A metric is anomalous if both detectors agree or if missing values occur.

Root‑cause localization: From the set of anomalous nodes, extract an anomaly sub‑graph from the dynamic service topology. Edge weights are Pearson correlation coefficients between node metrics. Normalize edge weights to obtain a transition matrix P. Define a personalization vector v from anomaly scores. Iterate the personalized PageRank update r ← (1‑α)·Pᵀ·r + α·v, where α is the probability of restarting the walk from v (one minus the usual damping factor), to obtain a ranking r of candidate root causes.

Data Preprocessing Details

Each call‑chain record contains timestamp, service_name, latency, request_id, upstream_id. The pipeline parses these fields, builds a directed call graph, and creates a latency time‑series for every service. Minute‑level averaging smooths noise while preserving fault signatures.
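The preprocessing step can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' pipeline: it assumes timestamps are epoch seconds, latency is in milliseconds, and that upstream_id holds the request_id of the calling span. The function name preprocess and the sample records are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def preprocess(records):
    """Build call-graph edges and per-service minute-level mean latency.

    Assumptions (not stated in the source): `timestamp` is in epoch seconds
    and `upstream_id` is the `request_id` of the calling span (None at the root).
    """
    # Map each span's request_id to the service that handled it.
    service_of = {r["request_id"]: r["service_name"] for r in records}

    edges = set()                # directed (caller, callee) service pairs
    buckets = defaultdict(list)  # (service, minute start) -> latencies
    for r in records:
        parent = service_of.get(r["upstream_id"])
        if parent is not None and parent != r["service_name"]:
            edges.add((parent, r["service_name"]))
        minute = int(r["timestamp"] // 60) * 60  # floor to the minute
        buckets[(r["service_name"], minute)].append(r["latency"])

    # Minute-level averaging: one mean latency per service per minute.
    series = defaultdict(dict)
    for (service, minute), values in buckets.items():
        series[service][minute] = mean(values)
    return edges, dict(series)

# Hypothetical trace records for illustration only.
records = [
    {"timestamp": 0,  "service_name": "gateway", "latency": 120.0, "request_id": "a", "upstream_id": None},
    {"timestamp": 5,  "service_name": "orders",  "latency": 80.0,  "request_id": "b", "upstream_id": "a"},
    {"timestamp": 30, "service_name": "orders",  "latency": 100.0, "request_id": "c", "upstream_id": "a"},
    {"timestamp": 65, "service_name": "orders",  "latency": 400.0, "request_id": "d", "upstream_id": "a"},
]
edges, series = preprocess(records)
```

Here the two "orders" calls in the first minute collapse into a single averaged point, while the slow call in the second minute remains visible as its own sample.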

Metric Anomaly Detection

For a metric m, a sliding normal window W is used to fit a KDE f_m. An observation x is anomalous if f_m(x) < τ, where τ is a low‑density threshold (e.g., a 5 % quantile). The interpolation detector computes the expected value ĥ at the same timestamp by linear interpolation of W; if |x‑ĥ| > δ (δ is a deviation threshold), the metric is also flagged. The final anomaly label is the logical AND of the two detectors, or true when the series contains missing points.
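The two detectors and their combination can be sketched as follows. Several details here are assumptions, since the text does not fix them: the KDE uses Gaussian kernels with Silverman's rule-of-thumb bandwidth, τ is taken as the empirical 5 % quantile of the densities at the window points, and ĥ is the linear interpolation between the two neighbouring samples (their midpoint for evenly spaced minutes). The function names are hypothetical, and in practice a library KDE would likely replace the hand-written one.

```python
import math
from statistics import stdev

def gaussian_kde(window, bandwidth=None):
    """Return a density function fitted to `window` with Gaussian kernels."""
    n = len(window)
    # Silverman's rule of thumb; the bandwidth choice is an assumption.
    h = bandwidth or 1.06 * stdev(window) * n ** (-1 / 5)
    h = max(h, 1e-9)  # guard against a constant window
    norm = 1.0 / (n * h * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - w) / h) ** 2) for w in window)

    return density

def is_anomalous(window, prev_value, x, next_value, delta, q=0.05):
    """Combine the KDE detector and the interpolation detector (logical AND)."""
    if x is None:
        return True  # missing points are treated as anomalous

    # Detector 1: density below tau, the q-quantile of the window's densities.
    density = gaussian_kde(window)
    window_densities = sorted(density(w) for w in window)
    tau = window_densities[int(q * len(window_densities))]
    low_density = density(x) < tau

    # Detector 2: deviation from the value interpolated between neighbours.
    expected = (prev_value + next_value) / 2
    large_deviation = abs(x - expected) > delta

    return low_density and large_deviation

# Hypothetical minute-level latencies (ms) from a normal period.
window = [100.0, 102.0, 98.0, 101.0, 99.0, 103.0, 97.0, 100.0, 101.0, 99.0]
spike = is_anomalous(window, 100.0, 400.0, 101.0, delta=50.0)
normal = is_anomalous(window, 101.0, 100.0, 99.0, delta=50.0)
```

Requiring both detectors to agree trades some recall for fewer false alarms: a value that is rare under the KDE but still within δ of its interpolated neighbours is not flagged.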

Random‑Walk Root‑Cause Localization

Given the anomalous node set A, the algorithm extracts the induced sub‑graph G_A from the full service call graph. For each edge (i,j) in G_A, compute the Pearson correlation ρ_{ij} between the two nodes’ time‑series and set the edge weight w_{ij}=|ρ_{ij}|. Normalize each row to sum to 1 to obtain the transition matrix P. The personalization vector v is proportional to the anomaly scores, with entry v_i for node i. Perform personalized PageRank until convergence: r ← (1‑α)·Pᵀ·r + α·v, where α is the restart probability (one minus the usual damping factor) and the transpose reflects that P is row‑normalized. Nodes with the highest scores in r are reported as root‑cause candidates.
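A dependency-free sketch of this ranking step is shown below. It implements the update r_j ← (1‑α)·Σ_i r_i·P_ij + α·v_j with row-normalized P, which is the standard personalized PageRank form with α as the restart probability. Two choices are assumptions rather than details from the source: edges point from caller to callee, and a node with no outgoing weight hands its probability mass back to v. The names rank_root_causes and pearson, the default α = 0.15, and the sample data are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_root_causes(edges, series, scores, alpha=0.15, tol=1e-10, max_iter=1000):
    """Personalized PageRank over the anomaly sub-graph.

    edges:  iterable of (i, j) pairs between anomalous nodes
    series: node -> list of metric values (equal lengths)
    scores: node -> non-negative anomaly score (defines v)
    """
    nodes = sorted(scores)
    # Edge weight w_ij = |rho_ij|.
    weights = {(i, j): abs(pearson(series[i], series[j])) for i, j in edges}
    out_sum = {i: sum(w for (a, _), w in weights.items() if a == i) for i in nodes}

    total = sum(scores.values())
    v = {i: scores[i] / total for i in nodes}  # personalization vector
    r = dict(v)
    for _ in range(max_iter):
        new = {j: alpha * v[j] for j in nodes}  # restart term
        for i in nodes:
            if out_sum[i] == 0:
                # Assumption: dangling nodes redistribute their mass along v.
                for j in nodes:
                    new[j] += (1 - alpha) * r[i] * v[j]
        for (i, j), w in weights.items():
            if out_sum[i] > 0:
                new[j] += (1 - alpha) * r[i] * w / out_sum[i]  # P_ij = w_ij / sum_k w_ik
        diff = sum(abs(new[j] - r[j]) for j in nodes)
        r = new
        if diff < tol:
            break
    return sorted(nodes, key=lambda n: r[n], reverse=True), r

# Hypothetical three-service chain with correlated latency spikes.
edges = [("gateway", "orders"), ("orders", "db")]
series = {
    "gateway": [100.0, 110.0, 300.0, 320.0],
    "orders":  [50.0, 55.0, 200.0, 210.0],
    "db":      [10.0, 12.0, 150.0, 160.0],
}
scores = {"gateway": 0.2, "orders": 0.3, "db": 0.5}
ranking, r = rank_root_causes(edges, series, scores)
```

With caller-to-callee edges the walker drifts toward downstream services, so in this toy chain the database accumulates the most probability mass and is ranked first.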

Experimental Evaluation

Dataset and Baselines

Experiments use the 2020 AIOps Challenge dataset (real‑world call‑chain traces from a telecom operator) covering network, CPU, and database faults. Baselines: a static latency threshold (1000 ms), MicroRCA (BIRCH‑based anomaly detection + personalized PageRank), and MicroHECL (dynamic service graph + Pearson correlation ranking).

Metrics

Top‑1 accuracy (whether the highest‑ranked node matches the true root cause) is reported for each fault category.
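As a small illustration of the metric, top‑1 accuracy over a set of fault cases can be computed as below. The helper name and the sample rankings are hypothetical and are not taken from the dataset.

```python
def top1_accuracy(rankings, truths):
    """Fraction of cases whose highest-ranked node equals the true root cause."""
    hits = sum(
        1 for ranked, truth in zip(rankings, truths)
        if ranked and ranked[0] == truth
    )
    return hits / len(truths)

# Hypothetical ranked outputs for three fault cases and their true root causes.
rankings = [["db", "orders"], ["orders", "db"], ["gateway"]]
truths = ["db", "db", "gateway"]
accuracy = top1_accuracy(rankings, truths)  # two of three cases hit
```

Computing the same function separately over the cases of each fault category gives the per-category figures.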

Results

StudRank attains 84 % top‑1 accuracy overall, a 97.6 % relative improvement over MicroRCA. Network faults yield lower accuracy due to missing host‑level nodes; CPU faults achieve moderate performance; database faults are hardest because many failures do not manifest in service latency.
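To make the relative-improvement figure concrete, the short calculation below derives the baseline accuracy it implies. It assumes relative improvement is defined as (new − baseline) / baseline; the article does not state the definition or MicroRCA's absolute accuracy, so the result is an inference, not a reported number.

```python
def implied_baseline(accuracy, relative_improvement):
    """Baseline accuracy implied by accuracy = baseline * (1 + relative_improvement)."""
    return accuracy / (1.0 + relative_improvement)

# 84 % top-1 accuracy and a 97.6 % relative improvement, as quoted in the text.
baseline = implied_baseline(0.84, 0.976)
print(f"implied MicroRCA top-1 accuracy: {baseline:.1%}")
```

Under that definition, the quoted numbers correspond to a baseline of roughly 42.5 % top‑1 accuracy, i.e. StudRank nearly doubles it.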

Discussion

The quality of metric anomaly detection strongly influences root‑cause ranking. Sensitivity analysis shows limited impact of KDE bandwidth and interpolation smoothing factor on final accuracy. Future work includes handling sparse call‑chains, reducing computational overhead for very large systems, and incorporating host‑level metrics.

Conclusion

StudRank demonstrates that decomposing call‑chain data into time‑series, applying lightweight KDE‑based anomaly detection, and ranking nodes with personalized PageRank yields highly accurate and scalable microservice root‑cause localization.


