ShenTan: Automated Fault Localization System for Online Services

ShenTan is an automated fault‑localization platform for online services. It pinpoints server‑side issues in under five seconds with developer‑level accuracy by aggregating real‑time metrics and applying a decision‑tree model enriched with expert knowledge and dynamic thresholds, and it presents results through an integrated alert and visualization system. Planned work includes broader endpoint coverage and multi‑tenant support.

Xianyu Technology

Jul 28, 2020

Incidents in online services often take a long time to diagnose and resolve, which harms user experience and the company's interests. Improving stability and keeping development moving requires a system that can locate faults quickly.

ShenTan is an online troubleshooting tool that automatically locates server‑side stability issues and assists rapid fault resolution.

Goals

Accuracy: locate faults with precision comparable to developers.

Speed: provide results earlier than monitoring alerts.

Simplicity: shorten the chain from detection to result, reducing time and cost.

Automation: fully automated workflow without developer involvement.

Competitor analysis

Expert‑experience decision trees: mature but limited by expert bias.

Monitoring‑platform optimization: high‑sensitivity alerts generate many false alarms.

Machine‑learning root‑cause analysis: high accuracy but requires heavy data storage and complex models.

Architecture

The system consists of four modules: data collection, real‑time computation, real‑time analysis, and aggregated display. It builds a decision‑tree model to analyze faults, turning troubleshooting experience into a reusable paradigm.

Decision‑tree model

Atomic nodes include:

Monitoring object – services, hosts, databases, JVM, middleware, logs, network, etc.

Metric – load, CPU, memory, I/O, latency, timeout, log patterns.

Rule – logical condition built from metrics.

Knowledge base – distilled expert experience for root‑cause analysis.

Inference steps: collect and aggregate metrics into a TSDB, locate the root node for the target object, evaluate rules, traverse the tree, and output knowledge‑base results.
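The traversal described above can be sketched roughly as follows. This is a minimal illustration, not ShenTan's implementation: the `Node` class, the `diagnose` function, the metric names, and the threshold values are all hypothetical, and the metrics are passed in as a plain dictionary rather than read from a TSDB.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Metrics = Dict[str, float]


@dataclass
class Node:
    """One decision-tree node: a rule over metrics plus an optional conclusion."""
    name: str
    rule: Callable[[Metrics], bool]       # logical condition built from metrics
    conclusion: Optional[str] = None      # knowledge-base text for this node
    children: List["Node"] = field(default_factory=list)


def diagnose(node: Node, metrics: Metrics) -> List[str]:
    """Depth-first traversal that follows only branches whose rule fires."""
    if not node.rule(metrics):
        return []
    findings: List[str] = []
    for child in node.children:
        findings.extend(diagnose(child, metrics))
    # Fall back to this node's own conclusion when no deeper cause matched.
    if not findings and node.conclusion:
        findings.append(node.conclusion)
    return findings


# Hypothetical tree for one monitored service (all thresholds are made up).
tree = Node(
    name="service",
    rule=lambda m: m.get("latency_ms", 0.0) > 500 or m.get("error_rate", 0.0) > 0.05,
    conclusion="Service degraded, cause not identified",
    children=[
        Node("database", lambda m: m.get("db_latency_ms", 0.0) > 200,
             "Slow database queries"),
        Node("host", lambda m: m.get("cpu", 0.0) > 0.9,
             "CPU saturation on host"),
    ],
)

sample = {"latency_ms": 800.0, "error_rate": 0.01, "db_latency_ms": 350.0, "cpu": 0.4}
print(diagnose(tree, sample))
```

Because every child whose rule fires contributes a finding, the same traversal can report more than one cause for a single incident.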

Data cleaning & dynamic thresholds

To control data cost (up to 80 GB/min), ShenTan coarse‑filters data against custom thresholds and maintains those thresholds automatically from historical data. A second‑order exponential smoothing algorithm is planned to improve long‑term prediction.
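As a rough illustration of how a threshold might be derived from history, the sketch below uses Brown's double (second‑order) exponential smoothing, one common form of the technique. The article does not say which variant ShenTan plans to use, so the function names, the `alpha` and `margin` parameters, and the rule of keeping only points above the threshold are all assumptions.

```python
from typing import List, Sequence


def brown_forecast(series: Sequence[float], alpha: float = 0.5, horizon: int = 1) -> float:
    """Forecast `horizon` steps ahead with Brown's double exponential smoothing."""
    if not series:
        raise ValueError("series must not be empty")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    s1 = s2 = float(series[0])                   # first- and second-order smoothed values
    for x in series[1:]:
        s1 = alpha * x + (1 - alpha) * s1        # smooth the raw series
        s2 = alpha * s1 + (1 - alpha) * s2       # smooth the smoothed series
    level = 2 * s1 - s2                          # estimated current level
    trend = alpha / (1 - alpha) * (s1 - s2)      # estimated per-step trend
    return level + horizon * trend


def dynamic_threshold(history: Sequence[float], alpha: float = 0.5, margin: float = 0.2) -> float:
    """Assumed policy: the forecast plus a relative safety margin."""
    return brown_forecast(history, alpha) * (1 + margin)


def coarse_filter(points: Sequence[float], threshold: float) -> List[float]:
    """Keep only points above the threshold; the rest are dropped early."""
    return [p for p in points if p > threshold]


history = [10.0, 12.0, 14.0, 16.0, 18.0]
print(brown_forecast(history))                   # follows the upward trend
print(coarse_filter([9.0, 11.0, 13.0, 15.0], dynamic_threshold([10.0, 10.0, 10.0])))
```

The second smoothing pass is what lets the forecast follow a trend instead of lagging behind it, which is the usual reason to prefer the second‑order form over a single pass when predicting further ahead.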

Topology completion

When data loss occurs, a knowledge‑graph module supplements missing links, improving availability. Specific cases such as thread‑pool‑full exceptions are enriched with expert rules.
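One simple way to picture topology completion is to merge the dependency edges actually observed with edges recorded in a knowledge graph, and to mark the supplemented edges as inferred. The sketch below assumes exactly that; the article does not describe the module's internals, so the data shapes, the `complete_topology` function, and the service names are hypothetical.

```python
from typing import Dict, Set, Tuple

Graph = Dict[str, Set[str]]   # service name -> names of its downstream dependencies


def complete_topology(observed: Graph, known: Graph) -> Tuple[Graph, Set[Tuple[str, str]]]:
    """Return the observed graph plus known edges that were missing from it.

    The second return value lists the edges that were inferred rather than
    observed, so later analysis can treat them with less confidence.
    """
    completed: Graph = {svc: set(deps) for svc, deps in observed.items()}  # copy, no mutation
    inferred: Set[Tuple[str, str]] = set()
    for svc, deps in known.items():
        current = completed.setdefault(svc, set())
        for dep in deps:
            if dep not in current:
                current.add(dep)
                inferred.add((svc, dep))
    return completed, inferred


# Hypothetical data: the link from item-service to search-service was lost.
observed = {"item-service": {"item-db"}}
known = {
    "item-service": {"item-db", "search-service"},
    "search-service": {"search-index"},
}
completed, inferred = complete_topology(observed, known)
print(sorted(inferred))
```

Keeping the inferred edges separate is a design choice made here for illustration: it lets a decision‑tree traversal still reach downstream nodes while recording which links were not backed by live data.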

Current performance

Typical fault‑location time < 5 seconds.

Integrated with alert and pre‑alert platforms to detect root causes within seconds.

Supports downstream dependencies, DB, containers, single‑machine anomalies, and multi‑cause analysis.

Future outlook

Extend to mobile, H5, and PC endpoints for comprehensive coverage.

Multi‑tenant expansion to serve applications beyond Xianyu.

Enrich DB metrics to pinpoint faulty hosts, IPs, databases, and tables.
