ShenTan: Automated Fault Localization System for Online Services
ShenTan is an automated fault-localization platform for online services. It pinpoints server-side issues in under five seconds with developer-level accuracy by aggregating real-time metrics, applying a decision-tree model enriched with expert knowledge and dynamic thresholds, and presenting results through an integrated alert and visualization system. Planned work includes broader endpoint coverage and multi-tenant support.
Incidents in online services are often slow and inefficient to diagnose, which wastes engineering time and hurts both user experience and the business. To improve stability and keep development moving, the team needs a system that can locate faults quickly.
ShenTan is an online troubleshooting tool that automatically locates server‑side stability issues and assists rapid fault resolution.
Goals
Accuracy: locate faults with precision comparable to developers.
Speed: provide results earlier than monitoring alerts.
Simplicity: shorten the chain from detection to result, reducing time and cost.
Automation: fully automated workflow without developer involvement.
Competitor analysis
Expert‑experience decision trees: mature but limited by expert bias.
Monitoring‑platform optimization: high‑sensitivity alerts generate many false alarms.
Machine‑learning root‑cause analysis: high accuracy but requires heavy data storage and complex models.
Architecture
The system consists of four modules: data collection, real‑time computation, real‑time analysis, and aggregated display. It builds a decision‑tree model to analyze faults, turning troubleshooting experience into a reusable paradigm.
Decision‑tree model
Atomic nodes include:
Monitoring object – services, hosts, databases, JVM, middleware, logs, network, etc.
Metric – load, CPU, memory, I/O, latency, timeout, log patterns.
Rule – logical condition built from metrics.
Knowledge base – distilled expert experience for root‑cause analysis.
Inference steps: collect and aggregate metrics into a TSDB, locate the root node for the target object, evaluate rules, traverse the tree, and output knowledge‑base results.
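The rule-evaluation and traversal steps above can be sketched as follows. This is a minimal illustration, not ShenTan's actual implementation: the `Node` class, the `diagnose` function, the metric names, and the thresholds are all hypothetical, and the metrics are passed in as a plain dictionary rather than read from a TSDB.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Metrics = Dict[str, float]


@dataclass
class Node:
    """One decision-tree node: a monitored object plus a rule over its metrics."""
    name: str
    rule: Callable[[Metrics], bool]
    conclusion: Optional[str] = None  # hypothetical knowledge-base entry
    children: List["Node"] = field(default_factory=list)


def diagnose(node: Node, metrics: Metrics) -> List[str]:
    """Walk the tree depth-first and collect conclusions from matching paths."""
    if not node.rule(metrics):
        return []  # rule not triggered: prune this subtree
    findings: List[str] = []
    for child in node.children:
        findings.extend(diagnose(child, metrics))
    # Fall back to this node's own conclusion if no child is more specific.
    if not findings and node.conclusion:
        findings.append(node.conclusion)
    return findings


# Illustrative tree for a single service; thresholds are made up.
root = Node(
    name="service",
    rule=lambda m: m.get("error_rate", 0.0) > 0.05,
    conclusion="Service error rate elevated; no specific cause matched",
    children=[
        Node("db", lambda m: m.get("db_latency_ms", 0.0) > 200,
             "Slow database queries"),
        Node("jvm", lambda m: m.get("gc_pause_ms", 0.0) > 500,
             "Long JVM GC pauses"),
    ],
)

print(diagnose(root, {"error_rate": 0.1, "db_latency_ms": 350, "gc_pause_ms": 40}))
```

Because every triggered child contributes its conclusion, the same traversal reports several causes when more than one branch matches.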
Data cleaning & dynamic thresholds
To control data cost (up to 80 GB/min), ShenTan coarse-filters incoming data against custom thresholds and maintains those thresholds automatically from historical data. A second-order exponential smoothing algorithm is planned to improve long-term prediction.
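As a rough sketch of how a threshold could be derived from history with second-order exponential smoothing, the example below uses Brown's double exponential smoothing, one common form of the technique. The source does not say which variant is planned, so the function names, the `alpha` smoothing factor, and the `margin` multiplier are assumptions for illustration only.

```python
from typing import List, Sequence


def brown_forecast(history: Sequence[float], alpha: float = 0.5,
                   horizon: int = 1) -> float:
    """Forecast `horizon` steps ahead with Brown's double exponential smoothing."""
    if not history:
        raise ValueError("history must not be empty")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    s1 = s2 = history[0]  # initialise both smoothed series at the first sample
    for x in history[1:]:
        s1 = alpha * x + (1 - alpha) * s1   # first-order smoothing
        s2 = alpha * s1 + (1 - alpha) * s2  # second-order smoothing
    level = 2 * s1 - s2
    trend = alpha / (1 - alpha) * (s1 - s2)
    return level + horizon * trend


def dynamic_threshold(history: Sequence[float], alpha: float = 0.5,
                      margin: float = 0.2) -> float:
    """Assumed policy: alert threshold = next-step forecast plus a relative margin."""
    return brown_forecast(history, alpha) * (1 + margin)


def coarse_filter(samples: Sequence[float], threshold: float) -> List[float]:
    """Keep only samples above the threshold; everything else is dropped early."""
    return [x for x in samples if x > threshold]


latency_history = [100.0, 102.0, 101.0, 105.0, 110.0]
threshold = dynamic_threshold(latency_history)
print(coarse_filter([98.0, 120.0, 250.0], threshold))
```

Unlike a fixed threshold, the forecast follows a rising or falling trend in the metric, which is the property that makes second-order smoothing attractive for longer-term prediction.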
Topology completion
When data loss occurs, a knowledge‑graph module supplements missing links, improving availability. Specific cases such as thread‑pool‑full exceptions are enriched with expert rules.
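One simple way to picture topology completion is to merge the call edges observed in live data with the dependency edges already recorded in the knowledge graph. The sketch below assumes both are available as adjacency maps; the `complete_topology` function and the service names are hypothetical, and the real module is likely more involved.

```python
from typing import Dict, Set

Topology = Dict[str, Set[str]]  # service name -> set of downstream dependencies


def complete_topology(observed: Topology, known: Topology) -> Topology:
    """Return the observed topology with missing edges filled in from `known`."""
    # Copy first so the caller's observed data is left untouched.
    completed: Topology = {svc: set(deps) for svc, deps in observed.items()}
    for svc, deps in known.items():
        completed.setdefault(svc, set()).update(deps)
    return completed


# Live data lost the order -> inventory edge and everything under payment.
observed = {"order": {"payment"}}
known = {"order": {"payment", "inventory"}, "payment": {"db"}}
print(complete_topology(observed, known))
```

With the gaps filled, the decision tree can still traverse to downstream objects even when part of the live call data is missing.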
Current performance
Typical fault‑location time < 5 seconds.
Integrated with the alert and pre-alert platforms, so root causes are detected within seconds.
Covers faults in downstream dependencies, databases, and containers, as well as single-machine anomalies and multi-cause analysis.
Future outlook
Extend to mobile, H5, and PC endpoints for comprehensive coverage.
Multi‑tenant expansion to serve applications beyond Xianyu.
Enrich DB metrics to pinpoint faulty hosts, IPs, databases, and tables.
Xianyu Technology
Official account of the Xianyu technology team