Automated Service Fault Localization System Architecture
The automated service fault localization system ingests massive volumes of real‑time instrumentation data, builds call‑chain graphs, and pinpoints the exact component behind timeouts and other errors. It matches the accuracy of an experienced developer while cutting localization time from minutes to seconds, and it remains simple, fast, and fully automated.
Investigating service issues is routine work for developers, yet it consumes substantial time, and rapid fault resolution is critical to service stability.
The main obstacles are:
Massive volumes of alert information.
Complex call chains.
Complicated investigation processes.
Heavy reliance on individual experience.
These challenges can be addressed by codifying that investigation experience into a model.
Example: an order‑list service depends on the seller, product, and shop services; a timeout on host 127.123.12.12 causes the order list itself to time out.
Key questions include: how to accurately define timeouts and exceptions, how to generate the upstream and downstream call chain, how to pinpoint the responsible component, and how to distinguish timeout, thread‑pool‑full, and unknown errors.
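As a sketch of the first question, distinguishing error types can start from coarse signature matching on the raw error log. The function name and the signature strings below are illustrative assumptions, not the production taxonomy:

```python
# Hedged sketch: classify a failed request by its error-log signature.
# The matched substrings are assumptions, not the real system's rules.
def classify(error_log: str) -> str:
    """Distinguish timeout, thread-pool-full, and unknown errors."""
    log = error_log.lower()
    if "timed out" in log or "timeout" in log:
        return "timeout"
    if "rejectedexecutionexception" in log or "pool is full" in log:
        return "thread_pool_full"
    return "unknown"

print(classify("java.util.concurrent.TimeoutException: call timed out after 3000ms"))
# timeout
```

A real classifier would be driven by the instrumentation's structured error codes rather than string matching, but the output shape is the same: each failed request gets one label that downstream analysis can aggregate on.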
The underlying data instrumentation provided across Alibaba makes these solutions possible; with that data in place, a fully automated fault localization system is feasible.
System Goals
The system must satisfy four goals, which are also its main challenges:
Accuracy (locating as precisely as a developer).
Speed (locating before monitoring alerts).
Simplicity (shortest path from detection to result).
Automation.
Four Modules
Data Collection
Collects and reports massive instrumentation data (up to 80 GB/min) with low latency and extensible metrics, using Alibaba Cloud SLS and custom plugins.
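The "extensible metrics" point can be sketched as a collector where each metric is a plugin and records are shipped in batches. Everything here (the `Collector` class, the plugin callables, the in-memory `shipped` list standing in for the SLS endpoint) is an illustrative assumption, not the actual SLS client API:

```python
# Hedged sketch of a pluggable metric collector with batched reporting.
# The plugin interface and flush policy are assumptions for illustration.
import json

class Collector:
    def __init__(self, flush_size=3):
        self.plugins, self.buffer = [], []
        self.flush_size = flush_size
        self.shipped = []                 # stands in for the remote log endpoint

    def register(self, plugin):
        # Extensibility: adding a metric means registering one more plugin.
        self.plugins.append(plugin)

    def collect(self):
        for plugin in self.plugins:
            self.buffer.append(plugin())
        if len(self.buffer) >= self.flush_size:
            # Batching keeps per-record overhead low at high volume.
            self.shipped.append(json.dumps(self.buffer))
            self.buffer = []

collector = Collector()
collector.register(lambda: {"metric": "rpc_rt_ms", "value": 42})
collector.register(lambda: {"metric": "thread_pool_queue", "value": 7})
for _ in range(3):
    collector.collect()
print(len(collector.shipped))  # 1
```

At 80 GB/min the real pipeline also needs compression and backpressure handling, which this sketch omits.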
Real‑time Computing
Preprocesses data: links requests by unique IDs, cleanses data, and emits events. Challenges: compute latency, multi‑source coordination, data cleaning, storage cost.
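The preprocessing step above (link by unique ID, cleanse, emit events) can be sketched as follows; the record fields and the "drop malformed records" rule are illustrative assumptions:

```python
# Hedged sketch: group raw log records by trace ID, drop dirty records,
# and emit an event for each request chain that contains an error.
from collections import defaultdict

raw_logs = [
    {"trace_id": "t1", "service": "order-list", "status": "timeout"},
    {"trace_id": "t1", "service": "seller", "status": "ok"},
    {"trace_id": "t2", "service": "order-list", "status": "ok"},
    {"trace_id": "t1", "service": None, "status": "ok"},   # dirty record
]

def link_and_clean(logs):
    chains = defaultdict(list)
    for rec in logs:
        if rec["service"] is None:      # cleansing: discard malformed records
            continue
        chains[rec["trace_id"]].append(rec)
    return chains

def emit_events(chains):
    # Only request chains containing at least one error become events.
    return [tid for tid, spans in chains.items()
            if any(s["status"] != "ok" for s in spans)]

events = emit_events(link_and_clean(raw_logs))
print(events)  # ['t1']
```

Healthy chains are filtered out early, which is one way the pipeline keeps downstream compute and storage costs bounded.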
Real‑time Analysis
Generates problem path graphs from events. Challenges: real‑time vs offline topology, data loss, analysis accuracy.
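A minimal sketch of problem-path generation: starting from the alerting service, follow caller-to-callee edges where both ends report errors until the deepest erroring node is reached. The topology, error set, and `problem_path` helper are assumptions for illustration:

```python
# Hedged sketch: derive a problem path through the call topology by walking
# from the alerting service toward the deepest erroring dependency.
def problem_path(edges, errors, start):
    """edges: caller -> list of callees; errors: services currently erroring."""
    path = [start]
    node = start
    while True:
        bad = [c for c in edges.get(node, []) if c in errors]
        if not bad:
            return path                 # deepest erroring node ends the path
        node = bad[0]
        path.append(node)

edges = {"order-list": ["seller", "product", "shop"], "seller": ["seller-db"]}
errors = {"order-list", "seller", "seller-db"}
print(problem_path(edges, errors, "order-list"))
# ['order-list', 'seller', 'seller-db']
```

The real-time vs offline topology challenge shows up here: if `edges` is stale, the walk follows a dependency that no longer exists and the path is wrong, which is why the system must reconcile live call data with offline topology.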
Aggregation Display
Aggregates problem paths in real time to reconstruct the incident scene, balancing query performance, concurrency, and storage cost.
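Aggregation can be sketched as counting identical problem paths across requests so that one dominant path, rather than thousands of raw events, represents the incident. The tuple shape and host names below are illustrative:

```python
# Hedged sketch: aggregate per-request problem paths into an incident view
# by counting identical paths; the dominant path reconstructs the scene.
from collections import Counter

# Each tuple is one request's problem path (service chain ending at a host).
paths = [
    ("order-list", "seller", "host-a"),
    ("order-list", "seller", "host-a"),
    ("order-list", "shop", "host-b"),
]
incident = Counter(paths)
top_path, hits = incident.most_common(1)[0]
print(top_path, hits)  # ('order-list', 'seller', 'host-a') 2
```

Storing counted aggregates instead of raw paths is also how the trade-off among query performance, concurrency, and storage cost gets balanced: queries hit a small pre-aggregated table rather than the event stream.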
Results
Since deployment, fault localization time dropped from 10 minutes to under 5 seconds. Example cases: (1) Xianyu product publish alert resolved in <5 s; (2) homepage slowdown due to single‑machine GC identified instantly.
Conclusion
The system focuses on service stability; future work includes richer data sources, comprehensive event abstraction, and building a knowledge graph for end‑to‑end incident handling.
Xianyu Technology
Official account of the Xianyu technology team