Fault Governance in Distributed Systems: Dependency Failures, Strong/Weak Dependency, and Fault‑Injection Practices
This article presents a comprehensive overview of fault governance in large‑scale distributed systems, covering classic dependency failures, the concept of strong and weak dependencies, experimental observations, the evolution of fault‑injection techniques, and best practices for building reliable fault‑drill platforms.
The talk is divided into two parts: first, an analysis of classic dependency failures in distributed systems, their root causes, and the evolution of mitigation techniques; second, a macro‑level discussion on building a "fault‑prevention" infrastructure and designing a fault‑drill system with guiding principles and best practices.
The speaker, Zhou Yang, shares his experience at Alibaba since 2011, including work on stability products, HTTPS migration, and large‑scale events such as Double‑11, emphasizing the importance of deterministic stability in everyday operations.
Using the Alibaba product‑detail page as a case study, the presentation illustrates how a seemingly simple page depends on dozens to hundreds of downstream services, making it one of the most complex dependency graphs in the company.
Two experiments are described: (1) disabling backend services such as discount, inventory, and logistics, which leaves the page visibly sparser but causes no obvious functional failure; (2) forcing the product‑detail service itself to fail, which produces clear error pages ranging from user‑friendly messages to severe outage indicators. These experiments introduce the notion of "strong" versus "weak" dependencies.
A strong dependency is one whose failure causes noticeable user impact, while the failure of a weak dependency does not affect core business or system availability. The talk defines strong/weak dependencies formally and lists practical scenarios where this classification is valuable, including system migration acceptance, rate‑limit and degradation strategies, application startup ordering, root‑cause analysis, and capacity planning.
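The two experiments above can be sketched as a small classification routine: fail one dependency at a time and check whether the page still renders. This is a minimal illustration in Python, not the speaker's implementation; the names render_detail_page, classify, DependencyError, and item_core are hypothetical, and the dependency list is reduced to the services named in the experiments.

```python
from typing import Callable, Dict, Optional


class DependencyError(Exception):
    """Raised when a downstream call fails (hypothetical error type)."""


def render_detail_page(deps: Dict[str, Callable[[], str]]) -> Dict[str, Optional[str]]:
    # Core item data has no fallback, so its failure propagates to the caller.
    page: Dict[str, Optional[str]] = {"item_core": deps["item_core"]()}
    # Optional modules are degraded: on failure the module is simply left out.
    for name in ("discount", "inventory", "logistics"):
        try:
            page[name] = deps[name]()
        except DependencyError:
            page[name] = None
    return page


def classify(deps: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Fail each dependency in turn and record whether the page still renders."""
    result: Dict[str, str] = {}
    for name in deps:
        def broken(failed: str = name) -> str:
            raise DependencyError(failed)

        faulty = dict(deps, **{name: broken})  # replace exactly one dependency
        try:
            render_detail_page(faulty)
            result[name] = "weak"
        except DependencyError:
            result[name] = "strong"
    return result


# Assumed stand-ins for real downstream services.
DEPS: Dict[str, Callable[[], str]] = {
    "item_core": lambda: "title and price",
    "discount": lambda: "10% off",
    "inventory": lambda: "in stock",
    "logistics": lambda: "ships in 2 days",
}

if __name__ == "__main__":
    for dep, kind in classify(DEPS).items():
        print(dep, kind)
```

The classification is derived from observed behavior rather than from a declared list, which is what makes it usable for checks such as migration acceptance: if a refactor accidentally removes a fallback, the affected dependency shows up as strong.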
The technical evolution of strong/weak dependency handling is traced through three stages: (1) early manual fault injection via code changes, remote debugging, and shell commands; (2) the introduction of automated testing tools (e.g., Selenium) and a "second‑environment" that isolates test traffic; (3) a modern, plug‑in‑based fault‑injection framework that intercepts requests at middleware, applies configurable fault rules, and reports detailed impact data.
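The third stage can be illustrated with a rough sketch of an interception point that consults configurable fault rules before forwarding a call and records each injection for later reporting. Everything here is an assumption made for illustration: FaultRule, FaultInjector, InjectedFault, and the two fault kinds ("exception" and "delay") are hypothetical and do not describe the framework presented in the talk.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


class InjectedFault(Exception):
    """Raised in place of a real downstream failure (hypothetical type)."""


@dataclass
class FaultRule:
    target: str                 # name of the downstream service to affect
    kind: str                   # "exception" or "delay" (assumed fault kinds)
    delay_seconds: float = 0.0  # only used when kind == "delay"


@dataclass
class FaultInjector:
    rules: Dict[str, FaultRule] = field(default_factory=dict)
    hits: List[str] = field(default_factory=list)  # injection log for reporting

    def add_rule(self, rule: FaultRule) -> None:
        self.rules[rule.target] = rule

    def invoke(self, target: str, call: Callable[[], Any]) -> Any:
        """Interception point: apply a matching rule, otherwise pass through."""
        rule = self.rules.get(target)
        if rule is None:
            return call()
        self.hits.append(f"{target}:{rule.kind}")
        if rule.kind == "exception":
            raise InjectedFault(target)
        if rule.kind == "delay":
            time.sleep(rule.delay_seconds)
            return call()
        raise ValueError(f"unknown fault kind: {rule.kind}")


if __name__ == "__main__":
    injector = FaultInjector()
    injector.add_rule(FaultRule(target="inventory", kind="exception"))
    print(injector.invoke("discount", lambda: "10% off"))  # no rule: passes through
    try:
        injector.invoke("inventory", lambda: "in stock")
    except InjectedFault as exc:
        print("injected fault for", exc)
    print(injector.hits)
```

Keeping the rules as data, separate from the interception code, is what lets faults be switched on and off without the code changes, remote debugging, or shell commands of the first stage.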
Examples of the plug‑in architecture and annotation‑driven test cases are shown, demonstrating how a test can be transformed into a strong/weak dependency verification within seconds.
The presentation then shifts to fault‑drill best practices, citing high‑profile incidents (e.g., the AWS S3 outage and Alibaba’s own production failures) alongside established practices such as Netflix’s Chaos Monkey, and explaining how systematic chaos engineering and regular drills can improve resilience.
Alibaba’s internal "MonkeyKing" project (named after the Monkey King) implements a comprehensive fault‑drill platform that injects OS‑level, process‑level, distributed, and third‑party service faults, covering the full fault model.
Finally, future directions are outlined: normalizing fault drills, categorizing fault types, and applying intelligent, data‑driven automation to continuously improve system stability.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
