How Alibaba Engineers Fault Governance and Chaos Engineering for E‑commerce

This article recounts the Alibaba middleware team's QCon Beijing 2017 presentation on fault governance and fault‑drill practices. It covers distributed‑system dependency failures, the concept of strong and weak dependencies, the multi‑stage technical evolution of dependency governance, and the design of the team's chaos‑engineering platform for large‑scale e‑commerce.

Alibaba Cloud Developer

May 12, 2017

At QCon Beijing 2017, Alibaba middleware expert Zhou Yang (nickname Zhong Ting) delivered a talk titled “Alibaba E‑commerce Fault Governance and Fault‑Drill Practices,” which was later voted the conference’s star lecture.

The presentation has two parts. The first analyzes classic distributed‑system dependency failures and introduces the governance solutions and their technical evolution; the second discusses the importance of building a "fault‑prevention" facility and the principles behind fault‑drill systems.

Distributed‑system dependency failures and strong/weak dependencies

The speaker defines strong dependencies as those whose downstream failures cause noticeable impact on core business, while weak dependencies do not affect core availability. He illustrates this with experiments on the product‑detail page, showing how removing discount, inventory, or logistics data may be invisible to users (weak), whereas a missing product leads to a clear error (strong).
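
The distinction can be sketched in code. The following is a minimal illustration, not Alibaba's implementation: `Dependency`, `DependencyError`, `render_detail_page`, and `broken` are hypothetical names invented for this example. A weak dependency degrades to a fallback value when it fails, while a strong dependency's failure propagates to the user.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


class DependencyError(Exception):
    """Raised when a downstream call fails (hypothetical error type)."""


@dataclass
class Dependency:
    name: str
    call: Callable[[], Any]
    strong: bool          # True: a failure breaks the core flow
    fallback: Any = None  # Used only when a weak dependency fails


def render_detail_page(deps: List[Dependency]) -> Dict[str, Any]:
    """Assemble the page, degrading weak dependencies on failure."""
    page: Dict[str, Any] = {}
    for dep in deps:
        try:
            page[dep.name] = dep.call()
        except DependencyError:
            if dep.strong:
                raise  # Strong dependency: the user sees an error
            page[dep.name] = dep.fallback  # Weak dependency: hide the failure
    return page


def broken(name: str) -> Callable[[], Any]:
    """Build a call that always fails, simulating a downstream outage."""
    def _call() -> Any:
        raise DependencyError(f"{name} unavailable")
    return _call
```

With a healthy product call and a `broken("discount")` call marked `strong=False`, the page still renders with the discount field set to its fallback; marking the product call as broken and `strong=True` makes `render_detail_page` raise.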

Technical evolution of dependency governance

Three stages are described:

2008–2011: Manual fault injection via code changes, remote debugging, and shell commands, with high cost and coarse granularity.

2012–2014: Introduction of isolated test environments, Selenium‑based recording, distributed tracing, and middleware upgrades, reducing manual effort.

2014 onward: Focus on business impact and system design, implementing strong/weak dependency detection via middleware plugins and a centralized rule service, enabling automated fault injection without code changes.
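
The third stage can be pictured roughly as follows. This is a simplified sketch under stated assumptions, not the actual middleware: `RuleService` is an in-memory stand-in for the centralized rule service, `intercept` plays the role of a middleware plugin wrapped around a client call, and `classify` is a hypothetical helper that blocks one dependency at a time and checks whether the core flow still succeeds. The business code in `core_flow` is never modified to inject a fault.

```python
from typing import Any, Callable, Dict, List, Set


class InjectedFault(Exception):
    """Raised by the interceptor when a rule blocks a dependency."""


class RuleService:
    """In-memory stand-in for a centralized fault-rule service."""

    def __init__(self) -> None:
        self._blocked: Set[str] = set()

    def block(self, dep: str) -> None:
        self._blocked.add(dep)

    def clear(self) -> None:
        self._blocked.clear()

    def is_blocked(self, dep: str) -> bool:
        return dep in self._blocked


def intercept(rules: RuleService, dep: str, call: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a client call the way a middleware plugin might."""
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        if rules.is_blocked(dep):
            raise InjectedFault(dep)
        return call(*args, **kwargs)
    return wrapped


def classify(rules: RuleService, deps: List[str], core_flow: Callable[[], Any]) -> Dict[str, str]:
    """Block each dependency in turn and record whether the core flow survives."""
    report: Dict[str, str] = {}
    for dep in deps:
        rules.block(dep)
        try:
            core_flow()
            report[dep] = "weak"
        except Exception:
            report[dep] = "strong"
        finally:
            rules.clear()  # Always remove the rule, even if the flow failed
    return report


# Example wiring with two hypothetical downstream calls.
rules = RuleService()
get_product = intercept(rules, "product", lambda: {"id": 1})
get_discount = intercept(rules, "discount", lambda: 0.9)


def core_flow() -> Any:
    product = get_product()  # No fallback: a failure propagates
    try:
        discount = get_discount()
    except InjectedFault:
        discount = 1.0  # Degrade to "no discount"
    return product, discount


report = classify(rules, ["product", "discount"], core_flow)
```

Because the rule lives outside the application, the same wrapper can be switched on and off at run time, which is the property that removes the need for code changes.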

Key scenarios for dependency governance include system upgrade acceptance, throttling and degradation strategies, application startup ordering, root‑cause localization, and capacity assessment.
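
One of these scenarios, application startup ordering, lends itself to a small illustration. Assuming the strong dependencies have already been identified (the service names below are invented), a topological sort yields an order in which every application starts after the services it strongly depends on. The sketch uses Python's standard `graphlib` module (Python 3.9+); the talk summary does not say how Alibaba computes the order.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical map: application -> services it cannot start without.
strong_deps: Dict[str, Set[str]] = {
    "detail-page": {"product-service"},
    "product-service": {"database"},
    "database": set(),
}


def startup_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which each service follows its strong dependencies."""
    # static_order() raises graphlib.CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(deps).static_order())


order = startup_order(strong_deps)
```

Weak dependencies are deliberately left out of the graph: by definition an application can start and serve its core flow without them.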

Fault‑drill principles and best practices

The talk references industry incidents (e.g., the AWS S3 outage and the GitLab database deletion) and Netflix’s Simian Army tools (Chaos Monkey, Latency Monkey, Chaos Gorilla) as inspiration for building Alibaba’s own fault‑drill platform, named “MonkeyKing.”

MonkeyKing consists of OS‑level fault plugins for hardware failures, in‑process plugins for application‑level faults, and server‑side controls for distributed faults, with extensible APIs for custom fault scenarios.
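
To make the idea of an extensible fault API concrete, here is a small sketch of what a plugin contract could look like. It is not MonkeyKing's API, which the summary does not describe: `FaultPlugin`, `LatencyFault`, `register`, and `drill` are all hypothetical names. Each plugin knows how to inject and how to recover, and the drill runner guarantees recovery even if the drill body fails.

```python
import time
from abc import ABC, abstractmethod
from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator, Optional


class FaultPlugin(ABC):
    """Contract that every custom fault scenario implements."""

    layer: str = "unspecified"  # e.g. "os", "in-process", "server-side"

    @abstractmethod
    def inject(self, target: Dict[str, Callable[..., Any]]) -> None: ...

    @abstractmethod
    def recover(self, target: Dict[str, Callable[..., Any]]) -> None: ...


class LatencyFault(FaultPlugin):
    """In-process fault that delays every call to target["handler"]."""

    layer = "in-process"

    def __init__(self, delay_s: float, sleep: Callable[[float], None] = time.sleep) -> None:
        self.delay_s = delay_s
        self._sleep = sleep
        self._original: Optional[Callable[..., Any]] = None

    def inject(self, target: Dict[str, Callable[..., Any]]) -> None:
        original = target["handler"]
        self._original = original

        def slow(*args: Any, **kwargs: Any) -> Any:
            self._sleep(self.delay_s)
            return original(*args, **kwargs)

        target["handler"] = slow

    def recover(self, target: Dict[str, Callable[..., Any]]) -> None:
        if self._original is not None:
            target["handler"] = self._original
            self._original = None


REGISTRY: Dict[str, FaultPlugin] = {}


def register(name: str, plugin: FaultPlugin) -> None:
    REGISTRY[name] = plugin


@contextmanager
def drill(name: str, target: Dict[str, Callable[..., Any]]) -> Iterator[FaultPlugin]:
    """Inject the named fault for the duration of the with-block."""
    plugin = REGISTRY[name]
    plugin.inject(target)
    try:
        yield plugin
    finally:
        plugin.recover(target)  # Always undo the fault
```

A drill then reads as `with drill("slow-handler", target): ...`, and the handler is restored when the block exits. The `sleep` parameter is injectable only so that the delay can be observed without actually waiting.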

During the 2017 Double‑11 shopping festival, fault drills were applied across multiple quadrants of traffic, covering pre‑plan validation, alert verification, fault reproduction, disaster‑recovery testing, parameter tuning, fault‑model training, and blue‑team/red‑team exercises.

Future work aims to make drills routine, categorize fault types, and automate intelligent drills based on architecture and business analysis.
