Fault Drills and Chaos Engineering Practices for Enhancing System Stability
The initiative introduces fault‑drill and chaos‑engineering practices—defining steady‑state metrics, injecting real‑world failures in controlled experiments, automating continuous production tests, and limiting blast radius—to detect weaknesses early, accelerate fault location and recovery, boost emergency response metrics, and foster a resilient engineering culture.
Overview: A fault-drill initiative was launched to improve system stability; to accelerate fault detection, location, and recovery; and to establish an executable, easy-to-operate fault-drill specification that can be generalized and reused by other teams.
1. Significance of Fault Drills
1.1 Chaos Engineering – Before discussing fault drills, it is worth understanding chaos engineering. Netflix introduced Chaos Monkey in 2012 and popularized the concept. Chaos engineering is the practice of running carefully planned experiments to expose system weaknesses. In simple terms, it injects abnormal disturbances into a system running in a steady state, observes how the system responds, and derives countermeasures so that when similar disturbances occur in the future, their impact is minimal in both scope and duration.
Chaos engineering is not a one‑off experiment; it follows a cyclic improvement mechanism based on the “Principles of Chaos Engineering,” which emphasize five basic elements:
1) Establish a hypothesis of steady-state system operation. A measurable definition of "steady state" is required, e.g., throughput (TPS), response time or latency, and error rate (a sketch of such a check appears after this list).
2) Diversify real‑world events. Variables reflect real‑world timing and include hardware failures, software errors, traffic spikes, or scaling events that can break steady state.
3) Run experiments in production. Because system behavior varies with the environment and traffic pattern, sampling real traffic is the only reliable way to capture actual request paths; chaos engineering therefore recommends experimenting directly on live production traffic so that results stay relevant.
4) Continuous automated execution. Manual experiments are labor‑intensive and unsustainable; automation drives orchestration and analysis.
5) Minimize blast radius. Experiments should avoid unnecessary pain to customers; the impact must be controlled and limited.
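To make element 1 concrete, here is a minimal sketch of what a measurable steady-state check could look like. The metric names, baselines, and tolerances are illustrative assumptions, not figures from the drills described here.

    # Minimal sketch of a steady-state check; metric names, baselines, and
    # tolerances are illustrative assumptions, not the team's actual values.
    from dataclasses import dataclass

    @dataclass
    class SteadyStateMetric:
        name: str
        baseline: float   # value observed during normal operation
        tolerance: float  # allowed relative deviation, e.g. 0.05 = 5%

        def holds(self, observed: float) -> bool:
            """True if the observed value stays within tolerance of the baseline."""
            return abs(observed - self.baseline) <= self.baseline * self.tolerance

    # Hypothetical steady-state definition for one core link.
    STEADY_STATE = [
        SteadyStateMetric("tps", baseline=1200.0, tolerance=0.05),
        SteadyStateMetric("p99_latency_ms", baseline=250.0, tolerance=0.10),
        SteadyStateMetric("error_rate", baseline=0.001, tolerance=0.50),
    ]

    def hypothesis_holds(observations: dict) -> bool:
        """The steady-state hypothesis holds only if every metric stays in range."""
        return all(m.holds(observations[m.name]) for m in STEADY_STATE)

During a drill, the same check would be evaluated for both the control group and the experimental group.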
2. Fault‑Drill Implementation Path
2.1 Core-Link Degradation Prevention – The drill starts with protecting core links from degradation introduced by routine iterations. Preparatory work includes the following (an illustrative sketch follows this list):
Cataloguing core frontend actions.
Mapping core business scenarios.
Defining the core (S1) service level.
Strictly controlling promotion and demotion of services into/out of S1.
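The catalogue of core links and the S1 gate could be represented as simply as the sketch below; the link names, owners, and approval rule are hypothetical placeholders, not the team's real inventory or process.

    # Hypothetical catalogue of core links; names, owners, and the approval
    # rule are placeholders for the team's real inventory and process.
    from typing import Optional

    CORE_LINKS = {
        "create_order":    {"level": "S1", "owner": "trade-team"},
        "dispatch_driver": {"level": "S1", "owner": "dispatch-team"},
        "query_coupon":    {"level": "S2", "owner": "marketing-team"},
    }

    def promote_to_s1(link: str, approved_by: Optional[str]) -> None:
        """Promotion into S1 is gated on an explicit approver, mirroring the
        'strictly controlled promotion and demotion' rule above."""
        if approved_by is None:
            raise PermissionError(f"promoting {link} to S1 requires an approver")
        CORE_LINKS[link]["level"] = "S1"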
Following the five chaos‑engineering principles, the experiment proceeds in four steps:
1) Define “steady state” as measurable indicators of normal system operation. A core‑link regression case is used as the steady‑state metric.
2) Assume the steady state persists in the control group (no injected faults) throughout the experiment.
3) Introduce variables that reflect real-world events (e.g., server crashes, timeouts). Currently only crash events are injected; future work will add timeouts, null-pointer exceptions (NPEs), and other fault types.
4) Compare the control and experimental groups and try to refute the hypothesis that the steady state persists. The harder the hypothesis is to refute, the greater the confidence in the system's behavior. When a weakness is discovered, it becomes a concrete target for improvement before it manifests in production (a sketch of this comparison follows).
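The comparison in step 4 can be sketched as below. The regression runner and the fault-injection hooks are passed in as callables because the actual tooling is not described here; all three are assumptions for illustration only.

    from typing import Callable

    def run_drill(
        run_regression: Callable[[str], bool],  # runs the core-link regression case for a group
        inject_crash: Callable[[], None],       # stands in for the crash-injection tooling
        remove_fault: Callable[[], None],       # stands in for cleanup of the injected fault
    ) -> str:
        """Compare the control group (no fault) with the experimental group (crash injected)."""
        control_ok = run_regression("control")
        inject_crash()
        try:
            experiment_ok = run_regression("experiment")
        finally:
            remove_fault()  # always restore the system, limiting the blast radius

        if not control_ok:
            return "control group unstable: fix the environment before drawing conclusions"
        if experiment_ok:
            return "steady-state hypothesis not refuted: confidence in the system increases"
        return "hypothesis refuted: the crash breaks the core link and marks a weakness to fix"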
Because the current infrastructure does not fully meet the five principles, certain gaps exist:
Experiments are not yet run in production.
Automation is not fully realized; experiments are performed manually per iteration.
Nevertheless, the team validates the steady‑state hypothesis by ensuring the regression case passes completely, thereby confirming system stability.
2.2 Improving Emergency Response Efficiency – The goal is to improve the 5-5-10 metric, which tracks how quickly monitoring raises an alert, how quickly the fault is located, and how quickly service is recovered (a worked example of this readout follows the list below). The experiment again follows the five chaos-engineering elements:
Select measurable outputs (core business KPIs, TPS, error rates) as the steady‑state baseline.
Inject diversified real‑world events (currently crashes, later timeouts and NPEs).
Conduct experiments directly in production to reflect true system behavior.
Maintain continuous, regular fault experiments (automation is still in progress).
Minimize blast radius by limiting fault injection to a few selected scenarios and confining impact to test traffic.
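Assuming "5-5-10" means five minutes to alert, five more to locate, and ten more to recover (the source does not spell out the exact targets), the readout for a single drill could be computed as in this sketch.

    from datetime import datetime, timedelta

    # Assumed targets: alert within 5 min of injection, locate within a further
    # 5 min, recover within a further 10 min of locating the fault.
    TARGETS = {
        "alert": timedelta(minutes=5),
        "locate": timedelta(minutes=5),
        "recover": timedelta(minutes=10),
    }

    def evaluate_5_5_10(injected: datetime, alerted: datetime,
                        located: datetime, recovered: datetime) -> dict:
        """Return each phase's duration and whether it met its assumed target."""
        durations = {
            "alert": alerted - injected,
            "locate": located - alerted,
            "recover": recovered - located,
        }
        return {phase: {"took": durations[phase], "met": durations[phase] <= TARGETS[phase]}
                for phase in TARGETS}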
2.3 Promoting Chaos‑Engineering Culture
Activities include publishing S1 offline fault‑drill operation standards, Green‑Pass system fault‑drill procedures, and team knowledge‑sharing sessions on processes and tooling.
3. Recommendations for the Drill Process
Since fault drills compare control and experimental groups, it is essential to control environmental variables, perform pre‑checks, and ensure that any observed differences stem solely from the injected events.
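A minimal pre-check sketch follows, assuming baseline metrics are collected from both groups before anything is injected; the metric names and the tolerance are illustrative.

    def precheck(control_baseline: dict, experiment_baseline: dict,
                 rel_tolerance: float = 0.05) -> list:
        """Return the metrics whose baselines differ by more than the tolerance;
        a non-empty result means the drill should not start yet."""
        mismatches = []
        for name, control_value in control_baseline.items():
            experiment_value = experiment_baseline.get(name)
            if experiment_value is None:
                mismatches.append(name)
            elif abs(experiment_value - control_value) > abs(control_value) * rel_tolerance:
                mismatches.append(name)
        return mismatches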
Post‑experiment validation must guarantee that neither online users (real ride‑hailing customers) nor offline testers are adversely affected.
