How ByteDance Scales Chaos Engineering with Scenario‑Driven Proactive Experiments
This article explains ByteDance's journey from basic fault‑injection testing to a production‑grade, scenario‑driven proactive chaos engineering platform that automates experiments, defines stability metrics, controls blast radius, and continuously validates service dependencies to improve system resilience.
Background
Chaos engineering originated with Netflix’s Chaos Monkey in 2010, yet adoption is still largely limited to a handful of large enterprises. Distributed, microservice-based architectures raise the bar for reliability, making automated fault-injection experiments a critical capability.
Scenario‑Driven Proactive Experimentation
To move from manual fault-injection testing (FIT) toward full automation, ByteDance defines a transition stage called scenario-driven proactive experimentation. The approach starts from the ultimate reliability goal, derives technical specifications and standards from it, and incrementally guides teams onto a path that satisfies both the “skill” (effectiveness and safety) and “application” (coverage breadth and depth) dimensions of the Chaos Engineering Maturity Model.
Construction Process
The process consists of:
Clarify the final reliability goal (e.g., verify strong vs. weak service dependencies).
Design stage‑specific technical specifications and standards.
Build a generic experiment scenario that the chaos platform can execute with controlled risk.
Iteratively refine the scenario, standards, and automation pipeline.
Key Capabilities Required
Continuous execution of experiments in production with strict blast‑radius control.
Selection of a high‑impact, universally applicable experiment scenario.
Definition of a universal stability metric (e.g., a stability field in response payloads).
Automatic detection of metric deviations during experiments.
Optional automatic termination when the blast radius is confirmed to be within limits.
Automation Principles
Steady-state assumption: Services expose a metric that remains stable under normal operation; any deviation indicates a fault.
Production execution: Experiments run on a small traffic slice (5‑10 QPS) in a canary cluster to limit impact.
Diverse fault injection: Simulate a range of failure types, though the variety required is smaller than in classic FIT.
Minimize blast radius: Use weight‑adjustable canary clusters to isolate affected instances.
Continuous pipeline: Chain goal definition, traffic selection, experiment execution, stability detection, reporting, and feedback into an automated workflow.
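As a rough illustration of how these principles chain together, here is a minimal Go sketch of such a pipeline. The stage names and placeholder bodies are hypothetical and not ByteDance's actual platform API; a real implementation would call the traffic, fault-injection, and detection subsystems from each stage.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Stage is one step of the automated experiment pipeline.
type Stage struct {
	Name string
	Run  func(ctx context.Context) error
}

// runPipeline executes the stages in order and stops at the first failure,
// mirroring the chained workflow described above.
func runPipeline(ctx context.Context, stages []Stage) error {
	for _, s := range stages {
		if err := s.Run(ctx); err != nil {
			return fmt.Errorf("stage %q failed: %w", s.Name, err)
		}
	}
	return nil
}

func main() {
	// Experiments in the article run for 60-120 s; bound the whole run.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// Placeholder stage bodies; each would invoke the corresponding subsystem.
	stages := []Stage{
		{"define goal", func(ctx context.Context) error { return nil }},
		{"select canary traffic (5-10 QPS)", func(ctx context.Context) error { return nil }},
		{"inject fault", func(ctx context.Context) error { return nil }},
		{"detect stability-metric deviation", func(ctx context.Context) error { return nil }},
		{"report and feed back", func(ctx context.Context) error { return nil }},
	}
	if err := runPipeline(ctx, stages); err != nil {
		fmt.Println("experiment aborted:", err)
	}
}
```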
Stability Metric Detection
Initial attempts used machine-learning models, but the data were too sparse (2-4 points per 60-120 s experiment) and the experiments too short for them to work well. The production solution combines multiple statistical rules with thresholds generated dynamically from recent historical data. Noise is filtered by correlating the stability metric with related service metrics; a change is attributed to the experiment only when both move together.
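The article does not name the exact correlation rule, but the noise-filtering idea can be sketched with a simple co-movement check such as a Pearson correlation coefficient. The function names and the 0.8 threshold below are illustrative assumptions, not ByteDance's actual implementation.

```go
package main

import (
	"fmt"
	"math"
)

// pearson returns the Pearson correlation coefficient of two equal-length series.
func pearson(x, y []float64) float64 {
	n := float64(len(x))
	var sx, sy, sxx, syy, sxy float64
	for i := range x {
		sx += x[i]
		sy += y[i]
		sxx += x[i] * x[i]
		syy += y[i] * y[i]
		sxy += x[i] * y[i]
	}
	den := math.Sqrt((sxx - sx*sx/n) * (syy - sy*sy/n))
	if den == 0 {
		return 0
	}
	return (sxy - sx*sy/n) / den
}

// experimentInduced attributes a stability-metric change to the experiment only
// when a related metric (e.g., a downstream error rate) moves with it.
func experimentInduced(stability, related []float64, minCorr float64) bool {
	return math.Abs(pearson(stability, related)) >= minCorr
}

func main() {
	stability := []float64{0.99, 0.98, 0.90, 0.85} // drops during the fault window
	errorRate := []float64{0.01, 0.02, 0.10, 0.15} // rises at the same time
	fmt.Println("experiment-induced:", experimentInduced(stability, errorRate, 0.8))
}
```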
Experiment Reporting and Guarantees
After each run, the platform aggregates:
Inferred strong/weak dependency results.
Execution context (target instance, traffic weight, fault type).
Stability‑metric visualizations.
Detection outcomes.
The report is sent to the service owner for confirmation, comment, or remediation. Confirmed results are fed back to the service‑governance platform. A periodic “guarantee” job re‑executes the experiment to detect any drift in dependency relationships.
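A minimal Go sketch of what such an aggregated report might look like, assuming hypothetical field names; ByteDance's actual report schema is not published.

```go
package main

import (
	"fmt"
	"time"
)

// DependencyKind is the inferred relationship between a caller and a downstream service.
type DependencyKind string

const (
	StrongDependency DependencyKind = "strong" // downstream failure breaks availability
	WeakDependency   DependencyKind = "weak"   // caller degrades gracefully
)

// ExperimentReport aggregates what the article lists: the inferred dependency,
// the execution context, and the detection outcome. Field names are illustrative.
type ExperimentReport struct {
	TargetService  string
	Downstream     string
	TargetInstance string
	TrafficWeight  float64 // share of traffic routed to the canary instance
	FaultType      string
	StartedAt      time.Time
	Duration       time.Duration
	Inferred       DependencyKind
	MetricStable   bool   // stability-metric detection outcome
	OwnerConfirmed bool   // filled in after the service owner reviews the report
	Comment        string // owner feedback, fed back to the governance platform
}

func main() {
	r := ExperimentReport{
		TargetService: "feed-api",
		Downstream:    "ranking-cache",
		FaultType:     "cache_miss",
		Inferred:      WeakDependency,
		MetricStable:  true,
	}
	fmt.Printf("%s -> %s inferred as %s dependency\n", r.TargetService, r.Downstream, r.Inferred)
}
```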
Implementation at ByteDance
The first scenario validates strong versus weak service dependencies. Failure of a weak dependency (e.g., a cache miss) does not affect overall availability, while failure of a strong dependency (e.g., a downstream service outage) does. Detecting these relationships helps prevent incidents caused by improper coupling in high-QPS services.
Technical details include:
Service-level stability field added to response payloads; callers report this field as a metric (see the sketch after this list).
Canary clusters with adjustable weight are used to limit experiment traffic to 5‑10 QPS.
Dynamic threshold generation: recent historical metric curves define acceptable variance; deviations beyond the threshold trigger instability flags.
Multi‑metric correlation: if both the stability metric and a downstream failure metric rise together, the change is considered experiment‑induced, filtering out random noise.
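A minimal sketch of the stability-field idea, assuming a boolean JSON field named stability and a placeholder reportStability helper; the real field name, type, and metrics client are not specified in the article.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Response is a hypothetical service payload that carries a service-level
// stability field alongside business data; the caller reports this field as a metric.
type Response struct {
	Data      json.RawMessage `json:"data"`
	Stability bool            `json:"stability"` // true when the service considers itself healthy
}

// reportStability is where a caller would emit the field to its metrics system;
// here it just prints, since the real metrics client is platform-specific.
func reportStability(service string, stable bool) {
	fmt.Printf("metric: service=%s stability=%v\n", service, stable)
}

func main() {
	raw := []byte(`{"data": {"items": []}, "stability": false}`)
	var resp Response
	if err := json.Unmarshal(raw, &resp); err != nil {
		panic(err)
	}
	reportStability("ranking-cache", resp.Stability)
}
```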
Metric Detection Algorithm
Two detection strategies were evaluated:
A/B-test-style comparison of experiment vs. baseline curves (requires sophisticated traffic slicing and isolation).
Real-time trend analysis using statistical rules (chosen for the current stage).
The chosen method computes a dynamic variance window from recent data, then checks each new data point against this window. If a point exceeds the window, the service is marked unstable. Correlated metric comparison further reduces false positives.
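A minimal Go sketch of the dynamic-window check, under the assumption that the window is the mean plus or minus k standard deviations of recent points; the article does not specify the exact statistical rules or the value of k.

```go
package main

import (
	"fmt"
	"math"
)

// dynamicWindow derives an acceptable range from recent historical points:
// mean +/- k standard deviations. k and the history length are tunable.
func dynamicWindow(history []float64, k float64) (lo, hi float64) {
	n := float64(len(history))
	var sum, sumSq float64
	for _, v := range history {
		sum += v
		sumSq += v * v
	}
	mean := sum / n
	std := math.Sqrt(sumSq/n - mean*mean)
	return mean - k*std, mean + k*std
}

// unstable flags a new data point that falls outside the dynamic window.
func unstable(point, lo, hi float64) bool {
	return point < lo || point > hi
}

func main() {
	history := []float64{0.991, 0.989, 0.990, 0.992, 0.988} // recent stability-metric values
	lo, hi := dynamicWindow(history, 3)

	for _, p := range []float64{0.990, 0.950} { // points observed during the experiment
		fmt.Printf("point=%.3f unstable=%v (window %.4f..%.4f)\n", p, unstable(p, lo, hi), lo, hi)
	}
}
```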
Blast‑Radius Control
ByteDance’s production environment provides a dedicated canary cluster for gray‑release testing. By adjusting the cluster’s weight, the platform can direct 5‑10 QPS of traffic to a single instance, ensuring sufficient signal for stability‑metric detection while keeping impact minimal.
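Proportional weighting is one simple way to derive the canary weight for a given traffic slice. The sketch below assumes the routing layer accepts a fractional weight, which is an assumption about the mechanism rather than a description of ByteDance's traffic platform.

```go
package main

import "fmt"

// canaryWeight returns the traffic weight needed to steer roughly targetQPS
// to a single canary instance, given the service's total inbound QPS.
func canaryWeight(totalQPS, targetQPS float64) float64 {
	if totalQPS <= 0 {
		return 0
	}
	w := targetQPS / totalQPS
	if w > 1 {
		w = 1
	}
	return w
}

func main() {
	// For a 50,000 QPS service, ~8 QPS of experiment traffic needs a weight of 0.016%.
	fmt.Printf("weight = %.6f\n", canaryWeight(50000, 8))
}
```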
Workflow Overview
The end-to-end flow includes goal definition, target selection, fault injection, metric monitoring, result inference, reporting, and feedback; the original article illustrates this architecture and process with diagrams.
References
Netflix Chaos Monkey: https://github.com/Netflix/chaosmonkey
Principles of Chaos Engineering: http://principlesofchaos.org/?lang=ENcontent
ChaosBlade: https://github.com/chaosblade-io/chaosblade
Chaos Mesh: https://github.com/pingcap/chaos-mesh
ByteDance chaos engineering summary: http://mp.weixin.qq.com/s/kZ_sDdrbc-_trVLNCWXyYw