How ByteDance Scales Chaos Engineering with Scenario‑Driven Proactive Experiments
This article explains ByteDance's journey from basic fault‑injection testing to a production‑grade, scenario‑driven proactive chaos engineering platform that automates experiments, defines stability metrics, controls blast radius, and continuously validates service dependencies to improve system resilience.
Background
Chaos engineering originated with Netflix’s Chaos Monkey in 2010, yet adoption is still largely limited to a handful of large enterprises. Distributed, microservice-based architectures raise the bar for reliability, making automated fault-injection experiments a critical capability.
Scenario‑Driven Proactive Experimentation
To move from manual fault-injection testing (FIT) toward full automation, ByteDance defines a transition stage called scenario-driven proactive experimentation. The approach starts from the ultimate reliability goal, derives technical specifications and standards from it, and incrementally guides teams onto a path that satisfies both the “skill” (effectiveness and safety) and “application” (coverage breadth and depth) dimensions of the Chaos Engineering Maturity Model.
Construction Process
The process consists of:
Clarify the final reliability goal (e.g., verify strong vs. weak service dependencies).
Design stage‑specific technical specifications and standards.
Build a generic experiment scenario that the chaos platform can execute with controlled risk.
Iteratively refine the scenario, standards, and automation pipeline.
Key Capabilities Required
Continuous execution of experiments in production with strict blast‑radius control.
Selection of a high‑impact, universally applicable experiment scenario.
Definition of a universal stability metric (e.g., a stability field in response payloads).
Automatic detection of metric deviations during experiments.
Optional automatic termination when the blast radius is confirmed to be within limits.
Automation Principles
Steady-state assumption: Services expose a metric that remains stable under normal operation; any deviation indicates a fault.
Production execution: Experiments run on a small traffic slice (5‑10 QPS) in a canary cluster to limit impact.
Diverse fault injection: Simulate a range of failure types, though the variety required is smaller than in classic FIT.
Minimize blast radius: Use weight‑adjustable canary clusters to isolate affected instances.
Continuous pipeline: Chain goal definition, traffic selection, experiment execution, stability detection, reporting, and feedback into an automated workflow.
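As a rough illustration of how these principles chain together, here is a minimal Go sketch of such a pipeline. The stage names and placeholder bodies are hypothetical and not ByteDance's actual platform API; a real implementation would call the traffic, fault-injection, and detection subsystems from each stage.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Stage is one step of the automated experiment pipeline.
type Stage struct {
	Name string
	Run  func(ctx context.Context) error
}

// runPipeline executes the stages in order and stops at the first failure,
// mirroring the chained workflow described above.
func runPipeline(ctx context.Context, stages []Stage) error {
	for _, s := range stages {
		if err := s.Run(ctx); err != nil {
			return fmt.Errorf("stage %q failed: %w", s.Name, err)
		}
	}
	return nil
}

func main() {
	// Experiments in the article run for 60-120 s; bound the whole run.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// Placeholder stage bodies; each would invoke the corresponding subsystem.
	stages := []Stage{
		{"define goal", func(ctx context.Context) error { return nil }},
		{"select canary traffic (5-10 QPS)", func(ctx context.Context) error { return nil }},
		{"inject fault", func(ctx context.Context) error { return nil }},
		{"detect stability-metric deviation", func(ctx context.Context) error { return nil }},
		{"report and feed back", func(ctx context.Context) error { return nil }},
	}
	if err := runPipeline(ctx, stages); err != nil {
		fmt.Println("experiment aborted:", err)
	}
}
```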
Stability Metric Detection
Initial attempts used machine-learning models, but the data were too sparse (2-4 points per 60-120 s experiment) and the experiments too short for them to work well. The production solution combines multiple statistical rules with thresholds generated dynamically from recent historical data. Noise is filtered by correlating the stability metric with related service metrics; a change is attributed to the experiment only when both move together.
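The article does not name the exact correlation rule, but the noise-filtering idea can be sketched with a simple co-movement check such as a Pearson correlation coefficient. The function names and the 0.8 threshold below are illustrative assumptions, not ByteDance's actual implementation.

```go
package main

import (
	"fmt"
	"math"
)

// pearson returns the Pearson correlation coefficient of two equal-length series.
func pearson(x, y []float64) float64 {
	n := float64(len(x))
	var sx, sy, sxx, syy, sxy float64
	for i := range x {
		sx += x[i]
		sy += y[i]
		sxx += x[i] * x[i]
		syy += y[i] * y[i]
		sxy += x[i] * y[i]
	}
	den := math.Sqrt((sxx - sx*sx/n) * (syy - sy*sy/n))
	if den == 0 {
		return 0
	}
	return (sxy - sx*sy/n) / den
}

// experimentInduced attributes a stability-metric change to the experiment only
// when a related metric (e.g., a downstream error rate) moves with it.
func experimentInduced(stability, related []float64, minCorr float64) bool {
	return math.Abs(pearson(stability, related)) >= minCorr
}

func main() {
	stability := []float64{0.99, 0.98, 0.90, 0.85} // drops during the fault window
	errorRate := []float64{0.01, 0.02, 0.10, 0.15} // rises at the same time
	fmt.Println("experiment-induced:", experimentInduced(stability, errorRate, 0.8))
}
```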
Experiment Reporting and Guarantees
After each run, the platform aggregates:
Inferred strong/weak dependency results.
Execution context (target instance, traffic weight, fault type).
Stability‑metric visualizations.
Detection outcomes.
The report is sent to the service owner for confirmation, comment, or remediation. Confirmed results are fed back to the service‑governance platform. A periodic “guarantee” job re‑executes the experiment to detect any drift in dependency relationships.
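A minimal Go sketch of what such an aggregated report might look like, assuming hypothetical field names; ByteDance's actual report schema is not published.

```go
package main

import (
	"fmt"
	"time"
)

// DependencyKind is the inferred relationship between a caller and a downstream service.
type DependencyKind string

const (
	StrongDependency DependencyKind = "strong" // downstream failure breaks availability
	WeakDependency   DependencyKind = "weak"   // caller degrades gracefully
)

// ExperimentReport aggregates what the article lists: the inferred dependency,
// the execution context, and the detection outcome. Field names are illustrative.
type ExperimentReport struct {
	TargetService  string
	Downstream     string
	TargetInstance string
	TrafficWeight  float64 // share of traffic routed to the canary instance
	FaultType      string
	StartedAt      time.Time
	Duration       time.Duration
	Inferred       DependencyKind
	MetricStable   bool   // stability-metric detection outcome
	OwnerConfirmed bool   // filled in after the service owner reviews the report
	Comment        string // owner feedback, fed back to the governance platform
}

func main() {
	r := ExperimentReport{
		TargetService: "feed-api",
		Downstream:    "ranking-cache",
		FaultType:     "cache_miss",
		Inferred:      WeakDependency,
		MetricStable:  true,
	}
	fmt.Printf("%s -> %s inferred as %s dependency\n", r.TargetService, r.Downstream, r.Inferred)
}
```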
Implementation at ByteDance
The first scenario validates strong versus weak service dependencies. Failure of a weak dependency (e.g., a cache miss) does not affect overall availability, while failure of a strong dependency (e.g., a downstream service outage) does. Detecting these relationships helps prevent incidents caused by improper coupling in high-QPS services.
Technical details include:
Service-level stability field added to response payloads; callers report this field as a metric (see the sketch after this list).
Canary clusters with adjustable weight are used to limit experiment traffic to 5‑10 QPS.
Dynamic threshold generation: recent historical metric curves define acceptable variance; deviations beyond the threshold trigger instability flags.
Multi‑metric correlation: if both the stability metric and a downstream failure metric rise together, the change is considered experiment‑induced, filtering out random noise.
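A minimal sketch of the stability-field idea, assuming a boolean JSON field named stability and a placeholder reportStability helper; the real field name, type, and metrics client are not specified in the article.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Response is a hypothetical service payload that carries a service-level
// stability field alongside business data; the caller reports this field as a metric.
type Response struct {
	Data      json.RawMessage `json:"data"`
	Stability bool            `json:"stability"` // true when the service considers itself healthy
}

// reportStability is where a caller would emit the field to its metrics system;
// here it just prints, since the real metrics client is platform-specific.
func reportStability(service string, stable bool) {
	fmt.Printf("metric: service=%s stability=%v\n", service, stable)
}

func main() {
	raw := []byte(`{"data": {"items": []}, "stability": false}`)
	var resp Response
	if err := json.Unmarshal(raw, &resp); err != nil {
		panic(err)
	}
	reportStability("ranking-cache", resp.Stability)
}
```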
Metric Detection Algorithm
Two detection strategies were evaluated:
A/B-test-style comparison of experiment vs. baseline curves (requires sophisticated traffic slicing and isolation).
Real-time trend analysis using statistical rules (chosen for the current stage).
The chosen method computes a dynamic variance window from recent data, then checks each new data point against this window. If a point exceeds the window, the service is marked unstable. Correlated metric comparison further reduces false positives.
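A minimal Go sketch of the dynamic-window check, under the assumption that the window is the mean plus or minus k standard deviations of recent points; the article does not specify the exact statistical rules or the value of k.

```go
package main

import (
	"fmt"
	"math"
)

// dynamicWindow derives an acceptable range from recent historical points:
// mean +/- k standard deviations. k and the history length are tunable.
func dynamicWindow(history []float64, k float64) (lo, hi float64) {
	n := float64(len(history))
	var sum, sumSq float64
	for _, v := range history {
		sum += v
		sumSq += v * v
	}
	mean := sum / n
	std := math.Sqrt(sumSq/n - mean*mean)
	return mean - k*std, mean + k*std
}

// unstable flags a new data point that falls outside the dynamic window.
func unstable(point, lo, hi float64) bool {
	return point < lo || point > hi
}

func main() {
	history := []float64{0.991, 0.989, 0.990, 0.992, 0.988} // recent stability-metric values
	lo, hi := dynamicWindow(history, 3)

	for _, p := range []float64{0.990, 0.950} { // points observed during the experiment
		fmt.Printf("point=%.3f unstable=%v (window %.4f..%.4f)\n", p, unstable(p, lo, hi), lo, hi)
	}
}
```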
Blast‑Radius Control
ByteDance’s production environment provides a dedicated canary cluster for gray‑release testing. By adjusting the cluster’s weight, the platform can direct 5‑10 QPS of traffic to a single instance, ensuring sufficient signal for stability‑metric detection while keeping impact minimal.
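Proportional weighting is one simple way to derive the canary weight for a given traffic slice. The sketch below assumes the routing layer accepts a fractional weight, which is an assumption about the mechanism rather than a description of ByteDance's traffic platform.

```go
package main

import "fmt"

// canaryWeight returns the traffic weight needed to steer roughly targetQPS
// to a single canary instance, given the service's total inbound QPS.
func canaryWeight(totalQPS, targetQPS float64) float64 {
	if totalQPS <= 0 {
		return 0
	}
	w := targetQPS / totalQPS
	if w > 1 {
		w = 1
	}
	return w
}

func main() {
	// For a 50,000 QPS service, ~8 QPS of experiment traffic needs a weight of 0.016%.
	fmt.Printf("weight = %.6f\n", canaryWeight(50000, 8))
}
```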
Workflow Overview
The end-to-end flow includes goal definition, target selection, fault injection, metric monitoring, result inference, reporting, and feedback; the original article illustrates this architecture and process with diagrams.
References
Netflix Chaos Monkey: https://github.com/Netflix/chaosmonkey
Principles of Chaos Engineering: http://principlesofchaos.org/?lang=ENcontent
ChaosBlade: https://github.com/chaosblade-io/chaosblade
Chaos Mesh: https://github.com/pingcap/chaos-mesh
ByteDance chaos engineering summary: http://mp.weixin.qq.com/s/kZ_sDdrbc-_trVLNCWXyYw