ByteDance’s Chaos Engineering Practice and Platform Evolution
This article describes ByteDance’s multi‑generation chaos engineering practice, covering industry background, fault‑injection models, the design of a declarative fault‑center, experiment selection principles, detailed experiment processes, metric classifications, red‑blue war‑game workflows, strong/weak dependency analysis, and future directions for infrastructure‑level chaos engineering.
Chaos engineering uses fault injection to expose weak points in distributed systems and thereby improve stability. With the rise of microservices and cloud‑native architectures, system complexity and the unpredictability of failures have increased, making chaos engineering essential.
Industry Practice : Netflix pioneered chaos engineering, publishing the book *Chaos Engineering* and open‑sourcing Chaos Monkey. Alibaba contributed ChaosBlade, PingCAP released Chaos Mesh, and Gremlin offers a commercial chaos‑engineering platform.
ByteDance’s Practice is divided into three generations:
First Generation : A simple disaster‑recovery rehearsal platform that injected network‑level faults and performed threshold‑based metric checks. It lacked extensibility for other fault types and detailed fault‑scope descriptions.
Second Generation : Introduced an extensible fault‑center with a declarative fault model (Target, Scope Filter, Dependency, Action) and separated fault injection from experiment orchestration. The fault‑center consists of API Server, Scheduler, Controller, and etcd storage.
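The separation of declarative specs from execution implies a control loop in the Controller. A minimal sketch, with hypothetical names standing in for the fault‑center's etcd storage and agent calls, of how such a loop could converge actual injected faults toward the desired specs:

```python
# Hypothetical in-memory stand-ins for etcd-stored desired state and agent-reported actual state.
desired_state = {"exp-1": {"action": "cpu_burn", "status": "pending"}}
actual_state = {}

def inject(exp_id, spec):
    """Stand-in for the agent RPC that actually injects the fault on target instances."""
    actual_state[exp_id] = {**spec, "status": "running"}

def reconcile():
    """One pass of a declarative control loop: make actual state match desired state."""
    for exp_id, spec in desired_state.items():
        if actual_state.get(exp_id, {}).get("status") != "running":
            inject(exp_id, spec)
    # Recover faults whose specs have been deleted (experiment ended or aborted).
    for exp_id in list(actual_state):
        if exp_id not in desired_state:
            del actual_state[exp_id]

reconcile()
```

In this style the API Server only validates and persists specs, while the Scheduler decides which agents run them; the Controller's job is purely to reconcile.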
Third Generation : Added automated metric analysis (machine‑learning‑based anomaly detection), strong/weak dependency discovery, and red‑blue war‑game capabilities. The platform now supports richer fault types, automated metric observation, and systematic practice workflows.
Fault Model defines:
Target – the microservice under test.
Scope Filter – the explosion radius (e.g., specific cluster, instance, or traffic slice).
Dependency – the downstream service or resource to affect.
Action – the concrete fault (e.g., CPU burn, latency increase).
Example declarative specification:

```
spec. // microservice A, 10% of instances in cluster1 experience CPU saturation
  target("application A").
  cluster_scope_filter("cluster1").
  percent_scope_filter("10%").
  dependency("cpu").
  action("cpu_burn").
  end_at("2020-04-19 13:36:23")

spec. // microservice B, downstream service C latency +200ms
  target("application B").
  cluster_scope_filter("cluster2").
  dependency("application C").
  action("delay, 200ms").
  end_at("2020-04-19 13:36:23")
```

Experiment Selection Principles include progressing from offline to production, from small to large scope, from past incidents to future scenarios, and from workdays to off‑hours.
Experiment Process consists of pre‑experiment checks (ensure resilience patterns are in place), experiment execution (monitor metrics, adjust scope/intensity as needed), and post‑experiment analysis (identify weak points, validate fallback plans, discover performance thresholds, and prune false alarms).
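The three phases can be sketched as a simple orchestration skeleton; the three callables are hypothetical hooks supplied by the experiment owner, not a real platform API:

```python
def run_experiment(precheck, execute, analyze):
    """Sketch of the experiment flow: pre-check, execution, post-analysis."""
    if not precheck():          # e.g. verify fallbacks and degradation switches exist
        return "aborted: pre-check failed"
    findings = execute()        # inject the fault, watch metrics, tune blast radius
    return analyze(findings)    # weak points, fallback validation, thresholds, false alarms

# Usage with trivial stand-in hooks:
result = run_experiment(lambda: True, lambda: ["latency spike"], lambda f: f)
```

The pre‑check acting as a gate is the key point: if resilience patterns are known to be missing, running the experiment only confirms the obvious at production's expense.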
Metric Classification :
Fault Metrics – confirm fault injection success.
Stop‑Loss Metrics – define safety thresholds to abort experiments.
Observation Metrics – capture detailed system behavior for root‑cause analysis.
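Of the three classes, stop‑loss metrics are the ones wired into automation. A minimal sketch of the abort check, with illustrative metric names and thresholds (not ByteDance's actual ones):

```python
def should_abort(metrics, stop_loss_thresholds):
    """Return the first stop-loss metric that breaches its safety threshold, else None."""
    for name, limit in stop_loss_thresholds.items():
        if metrics.get(name, 0) > limit:
            return name
    return None

# Sampled each tick during an experiment; any breach aborts and rolls back the fault.
breached = should_abort({"error_rate": 0.08}, {"error_rate": 0.05})
ok = should_abort({"error_rate": 0.01}, {"error_rate": 0.05})
```

Fault metrics and observation metrics, by contrast, only need to be recorded: one confirms the injection took effect, the other feeds post‑experiment root‑cause analysis.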
Red‑Blue War‑Game adopts Gremlin’s “chaos gameday” concept, with a structured pre‑game communication flow, execution flow, and post‑game review, helping teams comprehensively assess system resilience.
Strong/Weak Dependency Automation uses machine‑learning‑driven anomaly detection to automatically map service dependencies, now covering core scenarios in Douyin and Volcano.
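The core decision in dependency classification can be illustrated with a simple rule standing in for the ML‑based anomaly detection described above (function name and tolerance are assumptions):

```python
def classify_dependency(baseline_success, success_during_fault, tolerance=0.05):
    """Label a downstream dependency strong or weak for a given caller by comparing
    the caller's success rate before vs. during fault injection on that dependency."""
    drop = baseline_success - success_during_fault
    return "strong" if drop > tolerance else "weak"

# If blocking dependency C barely moves service A's success rate, C is weak for A.
weak_case = classify_dependency(0.999, 0.995)
strong_case = classify_dependency(0.999, 0.80)
```

The production version must additionally separate the fault's effect from background noise, which is where the learned anomaly‑detection model replaces the fixed tolerance used here.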
Future Directions aim at infrastructure‑level chaos (e.g., IaaS‑layer faults on OpenStack), fully automated random experiments within defended targets, and intelligent fault diagnosis by correlating large‑scale fault‑metric data.
Overall, ByteDance’s evolving chaos engineering platform demonstrates a systematic approach to building resilient distributed systems through declarative fault injection, automated metric analysis, and continuous practice refinement.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.