How ByteDance Scales High Availability with Chaos Engineering: From Platform 1.0 to 2.0
This article details how ByteDance's chaos‑engineering platform and high‑availability practices evolved, covering service types, architectural upgrades, fault‑center design, blast‑radius control, steady‑state algorithms, automated experiments, and future plans for resilient infrastructure.
Introduction
ByteDance operates many apps and services and uses chaos engineering to keep them highly available. This article describes how its chaos‑engineering technology evolved and how those practices feed into building high‑availability systems.
System Governance Team
The system governance team, part of the infrastructure group, is responsible for the closed‑loop ecosystem spanning development, integration, release, microservice governance, traffic scheduling, and capacity analysis. It also uses chaos engineering to improve availability.
Service Types
Online services: backend services for Douyin, Xigua Video, etc., running on large‑scale Kubernetes PaaS clusters.
Offline services: recommendation model training and big‑data report calculations, relying on massive storage and compute.
Infrastructure: provides PaaS capabilities such as compute and storage for all business lines.
High‑Availability Concerns per Service Type
Online services: stateless, run in containers, keep state in external MySQL/Redis storage, are easy to scale, and may use degradation.
Offline services: stateful, run long‑lived jobs, tolerate retries, and depend on storage consistency.
Infrastructure: stateful, provides storage and compute, faces network or disk failures, and focuses on data consistency.
Chaos Engineering for Online Services – 1.0
Platform 1.0 was mainly a fault‑injection system.
The platform offered a visual UI for injecting simple faults (e.g., latency, network loss) via agents on host machines.
It fell short on each of Netflix's five Principles of Chaos:
Steady‑state hypothesis: the hypothesis was simplistic.
Real‑world events: only basic fault types were supported.
Production experiments: these were limited.
Continuous automation: experiment automation was weak.
Blast‑radius control: scope control was inadequate.
Chaos Engineering Platform 2.0
In 2019 the platform was upgraded to a true chaos‑engineering system.
Key upgrades:
Architecture upgrade: introduced a fault‑center layer to decouple business logic from fault injection.
Fault injection: leveraged Service Mesh sidecars for network‑related faults.
Stability model: built a steady‑state system using key metrics and machine‑learning algorithms to assess stability automatically.
Fault‑Center Architecture
Inspired by Kubernetes, the fault center uses declarative APIs to describe desired fault states (e.g., a network partition between A and B) and controllers to enforce them. It integrates open‑source tools such as Chaos Mesh and ChaosBlade alongside custom controllers.
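The declarative pattern described above can be illustrated with a minimal sketch. Everything here (the FaultSpec fields, the FaultController class, and its method names) is hypothetical and not taken from ByteDance's platform; the point is only to show how a controller can compare a declared fault state with the currently injected one and converge the two.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FaultSpec:
    """Desired fault state, e.g. a network partition between two services."""
    name: str
    kind: str          # hypothetical fault kind, e.g. "network-partition"
    source: str
    destination: str


@dataclass
class FaultController:
    """Drives the set of injected faults toward the declared desired set."""
    active: dict = field(default_factory=dict)   # name -> FaultSpec currently injected
    log: list = field(default_factory=list)      # record of actions taken

    def inject(self, spec):
        # A real controller would call an injection backend here (assumption).
        self.active[spec.name] = spec
        self.log.append(("inject", spec.name))

    def recover(self, name):
        del self.active[name]
        self.log.append(("recover", name))

    def reconcile(self, desired):
        """One reconcile pass: recover stale faults, then inject missing ones."""
        wanted = {spec.name: spec for spec in desired}
        for name in list(self.active):
            if name not in wanted or self.active[name] != wanted[name]:
                self.recover(name)
        for name, spec in wanted.items():
            if name not in self.active:
                self.inject(spec)


controller = FaultController()
partition = FaultSpec("a-b-partition", "network-partition", "A", "B")
controller.reconcile([partition])   # fault is injected
controller.reconcile([partition])   # no action: actual state already matches
controller.reconcile([])            # fault is recovered
```

Because the caller only declares what should exist, removing a fault is the same operation as adding one: the desired list changes and the next reconcile pass does the rest.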
Blast‑Radius Control
The fault model includes Target, Scope Filter, Dependency, and Action to precisely limit impact.
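As a rough sketch of how these four parts could fit together, the example below models them as a small data structure. The field types, the Instance attributes, and the select method are assumptions made for illustration; the source names only the four concepts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical description of one running service instance."""
    service: str
    cluster: str
    datacenter: str
    ip: str


@dataclass(frozen=True)
class FaultModel:
    target: str         # service under experiment
    scope_filter: dict  # attribute name -> required value, narrows the target
    dependency: str     # downstream dependency the fault applies to
    action: str         # what to do to that dependency, e.g. "delay 200ms"

    def select(self, fleet):
        """Return only the target's instances that pass every scope filter."""
        return [
            inst for inst in fleet
            if inst.service == self.target
            and all(getattr(inst, key) == value
                    for key, value in self.scope_filter.items())
        ]


fleet = [
    Instance("feed", "canary", "dc1", "10.0.0.1"),
    Instance("feed", "canary", "dc2", "10.0.1.1"),
    Instance("feed", "default", "dc1", "10.0.0.2"),
    Instance("search", "canary", "dc1", "10.0.0.3"),
]
fault = FaultModel(
    target="feed",
    scope_filter={"cluster": "canary", "datacenter": "dc1"},
    dependency="redis",
    action="delay 200ms",
)
affected = fault.select(fleet)   # only the feed canary instance in dc1
```

Each additional filter key can only shrink the selected set, which is what makes the scope filter a convenient lever for limiting impact.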
Steady‑State System
Algorithms used:
Dynamic time‑series analysis (threshold detection, 3‑Sigma, sparse rules).
AB‑test style stability analysis (Mann‑Whitney U test).
Consistency detection for strong/weak dependencies.
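Two of the techniques named above, the 3‑Sigma rule and the Mann‑Whitney U test, can be sketched in plain Python. This is an illustrative version under stated assumptions, not the platform's implementation: the latency samples are made up, and the U test uses the normal approximation without tie or continuity correction, so its p‑value is only a rough guide for small samples.

```python
import math
from statistics import mean, pstdev


def three_sigma_anomalies(baseline, observed, k=3.0):
    """Return observed points more than k standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    return [x for x in observed if abs(x - mu) > k * sigma]


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic for sample a, approximate p-value).
    """
    combined = sorted((value, idx) for idx, value in enumerate(list(a) + list(b)))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1           # average 1-based rank across ties
        for pos in range(i, j + 1):
            ranks[combined[pos][1]] = avg_rank
        i = j + 1
    n1, n2 = len(a), len(b)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2  # first n1 ranks belong to sample a
    mean_u = n1 * n2 / 2
    std_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / std_u
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided tail probability
    return u, p


# Made-up latency samples (ms): a control group and a group under fault injection.
control = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99]
experiment = [120, 125, 118, 130, 122, 127, 119, 124, 121, 126]

spikes = three_sigma_anomalies(control, [101, 99, 120, 104])
u_stat, p_value = mann_whitney_u(control, experiment)
unstable = p_value < 0.01   # hypothetical significance threshold
```

The 3‑Sigma check flags individual points against a baseline, while the U test compares two whole groups without assuming the metric is normally distributed, which suits the AB‑test style comparison.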
Automated Experiments
Experiments run without human intervention, injecting faults and evaluating stability, with use cases such as strong/weak dependency analysis.
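A strong/weak dependency analysis of this kind can be sketched as a loop that injects a fault into one dependency at a time and checks whether the steady state still holds. The function name, its callback parameters, and the toy service are hypothetical; a real run would call the fault center and the steady‑state system instead of the in‑memory stand‑ins used here.

```python
def classify_dependencies(dependencies, inject, recover, is_steady):
    """Fault each dependency in turn and record whether the steady state holds.

    A dependency whose failure breaks the steady state is treated as strong;
    one whose failure the service tolerates is treated as weak.
    """
    result = {}
    for dep in dependencies:
        inject(dep)
        try:
            result[dep] = "weak" if is_steady() else "strong"
        finally:
            recover(dep)   # always undo the fault, even if the check raises
    return result


# Toy stand-in for a real service: it only leaves steady state when "mysql" is down.
broken = set()
report = classify_dependencies(
    ["mysql", "redis-cache", "recommendation"],
    inject=broken.add,
    recover=broken.discard,
    is_steady=lambda: "mysql" not in broken,
)
```

Recovering in a finally block keeps a failed steady‑state check from leaving a fault behind, which matters once no human is watching the run.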
Infrastructure Chaos Platform
A dedicated platform supports chaos experiments for offline services and infrastructure, allowing CPU, memory, filesystem, network, and other faults to be injected in a safe environment. Its capabilities include:
Parallel and sequential task execution.
Pause & resume capability.
Master‑slave node identification.
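The first two capabilities can be illustrated with a small runner that executes stages sequentially, runs the tasks inside a stage in parallel, and can be paused between stages. The ExperimentRun class and its methods are hypothetical names chosen for this sketch, and the pause granularity (one stage) is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


class ExperimentRun:
    """Runs stages in order; tasks within one stage run in parallel threads."""

    def __init__(self, stages):
        self.stages = stages   # list of stages, each a list of zero-argument callables
        self.next_stage = 0
        self.paused = False

    def pause(self):
        """Request a pause; it takes effect once the current stage finishes."""
        self.paused = True

    def resume(self):
        self.paused = False
        self.run()

    def run(self):
        while self.next_stage < len(self.stages) and not self.paused:
            with ThreadPoolExecutor() as pool:
                # list() forces completion and re-raises any task exception.
                list(pool.map(lambda task: task(), self.stages[self.next_stage]))
            self.next_stage += 1


done = []
run = ExperimentRun([
    [lambda: done.append("cpu-burn"), lambda: done.append("disk-fill")],  # parallel
    [lambda: run.pause()],                                                # pause point
    [lambda: done.append("network-loss")],                                # runs after resume
])
run.run()
after_pause = set(done)   # the last stage has not run yet
run.resume()
```

Tracking only the index of the next stage is enough to resume where the run stopped, at the cost of not being able to pause in the middle of a stage.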
From Chaos Engineering to High‑Availability Construction
High availability is quantified by MTTR (Mean Time To Repair), MTBF (Mean Time Between Failures), incident count N, and impact scope S. Reducing S, N, and MTTR while increasing MTBF improves availability.
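The source gives no formula, but the standard steady‑state relation, availability = MTBF / (MTBF + MTTR), shows why both levers matter. The sketch below uses made‑up numbers and does not model N or S, which the article treats as separate dimensions.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers only: roughly one 2-hour incident per month.
baseline = availability(mtbf_hours=720, mttr_hours=2)
faster_repair = availability(mtbf_hours=720, mttr_hours=0.5)   # shorter MTTR
fewer_failures = availability(mtbf_hours=1440, mttr_hours=2)   # longer MTBF
```

Both variations score higher than the baseline, matching the claim that cutting MTTR and raising MTBF each improve availability.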
Strategies include unit‑level isolation, multi‑datacenter deployment, independent core‑business deployment, asynchronous processing, robust deployment (multi‑active, traffic steering, fallback, runbooks), service governance (timeouts, circuit breakers), comprehensive monitoring, fast diagnosis (AI‑assisted analysis), and pre‑defined recovery playbooks.
Future Plans
Fine‑grained fault capabilities across different layers.
Expand chaos‑engineering scenarios, automate more use cases, lower adoption cost, and build a lightweight platform.
Integrate fault‑budget mechanisms to quantify loss and guide chaos‑engineering investment.
Volcano Engine Developer Services
The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers, offering cutting-edge tech content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.
