How ByteDance Scales High Availability with Chaos Engineering: From Platform 1.0 to 2.0

This article details the evolution of ByteDance's chaos engineering platform and high-availability practices, covering service types, architectural upgrades, fault-center design, blast-radius control, steady-state algorithms, automated experiments, and future plans for resilient infrastructure.

Volcano Engine Developer Services

May 24, 2021

Introduction

ByteDance operates many apps and services and uses chaos engineering to keep them highly available. This article traces how ByteDance's chaos engineering platform evolved and how the practice feeds into building high-availability systems.

System Governance Team

The system governance team, part of the infrastructure group, owns the closed-loop ecosystem that spans development, integration, release, microservice governance, traffic scheduling, and capacity analysis. It also uses chaos engineering to improve availability.

Service Types

Online services: backend services for Douyin, Xigua Video, etc., running on large-scale Kubernetes PaaS clusters.

Offline services: recommendation model training and big-data report calculations, relying on massive storage and compute.

Infrastructure: provides PaaS capabilities such as compute and storage for all business lines.

High‑Availability Concerns per Service Type

Online services: stateless and containerized, with state kept in external MySQL/Redis storage; easy to scale out and able to fall back on degradation.

Offline services: stateful, long-running jobs that tolerate retries and depend on storage consistency.

Infrastructure: stateful, provides storage and compute, is exposed to network and disk failures, and focuses on data consistency.

Chaos Engineering for Online Services – 1.0

Platform 1.0 was mainly a fault‑injection system.

Chaos Engineering Platform 1.0 Architecture

The platform offered a visual UI for injecting simple faults (e.g., latency, network loss) via agents on host machines.

Measured against the five advanced principles of chaos engineering popularized by Netflix (build a steady-state hypothesis, vary real-world events, run experiments in production, automate experiments to run continuously, and minimize the blast radius), Platform 1.0 fell short on every one:

The steady-state hypothesis was simplistic.

Only basic fault types were supported.

Production experiments were limited.

Automation of experiments was weak.

Scope control and blast-radius management were inadequate.

Chaos Engineering Platform 2.0

In 2019 the platform was upgraded to a true chaos‑engineering system.

Key upgrades:

Architecture upgrade: introduced a fault-center layer to decouple business logic from fault injection.

Fault injection: leveraged Service Mesh sidecars for network-related faults.

Stability model: built a steady-state system using key metrics and machine-learning algorithms to assess stability automatically.

Fault‑Center Architecture

Inspired by Kubernetes, the fault center uses declarative APIs to describe a desired fault state (for example, a network partition between A and B) and controllers to enforce it. It integrates open-source tools such as Chaos Mesh and ChaosBlade alongside custom controllers.
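The declarative pattern described here can be illustrated with a minimal Python sketch. It is not ByteDance's implementation: `FaultSpec`, `FakeInjector`, and `reconcile` are hypothetical names, and the injector only records state instead of calling a real backend such as Chaos Mesh. The point is the control loop, which compares the desired fault set with the observed one and issues only the actions needed to close the gap.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FaultSpec:
    """Desired state: a declarative description of one fault (hypothetical schema)."""
    name: str
    kind: str          # e.g. "network-partition" (illustrative kind name)
    source: str
    destination: str


@dataclass
class FakeInjector:
    """Stand-in for a real injection backend; it only tracks active faults."""
    active: set = field(default_factory=set)

    def inject(self, spec: FaultSpec) -> None:
        self.active.add(spec)

    def recover(self, spec: FaultSpec) -> None:
        self.active.discard(spec)


def reconcile(desired: set, injector: FakeInjector) -> list:
    """Move the observed fault set toward the desired set; return the actions taken."""
    actions = []
    # Faults that are declared but not yet active must be injected.
    for spec in sorted(desired - injector.active, key=lambda s: s.name):
        injector.inject(spec)
        actions.append(("inject", spec.name))
    # Faults that are active but no longer declared must be recovered.
    for spec in sorted(injector.active - desired, key=lambda s: s.name):
        injector.recover(spec)
        actions.append(("recover", spec.name))
    return actions
```

Calling `reconcile` twice with the same desired set is a no-op the second time, which is the property that makes a declarative controller safe to run in a loop: removing a fault from the desired set is enough to trigger recovery.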

Blast-Radius Control

The fault model includes Target, Scope Filter, Dependency, and Action to precisely limit impact.
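One way to picture the four-part fault model is as a small data structure. The sketch below is an assumption-laden illustration: the article names Target, Scope Filter, Dependency, and Action but does not define their fields, so the `cluster` and `percent` attributes, the interpretation in the comments, and the `select_instances` helper are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopeFilter:
    """Assumed filter fields; the real platform may filter on other attributes."""
    cluster: str
    percent: int  # share of matching instances to affect, 0-100


@dataclass(frozen=True)
class Fault:
    target: str         # service under experiment (assumed meaning)
    scope: ScopeFilter  # limits which instances of the target are affected
    dependency: str     # the dependency whose behavior is disturbed (assumed meaning)
    action: str         # what happens to it, e.g. "delay-200ms" (illustrative)


def select_instances(fault: Fault, instances: list, seed: int = 0) -> list:
    """Pick the instances inside the blast radius.

    `instances` is a list of (instance_id, cluster) pairs for the target service.
    """
    matching = sorted(i for i, cluster in instances if cluster == fault.scope.cluster)
    count = len(matching) * fault.scope.percent // 100
    rng = random.Random(seed)  # seeded so a run can be reproduced
    return sorted(rng.sample(matching, count))
```

With ten instances in the filtered cluster and `percent=20`, only two instances receive the fault, and instances in other clusters are never touched.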

Steady‑State System

Algorithms used:

Dynamic time‑series analysis (threshold detection, 3‑Sigma, sparse rules).

AB‑test style stability analysis (Mann‑Whitney U test).

Consistency detection for strong/weak dependencies.
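Two of the techniques named above, the 3-Sigma rule and the Mann-Whitney U test, are standard statistics and can be sketched with the Python standard library alone. This is a generic illustration, not the platform's code: the function names are hypothetical, the U test uses the normal approximation with tie correction and no continuity correction, and in practice a library routine such as SciPy's would usually be preferred.

```python
from statistics import NormalDist, mean, stdev


def three_sigma_outliers(baseline, observed, k=3.0):
    """Return observed points more than k sample standard deviations from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)  # needs at least two baseline points
    return [x for x in observed if abs(x - mu) > k * sigma]


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected variance).

    Returns (U statistic of sample a, approximate p-value).
    """
    n1, n2 = len(a), len(b)
    pooled = sorted((value, idx) for idx, value in enumerate(list(a) + list(b)))
    ranks = [0.0] * (n1 + n2)
    tie_term = 0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for _, idx in pooled[i:j]:
            ranks[idx] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    expected = n1 * n2 / 2
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u1, 1.0  # all values identical, no evidence of a difference
    z = (u1 - expected) / variance ** 0.5
    return u1, 2 * (1 - NormalDist().cdf(abs(z)))
```

In an AB-test style check, `a` could hold a latency metric from a control group of instances and `b` the same metric from the group under fault. A small p-value suggests the fault shifted the distribution. The 3-Sigma check instead flags individual points that stray far from a pre-experiment baseline.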

Automated Experiments

Experiments run without human intervention, injecting faults and evaluating stability, with use cases such as strong/weak dependency analysis.
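The strong/weak dependency use case can be sketched as a loop that injects a fault on each dependency in turn, checks a steady-state metric, and always rolls the fault back. Everything here is an assumption made for illustration: the callbacks, the 0.99 success-rate threshold, and the rule that a dependency is "strong" when the core metric breaks while it is faulted are not taken from the article.

```python
from typing import Callable, Dict, Iterable


def classify_dependencies(
    dependencies: Iterable[str],
    inject: Callable[[str], None],
    recover: Callable[[str], None],
    success_rate: Callable[[], float],
    threshold: float = 0.99,
) -> Dict[str, str]:
    """Fault each dependency in turn and label it 'strong' or 'weak'."""
    results = {}
    for dep in dependencies:
        inject(dep)
        try:
            healthy = success_rate() >= threshold
        finally:
            recover(dep)  # always roll back, even if the measurement raises
        results[dep] = "weak" if healthy else "strong"
    return results


# Demo against a fake service: losing "mysql" breaks requests, losing the cache does not.
broken = set()


def fake_success_rate() -> float:
    return 0.2 if "mysql" in broken else 0.995


result = classify_dependencies(
    ["mysql", "recommend-cache"], broken.add, broken.discard, fake_success_rate
)
```

The `try`/`finally` is the important design choice: an automated experiment that can leave a fault in place after a failed measurement is itself an availability risk.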

Infrastructure Chaos Platform

A dedicated platform supports chaos experiments for offline services and infrastructure, allowing injection of CPU, memory, filesystem, network, and other faults in a safe environment.

Parallel and sequential task execution.

Pause & resume capability.

Master‑slave node identification.

From Chaos Engineering to High‑Availability Construction

High‑availability is quantified by MTTR (Mean Time To Repair), MTBF (Mean Time Between Failures), incident count N, and impact scope S. Reducing S, N, and MTTR while increasing MTBF improves availability.
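The article does not give a formula, but the effect of MTTR and MTBF can be illustrated with the classic steady-state relation, availability = MTBF / (MTBF + MTTR). The sketch below uses that textbook relation with hypothetical function names. It does not model incident count N or impact scope S, which the article treats as separate dimensions.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_downtime_hours(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected downtime per year implied by the availability figure."""
    return (1 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


before = availability(999, 1.0)   # 0.999, roughly 8.76 hours of downtime per year
after = availability(999, 0.5)    # same MTBF, repair time halved
```

Halving MTTR at a fixed MTBF roughly halves expected downtime, which is why fast diagnosis and pre-defined recovery playbooks matter as much as preventing failures.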

Strategies include unit‑level isolation, multi‑datacenter deployment, independent core‑business deployment, asynchronous processing, robust deployment (multi‑active, traffic steering, fallback, runbooks), service governance (timeouts, circuit breakers), comprehensive monitoring, fast diagnosis (AI‑assisted analysis), and pre‑defined recovery playbooks.

Future Plans

Fine‑grained fault capabilities across different layers.

Expand chaos‑engineering scenarios, automate more use cases, lower adoption cost, and build a lightweight platform.

Integrate fault‑budget mechanisms to quantify loss and guide chaos‑engineering investment.
