How ByteDance Scales Disaster Recovery: From Single Data Center to Multi‑Region Active‑Active
This article details ByteDance’s disaster‑recovery evolution—from a single‑room deployment to same‑city multi‑data‑center setups and finally to active‑active multi‑region architectures—explaining the challenges, specific failure scenarios, and the strategic practices used to ensure continuous service during outages.
Evolution Path of Disaster Recovery Architecture
Broadly speaking, disaster recovery is the business‑continuity plan for tolerating failures: it demands rapid fault tolerance and failover, and it spans both routine resilience engineering and periodic drill verification.
The author shares ByteDance’s disaster‑recovery practice, divided into three parts: the evolution path, the concrete practices, and an overview of implementation status.
Domestic Disaster Recovery Construction
ByteDance’s domestic disaster‑recovery architecture has progressed through three stages: single‑data‑center, same‑city multi‑data‑center, and the current active‑active multi‑region mode.
Single‑Data‑Center: In the early stage, ByteDance used a single data center in North China. Rapid business growth hit its resource ceiling in 2018, prompting a shift to a same‑city dual‑data‑center setup.
In 2019 a major incident—a fiber cut caused by road construction—exposed shortcomings: the control plane was not independently deployed, and the dual‑data‑center design lacked true disaster redundancy, leaving insufficient resources for failover.
Same‑City Multi‑Data‑Center: ByteDance adopted full‑mesh IDC interconnection and separated the control and data planes. Placing master nodes per business reduced pressure on any single site, though it increased disaster‑recovery complexity. Two key failure scenarios were highlighted:
Fiber cut : Full‑mesh routing enables traffic detour when a single fiber fails.
AZ (Availability Zone) outage : Traffic from an unavailable AZ is proportionally shifted to healthy AZs.
Planning must include inter‑site connectivity, layered degradation capabilities, and thorough pre‑assessment.
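The AZ‑outage scenario above—proportionally shifting a failed AZ's traffic to healthy AZs—can be sketched as a simple re‑weighting step. This is an illustrative model, not ByteDance's actual traffic‑scheduling code; the AZ names and weights are hypothetical.

```python
def reshift_traffic(weights: dict[str, float], failed_az: str) -> dict[str, float]:
    """Redistribute the failed AZ's traffic share across the healthy AZs,
    proportionally to each healthy AZ's existing weight."""
    if failed_az not in weights:
        return dict(weights)
    lost = weights[failed_az]
    healthy = {az: w for az, w in weights.items() if az != failed_az}
    total = sum(healthy.values())
    # Each healthy AZ absorbs a share of the lost traffic proportional
    # to its current weight, so relative ratios are preserved.
    return {az: w + lost * (w / total) for az, w in healthy.items()}

# Illustrative: three AZs in one city; az-c goes down.
shares = {"az-a": 0.4, "az-b": 0.35, "az-c": 0.25}
new_shares = reshift_traffic(shares, "az-c")
```

In practice the shift is also gated by the spare capacity of each healthy AZ, which is why the article stresses pre‑assessment of resources before an incident occurs.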
Active‑Active Multi‑Region: ByteDance measured the RTT between East and North China at over 30 ms, which makes strictly strongly consistent services unsuitable for cross‑region deployment. Read‑intensive workloads (e.g., news feeds) use active‑active multi‑region, while e‑commerce follows conventional patterns. When an inter‑region fiber is cut, offline services are degraded first; if no offline workloads can be shed, online services undergo layered degradation and possible traffic rerouting. When an AZ is unavailable, spare resources are first evaluated within the same city; if they are insufficient, traffic is shifted across regions, with careful handling of unit‑level data consistency.
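The AZ‑failure decision described above—absorb within the same city first, then fall back to a cross‑region shift, and finally to layered degradation—can be sketched as a capacity check. All names, fields, and capacity figures here are hypothetical, under the assumption that spare capacity is known per AZ.

```python
from dataclasses import dataclass

@dataclass
class AZ:
    name: str
    region: str
    city: str
    spare_capacity: float  # headroom, e.g. in requests/s

def plan_failover(azs: list[AZ], failed: AZ, demand: float) -> str:
    """Pick a recovery action for a failed AZ's traffic demand."""
    # Step 1: can the remaining AZs in the same city absorb the load?
    same_city = [az for az in azs
                 if az.city == failed.city and az.name != failed.name]
    if sum(az.spare_capacity for az in same_city) >= demand:
        return "shift-within-city"
    # Step 2: otherwise, shift across regions. Unit-level data
    # consistency must be reconciled before serving from the new region.
    other_regions = [az for az in azs if az.region != failed.region]
    if sum(az.spare_capacity for az in other_regions) >= demand:
        return "shift-cross-region"
    # Step 3: not enough capacity anywhere: degrade layer by layer.
    return "layered-degradation"
```

Real schedulers weigh many more signals (latency budgets, storage replica placement, degradation cost per business line), but the ordering of the checks mirrors the policy the article describes.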
Key Points of Disaster‑Recovery Implementation
Architecture Design : Design is based on real incidents, incorporating industry lessons and internal accident analysis to create a resilient architecture and rapid escape mechanisms.
Plan Construction : Requires coordination across infrastructure, platform components, and business layers, integrating traffic management, configuration, and service control capabilities.
Disaster‑Recovery Drills : Regular drills validate the plan’s effectiveness; ByteDance’s drill evolution moved from full‑on‑site participation to online, then to red‑team/blue‑team confrontations, aiming for low‑cost, high‑efficiency verification.
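A red‑team/blue‑team drill of the kind described above boils down to injecting faults into a dependency and verifying that the fallback path keeps the service responding. The sketch below is a minimal, hypothetical fault‑injection hook; the class name, error rate, and fallback semantics are illustrative, not ByteDance's drill tooling.

```python
import random

class FaultInjector:
    """Red-team side of a drill: randomly fail a dependency call."""

    def __init__(self, error_rate: float, seed: int = 0):
        self.error_rate = error_rate
        self.rng = random.Random(seed)  # seeded for reproducible drills

    def call(self, fn, fallback):
        """Invoke fn, but simulate an outage at error_rate; the drill
        passes if fallback keeps answering whenever fn is 'down'."""
        if self.rng.random() < self.error_rate:
            return fallback()
        return fn()

# Illustrative drill: 30% of calls to the primary dependency fail,
# and the degraded path must serve every one of them.
injector = FaultInjector(error_rate=0.3)
results = [injector.call(lambda: "primary", lambda: "degraded")
           for _ in range(100)]
```

Automating drills this way is what makes the "low‑cost, high‑efficiency verification" goal reachable: the blue team's fallback logic is exercised continuously instead of only during full on‑site exercises.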
Summary
ByteDance’s domestic disaster‑recovery adopts a hybrid of same‑city redundancy and active‑active multi‑region deployment, achieving region‑level traffic switchover within 20 minutes and storage layer switchover within 30 minutes. Continuous improvement focuses on resource assessment, traffic routing, and data recovery, with an emphasis on integrating disaster‑recovery considerations into normal architecture design.
