How ByteDance Scales Services with Multi‑Region Unitization Architecture
This article explains ByteDance’s multi‑region unitization approach, covering its core concepts, motivations, architectural challenges, traffic routing, data synchronization, cut‑over strategies, and future evolution for large‑scale, resilient services. It also discusses operational optimizations, risk controls, and the impact on cost and development efficiency.
What Is Unitization
The core idea of unitization is to split business along a chosen dimension into self‑contained units, each capable of handling all operations for its subset of data. Traffic is sharded to units based on partition information, ensuring writes for the same partition go to the same unit.
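To make the idea concrete, here is a minimal Go sketch of partition‑based routing. The unit names and the hash‑modulo scheme are illustrative assumptions, not ByteDance's actual implementation; production systems typically use a configurable mapping table so partitions can be rebalanced without rehashing.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Hypothetical unit IDs; a real deployment maps these to physical regions.
var units = []string{"unit-east", "unit-north", "unit-central"}

// unitFor shards a partition key (here a UserID) onto a unit, so all
// writes for the same key always land on the same unit.
func unitFor(userID string) string {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return units[h.Sum32()%uint32(len(units))]
}

func main() {
	fmt.Println(unitFor("user-42")) // deterministic: same key, same unit
}
```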
Why Adopt Unitization
Resource limits: Physical constraints of a single data center prevent unbounded scaling, requiring multi‑region deployments.
Compliance: Regulations such as GDPR mandate that user data stay within specific regions.
Disaster recovery: Distributed units enable city‑level active‑active disaster recovery.
Additional Benefits
Improved user experience: Nearby scheduling routes users to a close unit, reducing latency.
Cost savings: Unlike traditional active‑passive setups, every unit serves live traffic, reducing redundant standby resources.
Isolation: Smaller units limit the blast radius of technical changes.
Challenges of Distributed Unitization
Data center latency: Inter‑city RTT can reach 40 ms, inflating cross‑region request times.
Data synchronization: Heterogeneous storage engines and weak inter‑region network conditions make reliable sync difficult.
Traffic routing: Determining the correct unit at each layer (client, gateway, RPC, storage) is complex.
Data correctness: Inconsistent routing can cause stale reads or dirty writes.
Cost: Each unit requires a full set of compute, storage, and networking resources.
Management complexity: More units mean more operational overhead.
ByteDance Multi‑Region Unitization Architecture
ByteDance’s deployment in mainland China consists of four dimensions: client routing, access‑layer correction, compute‑layer correction, and storage‑layer control. These layers ensure correct traffic scheduling and data access.
Four‑Layer Traffic Scheduling and Control
Client: A scheduling component routes traffic to the correct unit from the first hop, reducing cross‑unit traffic inside the internal network.
Access layer: Gateways use plugins to compute routing info and correct misrouted traffic (see the sketch after this list).
Compute layer: R&D frameworks or Service Mesh intercept internal RPC calls to enforce routing.
Storage layer: Middleware intercepts storage access to audit and block traffic that belongs in another unit.
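As an illustration of the access‑layer correction above, here is a minimal Go sketch. The `X-User-ID` header, the gateway addresses, and the `unitFor` helper (from the earlier sketch) are all assumptions; ByteDance's actual gateway plugins are not public.

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// Hypothetical map from unit ID to that unit's gateway address.
var unitGateways = map[string]*url.URL{
	"unit-east":    {Scheme: "http", Host: "gw-east.internal"},
	"unit-north":   {Scheme: "http", Host: "gw-north.internal"},
	"unit-central": {Scheme: "http", Host: "gw-central.internal"},
}

// correctUnit sketches an access-layer plugin: it derives the owning
// unit from partition info on the request, serves locally when it
// matches, and otherwise proxies to the owning unit's gateway.
func correctUnit(localUnit string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		target := unitFor(r.Header.Get("X-User-ID")) // unitFor from the earlier sketch
		if target == localUnit {
			next.ServeHTTP(w, r)
			return
		}
		// Misrouted first hop: forward across units instead of failing.
		httputil.NewSingleHostReverseProxy(unitGateways[target]).ServeHTTP(w, r)
	})
}
```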
Key Issues in Deploying Unitization
Choosing Unit Dimension
Physical dimensions (e.g., Region, data center) are preferred because they provide a stable scheduling basis and simplify long‑term evolution.
Choosing Partition Dimension
Partition dimensions must be non‑overlapping, fine‑grained enough for flexible traffic distribution, and cheap to compute, and they should keep call chains closed within a unit so requests rarely cross unit boundaries.
ByteDance uses UserID as the partition dimension for most services, with some services using Region instead.
Traffic Routing
Routing decisions are made at client, access, compute, and storage layers, each providing fallback correction to ensure traffic reaches the intended unit.
Data Synchronization
Two scenarios exist: one‑way sync for read‑only data owned by central services, and bidirectional sync between active‑active units, which requires conflict resolution and loop prevention.
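As a sketch of what bidirectional sync has to handle, the Go snippet below tags each event with its origin unit for loop prevention and resolves conflicts last‑writer‑wins. The types and the timestamp‑based policy are assumptions, not ByteDance's actual replication protocol.

```go
package replication

import "time"

// SyncEvent is a hypothetical change record shipped between units.
type SyncEvent struct {
	Key        string
	Value      []byte
	OriginUnit string    // unit where the write first landed
	WriteTime  time.Time // used for last-writer-wins conflict resolution
}

// Store holds one unit's local copy of replicated data.
type Store struct {
	localUnit string
	data      map[string]SyncEvent
}

// NewStore creates the store for one unit.
func NewStore(localUnit string) *Store {
	return &Store{localUnit: localUnit, data: make(map[string]SyncEvent)}
}

// Apply ingests one replicated event with loop prevention and a simple
// last-writer-wins conflict policy.
func (s *Store) Apply(ev SyncEvent) {
	// Loop prevention: drop events that originated locally and have
	// echoed back through a peer unit.
	if ev.OriginUnit == s.localUnit {
		return
	}
	// Conflict resolution: keep the newer write. Real systems may use
	// version vectors or per-unit priority instead of wall-clock time.
	if cur, ok := s.data[ev.Key]; ok && !cur.WriteTime.Before(ev.WriteTime) {
		return
	}
	s.data[ev.Key] = ev
}
```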
Cut‑over Process and Reliability
During cut‑over, a two‑phase configuration change, combined with a storage‑layer write‑disable window, prevents dirty writes caused by sync latency or a staggered config rollout.
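A Go sketch of that ordering follows; every helper name here is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Stubs standing in for storage middleware, replication monitoring, and
// the config platform.
func disableWrites(partition, unit string) error  { fmt.Println("write-disable", partition, "on", unit); return nil }
func enableWrites(partition, unit string) error   { fmt.Println("write-enable", partition, "on", unit); return nil }
func waitForSyncDrain(partition string) error     { time.Sleep(100 * time.Millisecond); return nil }
func publishRouting(partition, unit string) error { fmt.Println("route", partition, "->", unit); return nil }

// cutOver sketches the ordering that prevents dirty writes: writes stop
// before the routing flip and resume only after replication has drained
// and the new config has converged everywhere.
func cutOver(partition, fromUnit, toUnit string) error {
	if err := disableWrites(partition, fromUnit); err != nil { // phase 1: write ban
		return err
	}
	if err := waitForSyncDrain(partition); err != nil { // drain the sync backlog
		return err
	}
	if err := publishRouting(partition, toUnit); err != nil { // phase 2: flip routing
		return err
	}
	return enableWrites(partition, toUnit) // writes resume on the new unit
}

func main() {
	_ = cutOver("user-shard-7", "unit-east", "unit-north")
}
```

The key invariant is that writes are banned before the routing flip and restored only after the backlog drains, so no unit ever accepts a write its peer has not yet seen.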
Configuration distribution uses long‑connection push combined with periodic pull to achieve fast convergence across millions of pods.
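A minimal sketch of such a push‑plus‑pull watcher, where the 30‑second pull interval and all type names are assumptions:

```go
package routing

import "time"

// RoutingConfig is a hypothetical versioned routing table.
type RoutingConfig struct {
	Version int64
	Routes  map[string]string // partition -> unit
}

// watchConfig combines long-connection push (the fast path) with a
// periodic pull (the safety net for missed pushes), applying only
// monotonically newer versions.
func watchConfig(push <-chan RoutingConfig, pull func() RoutingConfig, apply func(RoutingConfig)) {
	var current int64
	maybeApply := func(c RoutingConfig) {
		if c.Version > current { // ignore stale or duplicate versions
			current = c.Version
			apply(c)
		}
	}
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case c := <-push: // pushed over the long-lived connection
			maybeApply(c)
		case <-ticker.C: // periodic reconciliation pull
			maybeApply(pull())
		}
	}
}
```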
Routing loops are avoided by tracking correction attempts in request context and aborting repeated corrections.
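A sketch of that guard, assuming the correction marker travels as a request header (the header name is invented here):

```go
package routing

import (
	"errors"
	"net/http"
)

const correctedHeader = "X-Unit-Corrected" // hypothetical marker name

var errRoutingLoop = errors.New("routing already corrected once; aborting to avoid a loop")

// forwardOnce corrects a misrouted request at most one time. If the
// request already carries the marker, two units each believe the other
// owns the partition (e.g., mid-rollout), so fail fast rather than
// bounce the request back and forth.
func forwardOnce(r *http.Request, forward func(*http.Request) error) error {
	if r.Header.Get(correctedHeader) != "" {
		return errRoutingLoop
	}
	r.Header.Set(correctedHeader, "1")
	return forward(r)
}
```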
Observability and Risk Control
Version monitoring of configuration rollout across instances.
Real‑time sync‑delay measurement, compared against the write‑disable duration, to verify data consistency (see the sketch below).
Business KPI monitoring (success rate, latency, load) during cut‑over.
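The consistency check in the second item reduces to a simple comparison; a sketch, with hypothetical names:

```go
package cutover

import "time"

// safeToRestoreWrites captures the check: writes may resume only once
// the write-disable window has outlasted the worst replication delay
// observed during it.
func safeToRestoreWrites(writeDisabledFor, maxSyncDelay time.Duration) bool {
	return writeDisabledFor > maxSyncDelay
}
```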
Future Evolution
Cost optimization: Reducing buffer capacity per data center from 50% to ~20%, and halving storage costs by splitting data across units.
More complex unit layouts: As new regions are added, unit placement and data migration strategies will need to evolve.
Enhanced multi‑active data capabilities: Supporting strong consistency across regions for high‑value services such as e‑commerce and payments.