How ByteDance Scales Services with Multi‑Region Unitization Architecture
This article explains ByteDance’s multi‑region unitization approach, covering its core concepts, motivations, architectural challenges, traffic routing, data synchronization, cut‑over strategies, and future evolution for large‑scale, resilient services. It also discusses operational optimizations, risk controls, and the impact on cost and development efficiency.
What Is Unitization
The core idea of unitization is to split business along a chosen dimension into self‑contained units, each capable of handling all operations for its subset of data. Traffic is sharded to units based on partition information, ensuring writes for the same partition go to the same unit.
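To make the idea concrete, here is a minimal Go sketch of partition‑based routing. The unit names and the hash‑modulo scheme are illustrative assumptions, not ByteDance's actual implementation; production systems typically use a configurable mapping table so partitions can be rebalanced without rehashing.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Hypothetical unit IDs; a real deployment maps these to physical regions.
var units = []string{"unit-east", "unit-north", "unit-central"}

// unitFor shards a partition key (here a UserID) onto a unit, so all
// writes for the same key always land on the same unit.
func unitFor(userID string) string {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return units[h.Sum32()%uint32(len(units))]
}

func main() {
	fmt.Println(unitFor("user-42")) // deterministic: same key, same unit
}
```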
Why Adopt Unitization
Resource limits: Physical constraints of a single data center prevent unbounded scaling, requiring multi‑region deployments.
Compliance: Regulations such as GDPR mandate that user data stay within specific regions.
Disaster recovery: Distributed units enable city‑level active‑active disaster recovery.
Additional Benefits
Improved user experience: Nearby scheduling routes users to a close unit, reducing latency.
Cost savings: Unlike traditional active‑passive setups, every unit serves live traffic, reducing redundant standby resources.
Isolation: Smaller units limit the blast radius of technical changes.
Challenges of Distributed Unitization
Data center latency: Inter‑city RTT can reach 40 ms, inflating cross‑region request times.
Data synchronization: Heterogeneous storage engines and weak inter‑region network conditions make reliable sync difficult.
Traffic routing: Determining the correct unit at each layer (client, gateway, RPC, storage) is complex.
Data correctness: Inconsistent routing can cause stale reads or dirty writes.
Cost: Each unit requires a full set of compute, storage, and networking resources.
Management complexity: More units mean more operational overhead.
ByteDance Multi‑Region Unitization Architecture
ByteDance’s deployment in mainland China consists of four dimensions: client routing, access‑layer correction, compute‑layer correction, and storage‑layer control. These layers ensure correct traffic scheduling and data access.
Four‑Layer Traffic Scheduling and Control
Client: A scheduling component routes traffic to the correct unit from the first hop, reducing cross‑unit traffic inside the internal network.
Access layer: Gateways use plugins to compute routing info and correct misrouted traffic (see the sketch after this list).
Compute layer: R&D frameworks or Service Mesh intercept internal RPC calls to enforce routing.
Storage layer: Middleware intercepts storage access to audit and block traffic that belongs in another unit.
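As an illustration of the access‑layer correction above, here is a minimal Go sketch. The `X-User-ID` header, the gateway addresses, and the `unitFor` helper (from the earlier sketch) are all assumptions; ByteDance's actual gateway plugins are not public.

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// Hypothetical map from unit ID to that unit's gateway address.
var unitGateways = map[string]*url.URL{
	"unit-east":    {Scheme: "http", Host: "gw-east.internal"},
	"unit-north":   {Scheme: "http", Host: "gw-north.internal"},
	"unit-central": {Scheme: "http", Host: "gw-central.internal"},
}

// correctUnit sketches an access-layer plugin: it derives the owning
// unit from partition info on the request, serves locally when it
// matches, and otherwise proxies to the owning unit's gateway.
func correctUnit(localUnit string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		target := unitFor(r.Header.Get("X-User-ID")) // unitFor from the earlier sketch
		if target == localUnit {
			next.ServeHTTP(w, r)
			return
		}
		// Misrouted first hop: forward across units instead of failing.
		httputil.NewSingleHostReverseProxy(unitGateways[target]).ServeHTTP(w, r)
	})
}
```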
Key Issues in Deploying Unitization
Choosing Unit Dimension
Physical dimensions (e.g., Region, data center) are preferred because they provide a stable scheduling basis and simplify long‑term evolution.
Choosing Partition Dimension
Partition dimensions must be non‑overlapping, fine‑grained enough for flexible traffic distribution, and cheap to compute, and they should keep call chains closed within a unit so requests rarely cross unit boundaries.
ByteDance uses UserID as the partition dimension for most services, with some services using Region instead.
Traffic Routing
Routing decisions are made at client, access, compute, and storage layers, each providing fallback correction to ensure traffic reaches the intended unit.
Data Synchronization
Two scenarios exist: one‑way sync for read‑only data owned by central services, and bidirectional sync between active‑active units, which requires conflict resolution and loop prevention.
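As a sketch of what bidirectional sync has to handle, the Go snippet below tags each event with its origin unit for loop prevention and resolves conflicts last‑writer‑wins. The types and the timestamp‑based policy are assumptions, not ByteDance's actual replication protocol.

```go
package replication

import "time"

// SyncEvent is a hypothetical change record shipped between units.
type SyncEvent struct {
	Key        string
	Value      []byte
	OriginUnit string    // unit where the write first landed
	WriteTime  time.Time // used for last-writer-wins conflict resolution
}

// Store holds one unit's local copy of replicated data.
type Store struct {
	localUnit string
	data      map[string]SyncEvent
}

// NewStore creates the store for one unit.
func NewStore(localUnit string) *Store {
	return &Store{localUnit: localUnit, data: make(map[string]SyncEvent)}
}

// Apply ingests one replicated event with loop prevention and a simple
// last-writer-wins conflict policy.
func (s *Store) Apply(ev SyncEvent) {
	// Loop prevention: drop events that originated locally and have
	// echoed back through a peer unit.
	if ev.OriginUnit == s.localUnit {
		return
	}
	// Conflict resolution: keep the newer write. Real systems may use
	// version vectors or per-unit priority instead of wall-clock time.
	if cur, ok := s.data[ev.Key]; ok && !cur.WriteTime.Before(ev.WriteTime) {
		return
	}
	s.data[ev.Key] = ev
}
```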
Cut‑over Process and Reliability
During cut‑over, a two‑phase configuration change, combined with a storage‑layer write‑disable window, prevents dirty writes caused by sync latency or a staggered config rollout.
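A Go sketch of that ordering follows; every helper name here is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Stubs standing in for storage middleware, replication monitoring, and
// the config platform.
func disableWrites(partition, unit string) error  { fmt.Println("write-disable", partition, "on", unit); return nil }
func enableWrites(partition, unit string) error   { fmt.Println("write-enable", partition, "on", unit); return nil }
func waitForSyncDrain(partition string) error     { time.Sleep(100 * time.Millisecond); return nil }
func publishRouting(partition, unit string) error { fmt.Println("route", partition, "->", unit); return nil }

// cutOver sketches the ordering that prevents dirty writes: writes stop
// before the routing flip and resume only after replication has drained
// and the new config has converged everywhere.
func cutOver(partition, fromUnit, toUnit string) error {
	if err := disableWrites(partition, fromUnit); err != nil { // phase 1: write ban
		return err
	}
	if err := waitForSyncDrain(partition); err != nil { // drain the sync backlog
		return err
	}
	if err := publishRouting(partition, toUnit); err != nil { // phase 2: flip routing
		return err
	}
	return enableWrites(partition, toUnit) // writes resume on the new unit
}

func main() {
	_ = cutOver("user-shard-7", "unit-east", "unit-north")
}
```

The key invariant is that writes are banned before the routing flip and restored only after the backlog drains, so no unit ever accepts a write its peer has not yet seen.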
Configuration distribution uses long‑connection push combined with periodic pull to achieve fast convergence across millions of pods.
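A minimal sketch of such a push‑plus‑pull watcher, where the 30‑second pull interval and all type names are assumptions:

```go
package routing

import "time"

// RoutingConfig is a hypothetical versioned routing table.
type RoutingConfig struct {
	Version int64
	Routes  map[string]string // partition -> unit
}

// watchConfig combines long-connection push (the fast path) with a
// periodic pull (the safety net for missed pushes), applying only
// monotonically newer versions.
func watchConfig(push <-chan RoutingConfig, pull func() RoutingConfig, apply func(RoutingConfig)) {
	var current int64
	maybeApply := func(c RoutingConfig) {
		if c.Version > current { // ignore stale or duplicate versions
			current = c.Version
			apply(c)
		}
	}
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case c := <-push: // pushed over the long-lived connection
			maybeApply(c)
		case <-ticker.C: // periodic reconciliation pull
			maybeApply(pull())
		}
	}
}
```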
Routing loops are avoided by tracking correction attempts in request context and aborting repeated corrections.
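A sketch of that guard, assuming the correction marker travels as a request header (the header name is invented here):

```go
package routing

import (
	"errors"
	"net/http"
)

const correctedHeader = "X-Unit-Corrected" // hypothetical marker name

var errRoutingLoop = errors.New("routing already corrected once; aborting to avoid a loop")

// forwardOnce corrects a misrouted request at most one time. If the
// request already carries the marker, two units each believe the other
// owns the partition (e.g., mid-rollout), so fail fast rather than
// bounce the request back and forth.
func forwardOnce(r *http.Request, forward func(*http.Request) error) error {
	if r.Header.Get(correctedHeader) != "" {
		return errRoutingLoop
	}
	r.Header.Set(correctedHeader, "1")
	return forward(r)
}
```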
Observability and Risk Control
Version monitoring of configuration rollout across instances.
Real‑time sync‑delay measurement, compared against the write‑disable duration, to verify data consistency (see the sketch below).
Business KPI monitoring (success rate, latency, load) during cut‑over.
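The consistency check in the second item reduces to a simple comparison; a sketch, with hypothetical names:

```go
package cutover

import "time"

// safeToRestoreWrites captures the check: writes may resume only once
// the write-disable window has outlasted the worst replication delay
// observed during it.
func safeToRestoreWrites(writeDisabledFor, maxSyncDelay time.Duration) bool {
	return writeDisabledFor > maxSyncDelay
}
```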
Future Evolution
Cost optimization: Reducing buffer capacity per data center from 50% to ~20%, and halving storage costs by splitting data across units.
More complex unit layouts: As new regions are added, unit placement and data migration strategies will need to evolve.
Enhanced multi‑active data capabilities: Supporting strong consistency across regions for high‑value services such as e‑commerce and payments.