How Alibaba Achieves Multi‑Site High Availability: Architecture Deep Dive
This article explains Alibaba's multi‑site high‑availability architecture, covering its origins after Double 11 bottlenecks, core principles like decentralization and consistency‑availability trade‑offs, layered design from traffic routing to data storage, and a real‑world deployment example.
Background and Motivation
Alibaba introduced Multi‑Site High Availability (also called "异地多活") after experiencing severe capacity and stability limits during the Double 11 shopping festival, where a single data center could not handle the traffic surge and posed a high single‑point‑of‑failure risk.
Core Principles
Decentralization : Each site can independently process and respond to requests, eliminating single points of failure.
Consistency‑Availability Balance : Replication modes (synchronous, semi‑synchronous, asynchronous) are chosen per workload, and some data is allowed to be only eventually consistent, trading data consistency against latency and availability.
Health Monitoring and Automatic Failover : Continuous health checks detect site failures or capacity bottlenecks and automatically adjust traffic distribution.
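As a rough illustration of the failover principle, the sketch below moves traffic weight away from unhealthy sites and renormalizes the rest in proportion to their original weights. The function name, the site names, and the proportional policy are assumptions made for this example, not a description of Alibaba's actual scheduler.

```python
from typing import Dict


def rebalance_weights(base_weights: Dict[str, float],
                      healthy: Dict[str, bool]) -> Dict[str, float]:
    """Return normalized traffic shares, giving unhealthy sites a share of 0.

    base_weights: configured weight per site (hypothetical values).
    healthy: latest health-check verdict per site; a missing entry counts as unhealthy.
    """
    live = {site: w for site, w in base_weights.items() if healthy.get(site, False)}
    total = sum(live.values())
    if total == 0:
        # Nothing can take traffic; a real system would alert rather than route.
        raise RuntimeError("no healthy site available")
    # Healthy sites keep their relative proportions; unhealthy ones drop to zero.
    return {site: (live[site] / total if site in live else 0.0)
            for site in base_weights}
```

With weights of 50, 30, and 20 and the second site marked unhealthy, the first and third sites would end up with 50/70 and 20/70 of the traffic.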
Architecture Overview
The architecture is divided into several layers:
Traffic Routing Layer : DNS/GSLB performs global routing based on geography, business weight, and site health. When a site becomes unavailable, traffic weight is shifted to healthy sites.
Access and Gateway Layer : SLB, API gateways, and reverse proxies identify the target site and ensure requests reach the correct cluster, providing session persistence and idempotency control.
Application and Service Layer : Each site runs a complete micro‑service ecosystem with independent service discovery, configuration, rate limiting, and circuit breaking.
Data and Storage Layer : Site‑level database instances or sharded clusters (e.g., order and account databases) route writes by site, minimizing cross‑site writes. Data replication tools such as DTS, binlog subscription, or log‑based sync enable cross‑site data consistency and disaster recovery.
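To make the gateway and data-layer responsibilities above more concrete, here is a minimal sketch of a gateway that first resolves which site owns a request and then deduplicates retries by idempotency key. The class name, the `site_of` lookup, and the in-memory result store are hypothetical; a production gateway would presumably use a shared, persistent store.

```python
from typing import Any, Callable, Dict, Tuple


class SiteAwareGateway:
    """Toy gateway: resolves the owning site, then deduplicates by idempotency key."""

    def __init__(self, local_site: str, site_of: Callable[[int], str]) -> None:
        self.local_site = local_site
        self.site_of = site_of  # hypothetical user -> site lookup
        self._results: Dict[str, Any] = {}  # in-memory stand-in for a dedup store

    def handle(self, user_id: int, idempotency_key: str,
               action: Callable[[], Any]) -> Tuple[str, Any]:
        target = self.site_of(user_id)
        if target != self.local_site:
            # The request belongs to another site: forward it instead of writing here.
            return ("forward", target)
        if idempotency_key in self._results:
            # A retry of a request that already ran: return the stored result.
            return ("replayed", self._results[idempotency_key])
        result = action()
        self._results[idempotency_key] = result
        return ("executed", result)
```

Checking site ownership before executing keeps writes on the owning site, and the idempotency check means a client retry (for example after a failover) does not apply the same write twice.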
User
↓
DNS / GSLB (global scheduling)
↓
Access layer (Nginx / Gateway)
↓
Application layer (microservice clusters)
↓
Data layer (database / cache / MQ)
Practical Scenario: Double 11
During Double 11, traffic spikes to dozens of times the normal level, more than a single data center can absorb. Alibaba adopts a "site‑plus‑multi‑site" model: users are partitioned by ID or region, and a complete data center is deployed in each city. All user interactions (browsing, searching, adding to cart, ordering, and payment) are completed within the user's assigned site, so the request path needs no cross‑site calls.
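Partitioning users by ID can be sketched as a stable mapping from user ID to site. The slot count, the slot boundaries, and the site names below are invented for illustration; the article does not say how the actual mapping is defined.

```python
import bisect

# Hypothetical routing table: 100 slots split across three illustrative sites.
# Slots 0-39 -> site-a, 40-69 -> site-b, 70-99 -> site-c.
SLOT_UPPER_BOUNDS = [40, 70, 100]
SITES = ["site-a", "site-b", "site-c"]


def site_for_user(user_id: int) -> str:
    """Map a user ID to its assigned site via a fixed slot table."""
    slot = user_id % 100  # the same user always lands in the same slot
    # bisect_right finds the first boundary strictly greater than the slot.
    return SITES[bisect.bisect_right(SLOT_UPPER_BOUNDS, slot)]
```

Because the mapping depends only on the user ID, every layer can compute the same answer independently, and shifting load between sites would amount to editing the slot boundaries.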
The multi‑site capability is abstracted as a "business multi‑site disaster‑recovery solution" and offered to external customers. It covers traffic routing, access, application, middleware, database, and big‑data scenarios, and provides templates for three‑site five‑center high‑availability deployments in the government, enterprise, and financial sectors.
Mike Chen's Internet Architecture
Over ten years of BAT architecture experience, shared generously!
