How Alibaba Achieves Multi‑Site High Availability: Architecture Deep Dive

This article explains Alibaba's multi‑site high‑availability architecture, covering its origins after Double 11 bottlenecks, core principles like decentralization and consistency‑availability trade‑offs, layered design from traffic routing to data storage, and a real‑world deployment example.

Mike Chen's Internet Architecture

Mar 26, 2026

Background and Motivation

Alibaba introduced multi‑site high availability (异地多活, literally "remote multi‑active") after hitting severe capacity and stability limits during the Double 11 shopping festival: a single data center could not absorb the traffic surge and was itself a single point of failure.

Core Principles

Decentralization : Each site can independently process and respond to requests, eliminating single points of failure.

Consistency‑Availability Balance : Replication can be synchronous, semi‑synchronous, or asynchronous. Each mode trades data consistency against latency and availability, and asynchronous replication gives only eventual consistency across sites.

Health Monitoring and Automatic Failover : Continuous health checks detect site failures or capacity bottlenecks and automatically adjust traffic distribution.
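The failover principle above can be illustrated with a minimal sketch. It assumes each site has a configured traffic weight and a boolean health flag from the latest health check. The `Site` type, the `effective_weights` function, and the site names are hypothetical and are not taken from Alibaba's implementation.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    weight: float   # configured share of traffic, e.g. 0.5 for 50%
    healthy: bool   # result of the most recent health check (assumed input)


def effective_weights(sites: list[Site]) -> dict[str, float]:
    """Redistribute traffic so unhealthy sites receive none.

    Healthy sites keep their relative proportions; the returned
    weights always sum to 1.0.
    """
    healthy = [s for s in sites if s.healthy]
    total = sum(s.weight for s in healthy)
    if total <= 0:
        raise RuntimeError("no healthy site with positive weight")
    result = {s.name: 0.0 for s in sites}
    for s in healthy:
        result[s.name] = s.weight / total
    return result


# Hypothetical three-site setup in which site-c has failed its health check.
sites = [
    Site("site-a", 0.5, True),
    Site("site-b", 0.3, True),
    Site("site-c", 0.2, False),
]
weights = effective_weights(sites)
```

Proportional redistribution is only one possible policy. A real scheduler would probably also cap each site's share at its provisioned capacity, so that a failover does not overload the surviving sites.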

Architecture Overview

The architecture is divided into several layers:

Traffic Routing Layer : DNS/GSLB performs global routing based on geography, business weight, and site health. When a site becomes unavailable, traffic weight is shifted to healthy sites.

Access and Gateway Layer : SLB, API gateways, and reverse proxies identify the target site and ensure requests reach the correct cluster, providing session persistence and idempotency control.

Application and Service Layer : Each site runs a complete micro‑service ecosystem with independent service discovery, configuration, rate limiting, and circuit breaking.

Data and Storage Layer : Site‑level database instances or sharded clusters (e.g., order and account databases) route writes by site, minimizing cross‑site writes. Data replication tools such as DTS, binlog subscription, or log‑based sync enable cross‑site data consistency and disaster recovery.
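The idempotency control mentioned for the access and gateway layer matters because a client may retry a request after traffic is shifted to another site. The sketch below shows one common way to do it, assuming the client attaches an idempotency key to each request. `IdempotentGateway` is a hypothetical name, and the in‑memory dictionary stands in for whatever shared store a real gateway would use.

```python
from typing import Any, Callable, Dict


class IdempotentGateway:
    """Run the business handler at most once per idempotency key."""

    def __init__(self, handler: Callable[[Any], Any]) -> None:
        self._handler = handler              # downstream business call (hypothetical)
        self._results: Dict[str, Any] = {}   # idempotency key -> cached response

    def handle(self, idempotency_key: str, request: Any) -> Any:
        # A retried request with a known key returns the cached response
        # instead of invoking the handler a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        response = self._handler(request)
        self._results[idempotency_key] = response
        return response
```

This single‑process version is not safe under concurrent requests with the same key. A production gateway would need an atomic check‑and‑set in a store that both the original and the failover site can reach.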

User
  ↓
DNS / GSLB (global traffic scheduling)
  ↓
Access layer (Nginx / Gateway)
  ↓
Application layer (microservice clusters)
  ↓
Data layer (database / cache / MQ)

Practical Scenario: Double 11

During Double 11, traffic spikes to dozens of times the normal level, more than a single data center can absorb. Alibaba adopts a "site‑plus‑multi‑site" model: users are partitioned by ID or region, and each city hosts a complete data center. Every step of a user's session (browsing, searching, adding to cart, ordering, and payment) is completed within the user's assigned site, so the request path involves no cross‑site calls.
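Partitioning users by ID can be sketched as a stable mapping from user ID to site, which the access layer consults to decide whether a request already sits in its home site. The hash‑modulo scheme, the site names, and both function names below are assumptions for illustration. The source does not describe the actual mapping rule.

```python
import hashlib

# Hypothetical site list; real deployments would load this from configuration.
SITES = ["site-a", "site-b", "site-c"]


def site_for_user(user_id: str, sites: list[str] = SITES) -> str:
    """Map a user ID to its home site with a hash that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(sites)
    return sites[bucket]


def must_forward(user_id: str, local_site: str) -> bool:
    """Return True when a request landed in a site other than the user's home site."""
    return site_for_user(user_id) != local_site
```

A plain modulo remaps many users whenever the number of sites changes, so a production system would more likely use an explicit routing table or a fixed number of virtual buckets assigned to sites.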

The multi‑site capability has also been abstracted into a "business multi‑site disaster‑recovery solution" offered to external customers. It covers traffic routing, access, application, middleware, database, and big‑data scenarios, and provides templates for three‑site, five‑center high‑availability deployments in the government, enterprise, and financial sectors.
