
Alibaba’s Multi‑Active Data Center Architecture: Design, Challenges, and Lessons Learned

In this interview, Alibaba’s Lin Hao (aka Bi Xuan) explains the motivations, deployment details, technical challenges such as latency, routing consistency, and data consistency, and the solutions behind the company’s multi‑active disaster‑recovery data‑center architecture that powers its massive e‑commerce platform.

Alibaba Cloud Infrastructure

In the era of big data, remote disaster recovery for data centers has become crucial, and Alibaba launched a multi‑active data‑center project before last year’s Double‑11 shopping festival.

InfoQ: Please introduce the remote multi‑active data‑center project. Bi Xuan: Internally the project is called "unitization"; the first stage is single‑site, the second is dual‑active, and the third is multi‑active. It distributes transaction processing across multiple geographically separated data centers, each handling traffic locally.

The initiative began around 2009‑2010 when Alibaba experimented with remote data centers after successfully deploying multiple centers within the same city. Traditional industry practice used remote sites only as cold backups, which proved costly and unreliable for Alibaba’s massive scale.

Key drivers for adopting multi‑active architecture were the rapid growth of Alibaba’s e‑commerce, logistics, cloud, and big‑data services, which could no longer be accommodated in a single city, and the need to avoid bottlenecks in a monolithic architecture.

InfoQ: How is the project deployed? Bi Xuan: During the last Double‑11, Alibaba operated two data centers—one in Hangzhou and another in a different city—each handling 50% of user traffic. All user actions, from browsing to checkout, were processed entirely within the user’s assigned data center without cross‑center communication.
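A minimal sketch of how a 50/50 split with "sticky" users might be expressed, assuming a stable hash of the user ID decides the unit. The unit names, the `unit_for_user` function, and the hashing scheme are illustrative assumptions; the interview does not describe Alibaba's actual routing rule.

```python
import hashlib

# Hypothetical unit names: the interview names only Hangzhou and "a different city".
UNITS = ("hangzhou", "remote-city")


def unit_for_user(user_id: int, first_unit_percent: int = 50) -> str:
    """Map a user to exactly one unit, deterministically.

    A stable hash (not Python's per-process randomized hash()) puts each
    user into one of 100 buckets; buckets below `first_unit_percent` go to
    the first unit, the rest to the second.
    """
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # value in 0..99
    return UNITS[0] if bucket < first_unit_percent else UNITS[1]


if __name__ == "__main__":
    from collections import Counter

    # Every request from the same user resolves to the same unit.
    assert unit_for_user(12345) == unit_for_user(12345)

    # Across many users the split approaches the configured percentage.
    print(Counter(unit_for_user(uid) for uid in range(10_000)))
```

Because the mapping depends only on the user ID, every layer that evaluates it (entry point, services, database access) reaches the same answer, which is what lets a session complete without cross-center calls.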

The advantages of remote multi‑active deployment include real‑time traffic handling, immediate failover without the high cost of full‑site cold backups, and reduced risk compared to traditional disaster‑recovery models.

InfoQ: What challenges did you face? Bi Xuan: The biggest challenge was latency; inter‑city communication adds up to about 100 ms, which becomes significant when a single page may involve hundreds of service calls. To mitigate this, Alibaba pursued "unitization": ensuring all operations for a user stay within one data center.

Another challenge was data consistency: with multiple active sites, writes could occur in different locations, risking divergent data. Alibaba chose the buyer dimension as the primary partitioning key, ensuring that all buyer‑related transactions are confined to a single unit.

Routing consistency is critical; the routing layer must direct a user’s request to the correct data center and maintain that path through front‑end, back‑end services, and databases. Any routing error could result in missing or incorrect data.

Data synchronization latency must stay under one second nationwide; existing open‑source solutions could not meet this requirement, prompting Alibaba to develop custom synchronization mechanisms, now offered as a service on Alibaba Cloud.

Ensuring data correctness across sites is paramount: while business‑level failures are tolerable, data corruption is not. The system therefore enforces that a single row is written in only one location.
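Two of the points above lend themselves to a small illustration: why 100 ms per cross-city call is prohibitive, and what "a row is written in only one location" could look like as a guard. The sketch below is hypothetical throughout. `added_latency_ms` assumes the calls run sequentially (a worst case), and `UnitStore`, `WrongUnitError`, and the `home_unit_of` routing function are invented names, not Alibaba components.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Tuple

CROSS_CITY_MS = 100.0  # upper bound quoted in the interview


def added_latency_ms(service_calls: int, cross_unit_fraction: float,
                     per_call_ms: float = CROSS_CITY_MS) -> float:
    """Extra page latency if a fraction of calls cross cities, one after another."""
    return service_calls * cross_unit_fraction * per_call_ms


class WrongUnitError(Exception):
    """Raised when a write reaches a unit that does not own the buyer."""


@dataclass
class UnitStore:
    """Toy per-unit data store that accepts writes only for buyers it owns."""
    unit: str
    home_unit_of: Callable[[int], str]  # buyer_id -> owning unit (assumed routing rule)
    rows: Dict[Tuple[int, str], Any] = field(default_factory=dict)

    def write(self, buyer_id: int, key: str, value: Any) -> None:
        home = self.home_unit_of(buyer_id)
        if home != self.unit:
            # Fail the request rather than risk two units writing the same row.
            raise WrongUnitError(f"buyer {buyer_id} belongs to {home}, not {self.unit}")
        self.rows[(buyer_id, key)] = value


if __name__ == "__main__":
    # 200 sequential calls, all cross-city: 200 * 100 ms = 20,000 ms of added latency.
    print(added_latency_ms(200, 1.0))

    route = lambda buyer_id: "unit-a" if buyer_id % 2 == 0 else "unit-b"
    store_a = UnitStore(unit="unit-a", home_unit_of=route)
    store_a.write(2, "cart", ["book"])       # accepted: buyer 2 lives in unit-a
    try:
        store_a.write(3, "cart", ["lamp"])   # rejected: buyer 3 lives in unit-b
    except WrongUnitError as err:
        print("rejected:", err)
```

Rejecting a misrouted write trades availability for correctness on that one request, which matches the stated priority that business-level failures are tolerable while data corruption is not.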

Alibaba also emphasizes rapid fault recovery as a key metric of high availability. By instantly diverting traffic from a failed site to another active site, the company can contain incidents within a minute, achieving “four‑nine” or “five‑nine” availability levels.
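To put the availability figures in perspective, the sketch below converts "nines" into a yearly downtime budget. It assumes a 365-day year and counts every incident minute as a full outage; the function names are illustrative.

```python
from typing import Iterable

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (leap years ignored)


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted at a given availability level."""
    return MINUTES_PER_YEAR * (1.0 - availability)


def availability_from_incidents(incident_minutes: Iterable[float]) -> float:
    """Yearly availability implied by a list of outage durations in minutes."""
    return 1.0 - sum(incident_minutes) / MINUTES_PER_YEAR


if __name__ == "__main__":
    print(round(downtime_budget_minutes(0.9999), 2))   # four nines: about 52.56 minutes
    print(round(downtime_budget_minutes(0.99999), 3))  # five nines: about 5.256 minutes

    # Five incidents contained within one minute each still meet five nines.
    print(availability_from_incidents([1.0] * 5) >= 0.99999)
```

Under these assumptions, five nines leaves room for roughly five one-minute incidents a year, which is why recovery time, rather than incident count alone, dominates the result.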

InfoQ: Do you still work on Java performance issues? Bi Xuan: Yes. I still troubleshoot performance problems and share what I learn through the "HelloJava" WeChat public account, so that others can learn from real‑world incidents.

When Alibaba’s needs exceed what the OpenJDK community prioritizes, the team builds custom JVM improvements internally, leveraging expertise from engineers like Zhao Haiping, who previously worked on the HipHop PHP engine.

Overall, the multi‑active data‑center architecture demonstrates how large‑scale e‑commerce platforms can achieve low‑latency, high‑availability services through careful partitioning, routing, and synchronization strategies.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
