Alibaba's Evolution to Geo‑Distributed Multi‑Active Architecture: Design, Challenges, and Lessons
The article details Alibaba's three‑stage evolution from same‑city active‑active to two‑site three‑center disaster recovery and finally to geo‑distributed multi‑active architecture, explaining its motivations, technical challenges such as latency and data consistency, the solutions implemented, and the resulting scalability and fault‑tolerance benefits.
After the 2015 "Double Eleven" shopping festival, Alibaba publicly announced its shift from "dual‑active" to "multi‑active" across geographically separated data centers, aiming for continuous availability beyond traditional disaster recovery.
The evolution of Alibaba's high‑availability architecture can be divided into three steps: same‑city dual‑active, remote read‑only and cold‑standby, and finally geo‑distributed multi‑active. The early dual‑active model relied on two same‑city production data centers with synchronous replication. Because every synchronous write waits for the other site to acknowledge it, latency limited how far apart the two centers could be.
Traditional two‑site three‑center disaster recovery faced three major issues: the remote backup center was cold and could not take traffic instantly; resources were under‑utilized, raising costs; and all writes were confined to a single site, causing storage and scaling pressure during peak events like Double Eleven.
To overcome these problems, Alibaba introduced multi‑active across multiple regions (typically >1000 km apart). The key goals were: multiple cross‑region data centers, each handling full read‑write traffic; multi‑point writes to avoid latency bottlenecks; and the ability for any center to take over traffic within minutes.
The main technical challenges were latency (e.g., 30 ms round‑trip could add several seconds to a page response due to hundreds of backend calls) and data consistency under multi‑point writes (ensuring a user’s transaction is recorded correctly and visible everywhere). Alibaba addressed latency by keeping as many operations as possible within a single data center and by modularizing services into "units" that could be deployed independently.
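The latency figure above can be checked with simple arithmetic. The sketch below assumes the backend calls run sequentially and that each one pays a full cross‑region round trip. The call counts are illustrative assumptions, not figures from Alibaba.

```python
def sequential_latency_ms(cross_region_calls: int, rtt_ms: float) -> float:
    """Extra page latency if every cross-region call waits for the previous one."""
    return cross_region_calls * rtt_ms


RTT_MS = 30.0         # round-trip time quoted in the text
CALLS_PER_PAGE = 100  # assumed value standing in for "hundreds of backend calls"

# Case 1: every backend call crosses regions.
all_remote_ms = sequential_latency_ms(CALLS_PER_PAGE, RTT_MS)

# Case 2 (hypothetical unitized page): only two calls leave the local data center.
mostly_local_ms = sequential_latency_ms(2, RTT_MS)

print(f"all remote:   {all_remote_ms / 1000:.1f} s")
print(f"mostly local: {mostly_local_ms:.0f} ms")
```

Under these assumptions, 100 cross‑region calls add 3 s to the page, while a page that keeps all but two calls inside its own data center adds only 60 ms. This is the motivation for keeping a request's call chain within one unit.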
For consistency, Alibaba built a custom data‑synchronization system (supplementing OceanBase and MySQL) that kept cross‑center replication delay under one second during the 2015 Double Eleven event. Real‑time validation and protective layers were added in front of database writes so that a write arriving at the wrong center could not produce divergent data.
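One way to picture a protective layer in front of database writes is a guard that checks whether the local unit owns the user's data before accepting the write. The sketch below is an illustration under assumptions: the unit names, the shard count, the CRC32‑based sharding, and the `guard_write` helper are all hypothetical and are not taken from Alibaba's implementation.

```python
import zlib

UNITS = ["unit-a", "unit-b", "unit-c"]  # hypothetical unit names
NUM_SHARDS = 12                         # hypothetical shard count

# Static shard-to-unit ownership table (assumed; a real system would manage this centrally).
SHARD_OWNER = {shard: UNITS[shard % len(UNITS)] for shard in range(NUM_SHARDS)}


class WrongUnitError(Exception):
    """Raised when a write reaches a unit that does not own the user's shard."""


def shard_of(user_id: str) -> int:
    """Map a user to a shard with a stable hash, so every unit computes the same answer."""
    return zlib.crc32(user_id.encode("utf-8")) % NUM_SHARDS


def owner_of(user_id: str) -> str:
    """Return the unit that is allowed to accept writes for this user."""
    return SHARD_OWNER[shard_of(user_id)]


def guard_write(local_unit: str, user_id: str) -> None:
    """Reject the write before it reaches the database if this unit is not the owner."""
    owner = owner_of(user_id)
    if owner != local_unit:
        raise WrongUnitError(f"{user_id} belongs to {owner}, not {local_unit}")


if __name__ == "__main__":
    user = "user-42"
    home = owner_of(user)
    guard_write(home, user)  # accepted: the write arrived at the owning unit
    other = next(u for u in UNITS if u != home)
    try:
        guard_write(other, user)
    except WrongUnitError as exc:
        print("rejected:", exc)
```

Because each user's writes are accepted in exactly one unit at a time, the replication stream never has to merge two conflicting versions of the same record.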
The multi‑active architecture brings two major benefits. The first is strong horizontal scalability: new units can be added to handle increased transaction volume without complex re‑engineering. The second is robust fault tolerance, allowing rapid failover at the instance, data‑center, city, or global level.
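Failover at the data‑center level can be sketched as reassigning the failed unit's traffic to the surviving units. The function below is a minimal illustration, assuming traffic is divided into numbered shards with a single owning unit each. The `fail_over` name and the round‑robin policy are hypothetical.

```python
from __future__ import annotations


def fail_over(
    shard_owner: dict[int, str], failed_unit: str, healthy_units: list[str]
) -> dict[int, str]:
    """Return a new ownership table with the failed unit's shards spread round-robin."""
    if not healthy_units:
        raise ValueError("no healthy unit left to take over traffic")
    if failed_unit in healthy_units:
        raise ValueError("failed unit must not be listed as healthy")

    new_owner = dict(shard_owner)  # copy so the caller's table is left untouched
    orphaned = sorted(s for s, unit in shard_owner.items() if unit == failed_unit)
    for i, shard in enumerate(orphaned):
        new_owner[shard] = healthy_units[i % len(healthy_units)]
    return new_owner


if __name__ == "__main__":
    before = {0: "unit-a", 1: "unit-b", 2: "unit-c", 3: "unit-a"}
    after = fail_over(before, "unit-a", ["unit-b", "unit-c"])
    print(after)
```

In this example the two shards owned by `unit-a` move to `unit-b` and `unit-c`, while shards that were already healthy keep their owner. In practice, writes for the moved shards would have to wait until replication from the failed unit has caught up, otherwise the new owner could act on stale data.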
Implementation timeline: 2013 – same‑city dual‑active units; 2014 – near‑city dual‑active with some read‑only traffic; 2015 onward – geo‑distributed multi‑active across centers >1000 km apart, expanding from two to three or four centers, enabling nationwide deployment and simplifying capacity planning for future Double Eleven events.
Alibaba now exposes many of the underlying technologies (e.g., DTS, EDAS, DRDS, ONS) to external users, turning internal multi‑active capabilities into cloud services that other enterprises can adopt.
