How Alipay Achieved Near‑Zero Downtime with Multi‑Datacenter Failover Architecture
This article traces the evolution of Alipay's high-availability and disaster-recovery architecture, from a simple single-datacenter design to a multi-datacenter, unit-based system with failover and blue-green deployment. It highlights the challenges, the solutions, and the operational benefits that keep the service running through massive traffic spikes.
Why High Availability and Disaster Recovery Matter
In enterprise services, cloud computing, and the mobile internet, highly available distributed systems are essential for keeping a platform running reliably and for maintaining user confidence, especially during traffic peaks such as the Double 11 shopping festival.
Definition of Disaster‑Recovery Systems
A disaster-recovery (DR) system provides an environment that can withstand hardware, software, and network failures, power outages, and human error. At a minimum it keeps data safe (data DR); in more advanced cases it also keeps the application serving without interruption (application DR).
Alipay’s Architecture Evolution
Pure Stage (2004‑2011)
Early Alipay ran in a single logical datacenter, with commercial load balancers (LB) directing traffic to a VIP-based gateway. All core services shared one database, so disaster-recovery capability was low: a single node failure could interrupt service.
Naive Stage (2011‑2012)
To eliminate single points of failure, Alipay split the logical datacenter across multiple physical sites, replaced hardware LB with software load balancing, and introduced a configuration center for service registration and health monitoring. Data was sharded horizontally by user ID, and a session/data module decoupled long-lived connections.
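The two ideas in this stage, sharding by user ID and a configuration center that tracks instance health, can be sketched roughly as follows. This is a minimal illustration, not Alipay's implementation: the shard count, the modulo rule, the `ConfigCenter` class, and the heartbeat timeout are all assumptions made for the example.

```python
from typing import Dict, List

NUM_SHARDS = 100  # assumed shard count; the article does not give one


def shard_for_user(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard so one user's data always lands on the same shard."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    return user_id % num_shards


class ConfigCenter:
    """Hypothetical in-memory registry: instances send heartbeats, callers ask who is alive."""

    def __init__(self, heartbeat_timeout: float = 5.0) -> None:
        self._timeout = heartbeat_timeout
        # service name -> {instance address -> time of last heartbeat}
        self._last_seen: Dict[str, Dict[str, float]] = {}

    def heartbeat(self, service: str, address: str, now: float) -> None:
        """Register an instance, or refresh it if it is already known."""
        self._last_seen.setdefault(service, {})[address] = now

    def healthy_instances(self, service: str, now: float) -> List[str]:
        """Return the instances whose last heartbeat is within the timeout."""
        seen = self._last_seen.get(service, {})
        return sorted(addr for addr, ts in seen.items() if now - ts <= self._timeout)


center = ConfigCenter(heartbeat_timeout=5.0)
center.heartbeat("payment", "10.0.0.1:8080", now=0.0)
center.heartbeat("payment", "10.0.0.2:8080", now=4.0)

# At t=6 the first instance has been silent for 6 s and is dropped from routing.
live = center.healthy_instances("payment", now=6.0)
shard = shard_for_user(1234567)
```

Passing `now` explicitly keeps the sketch deterministic; a real registry would read a clock and would also need persistence and replication, which are out of scope here.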
Mature Stage (2012‑2015)
With multi‑datacenter active‑active architecture and a Failover layer, Alipay could switch all read/write traffic to a dedicated Failover database within minutes, preserving data consistency and minimizing user impact. Blue‑Green deployment further reduced risk by routing a small percentage of traffic to the new version before full rollout.
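As a rough sketch of these two mechanisms, the code below routes a configurable percentage of users to the new version and flips database traffic to a failover target with a single switch. The names `release_group` and `FailoverSwitch`, the choice of "green" for the new version, and the CRC32 bucketing are assumptions for illustration; the article does not describe how Alipay implements either step.

```python
import zlib


def release_group(user_id: int, green_percent: int) -> str:
    """Send roughly green_percent of users to the new ("green") version.

    Hashing the user ID keeps each user in a stable bucket, so raising the
    percentage only adds users to green and never moves one back to blue.
    """
    if not 0 <= green_percent <= 100:
        raise ValueError("green_percent must be between 0 and 100")
    bucket = zlib.crc32(str(user_id).encode("utf-8")) % 100
    return "green" if bucket < green_percent else "blue"


class FailoverSwitch:
    """Hypothetical switch that points reads and writes at a failover database."""

    def __init__(self, primary: str, failover: str) -> None:
        self.primary = primary
        self.failover = failover
        self.failed_over = False

    def fail_over(self) -> None:
        """Redirect all traffic to the failover database."""
        self.failed_over = True

    def restore(self) -> None:
        """Return traffic to the primary database after recovery."""
        self.failed_over = False

    def target(self) -> str:
        return self.failover if self.failed_over else self.primary


switch = FailoverSwitch(primary="primary-db", failover="failover-db")
before = switch.target()
switch.fail_over()
after = switch.target()
```

The sketch leaves out the hard part of a real failover, which is keeping the failover database consistent with the primary before traffic is switched.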
Key Challenges Solved
Eliminated database connection‑count bottlenecks through horizontal sharding.
Overcame IDC resource limits by expanding beyond a single city.
Reduced cross‑IDC latency via localized data access.
Implemented blue‑green release to minimize user disruption.
Achieved high‑availability, multi‑region active‑active deployment.
Unit‑Based Architecture
Alipay introduced logical units (A, B, C) where each unit contains isolated data shards and independent service groups (Blue/Green). Unit C holds globally replicated data to support read‑write separation across regions, enabling fast failover and consistent performance.
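One way to picture unit-based routing is a table from data shard to owning unit, with globally shared data always resolved to unit C. The sketch below makes several assumptions that are not in the article: 100 shards, an even split between units A and B, a `data_kind` flag to mark global data, and failover modeled as reassigning a failed unit's shards to another unit.

```python
from typing import Dict

NUM_SHARDS = 100   # assumed shard count
GLOBAL_UNIT = "C"  # unit holding globally replicated data, per the text above


def build_routing_table(ownership: Dict[str, range]) -> Dict[int, str]:
    """Expand {unit: shard range} into a shard -> unit lookup table."""
    table: Dict[int, str] = {}
    for unit, shards in ownership.items():
        for shard in shards:
            table[shard] = unit
    return table


def route(table: Dict[int, str], user_id: int, data_kind: str = "user") -> str:
    """Pick the unit for a request: global data goes to C, user data to the shard owner."""
    if data_kind == "global":
        return GLOBAL_UNIT
    return table[user_id % NUM_SHARDS]


def fail_over_unit(table: Dict[int, str], failed: str, target: str) -> Dict[int, str]:
    """Return a new table in which the failed unit's shards are served by target."""
    return {shard: (target if unit == failed else unit) for shard, unit in table.items()}


table = build_routing_table({"A": range(0, 50), "B": range(50, 100)})
normal = route(table, 1234512)                   # shard 12, owned by unit A
recovered = route(fail_over_unit(table, "A", "B"), 1234512)
```

Because `fail_over_unit` returns a new table instead of mutating the old one, switching back after recovery is just a matter of reinstalling the original table.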
Summary
More flexible traffic control allows rapid, fine‑grained data switching.
Customizable resource allocation per data unit supports scalable transaction volumes.
Fast disaster‑recovery enables seamless traffic migration between Blue/Green groups, units, datacenters, and even cities.
21CTO
