How Alipay Built Seamless High Availability and Disaster Recovery for Millions of Transactions
This article examines Alipay's evolution from a simple single-datacenter setup to a multi-site active-active, unit-based architecture. It covers the technical challenges of high availability, disaster recovery, failover design, and blue-green deployment, and shows how the resulting solutions keep the service running through massive traffic spikes such as Double 11.
High availability and disaster recovery are critical for enterprise services, cloud computing, and mobile internet platforms, ensuring correct business processing and uninterrupted service to maintain user confidence, especially during traffic peaks.
The risk of data loss demands robust disaster recovery (DR) systems. At a minimum these protect the data itself (data DR); at the highest level they also keep application services running without interruption (application DR).
During the Double 11 shopping festival, Alipay processed 91.217 billion CNY in a single day, with 68% of that on mobile and a peak rate of 140,000 transactions per second. Load at this scale tests the platform's IT capabilities, which is why continuous availability and rapid disaster recovery are the ultimate goals for Alipay engineers.
Alipay's architecture evolved through three distinct stages:
Stage 1 (2004‑2011): Early Simplified Architecture
The system used commercial load balancers (LB) to route traffic to an entry gateway, with services exposed via virtual IP addresses (VIPs) and a single database per core system. This "physically multi-datacenter, logically single-datacenter" setup handled only tens of thousands of transactions daily, suffered from single points of failure, and required manual failover that often caused downtime.
Stage 2 (2011‑2012): Distributed Soft‑Load‑Balancing
The logical single datacenter was split into multiple sites, and hardware LB was replaced with software load balancing. The configuration center was divided into Session and Data modules, enabling service registration and real-time status notifications. Data was sharded horizontally by user UID. Sharding had to stay transparent to applications, so a middleware layer was introduced to manage the shard rules.
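The shard-rule middleware described above can be pictured with a small sketch. Everything here is an assumption made for illustration: the modulo rule, the shard count, and the names ShardRule, ShardRouter, and the database labels are hypothetical and are not taken from Alipay's actual middleware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardRule:
    """Maps an inclusive range of shard indexes to one database (hypothetical)."""
    first_shard: int
    last_shard: int
    database: str


class ShardRouter:
    """Resolves a user UID to a database so callers never see shard details."""

    def __init__(self, num_shards, rules):
        self.num_shards = num_shards
        self.rules = list(rules)
        covered = sorted(
            shard
            for rule in self.rules
            for shard in range(rule.first_shard, rule.last_shard + 1)
        )
        # Every shard must be owned by exactly one rule.
        if covered != list(range(num_shards)):
            raise ValueError("rules must cover every shard exactly once")

    def shard_of(self, uid):
        # Assumed rule: shard index is the numeric UID modulo the shard count.
        return uid % self.num_shards

    def database_for(self, uid):
        shard = self.shard_of(uid)
        for rule in self.rules:
            if rule.first_shard <= shard <= rule.last_shard:
                return rule.database
        raise LookupError(f"no rule for shard {shard}")


router = ShardRouter(4, [ShardRule(0, 1, "db_a"), ShardRule(2, 3, "db_b")])
print(router.database_for(10))  # UID 10 falls in shard 2
```

Because applications only call database_for, the rule table can be changed (for example, to move a shard range to a new database) without touching application code, which is the transparency the text asks for.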
Stage 3 (2012‑2015): Multi‑Datacenter Active‑Active Architecture
Each datacenter became an independent node with its own data shards. Horizontal data partitioning allowed theoretically unlimited scaling. Applications were deployed across 3‑4 datacenters, with backup replicas distributed to achieve multi‑site disaster recovery.
This stage brought four main benefits:
Horizontal data partitioning allows resources to be expanded without a fixed upper limit.
Independent multi-datacenter deployment keeps a failure at one site from affecting the others.
Soft load balancing keeps service calls within the local datacenter.
Overall availability and reliability are significantly improved.
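The idea of soft load balancing that keeps calls inside a datacenter can be sketched as a registry lookup that prefers local instances. This is a minimal illustration only: Instance, ServiceRegistry, and the fallback to remote instances when no local one is healthy are assumptions, not a description of Alipay's configuration center.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    address: str
    datacenter: str
    healthy: bool = True


class ServiceRegistry:
    """Hypothetical registry: services register instances, callers look them up."""

    def __init__(self):
        self._instances = {}  # service name -> list of Instance

    def register(self, service, instance):
        self._instances.setdefault(service, []).append(instance)

    def candidates(self, service, caller_dc):
        healthy = [i for i in self._instances.get(service, []) if i.healthy]
        local = [i for i in healthy if i.datacenter == caller_dc]
        # Prefer same-datacenter instances; fall back to any healthy one
        # (the fallback is an assumption of this sketch).
        return local or healthy

    def pick(self, service, caller_dc, rng=random):
        pool = self.candidates(service, caller_dc)
        if not pool:
            raise LookupError(f"no healthy instance for {service}")
        return rng.choice(pool)


registry = ServiceRegistry()
registry.register("payment", Instance("10.0.1.1:8080", "dc1"))
registry.register("payment", Instance("10.0.2.1:8080", "dc2"))
print(registry.pick("payment", "dc1").address)
```

Choosing randomly within the local pool is the simplest possible policy; a real balancer would likely weigh load and react to the status notifications mentioned earlier.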
Although this design removed many single points of failure, primary-standby database failover could still cause data loss and service interruption. Alipay therefore introduced a dedicated Failover layer. In normal operation the primary DB syncs to a standby, while a separate Failover DB stays empty and invisible to traffic. When the primary fails, traffic is switched to the Failover DB, so service can be fully restored within five minutes; once the primary is healthy again, traffic returns to it.
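The switching behavior can be sketched with in-memory dictionaries standing in for the three databases. This is a simplified model under stated assumptions: the class FailoverRouter is hypothetical, replication to the standby is modeled as an immediate copy, and merging the Failover DB's records back into the primary on recovery is an assumption the source does not spell out.

```python
class FailoverRouter:
    """Toy model of a primary, a standby, and an empty Failover DB."""

    def __init__(self):
        self.primary = {}
        self.standby = {}
        self.failover = {}  # empty and unused during normal operation
        self.primary_up = True

    def write(self, key, value):
        if self.primary_up:
            self.primary[key] = value
            self.standby[key] = value  # replication simplified to a copy
        else:
            # New writes land in the Failover DB while the primary is down.
            self.failover[key] = value

    def read(self, key):
        if self.primary_up:
            return self.primary.get(key)
        # During an outage, prefer fresh failover data, then the standby.
        if key in self.failover:
            return self.failover[key]
        return self.standby.get(key)

    def fail_primary(self):
        self.primary_up = False

    def recover_primary(self):
        # Assumed recovery step: merge outage-time writes back, then
        # empty the Failover DB so it is ready for the next incident.
        self.primary.update(self.failover)
        self.standby.update(self.failover)
        self.failover.clear()
        self.primary_up = True


db = FailoverRouter()
db.write("order-1", "paid")
db.fail_primary()
db.write("order-2", "created")
db.recover_primary()
print(sorted(db.primary))
```

Keeping the Failover DB empty means the switch does not depend on replication having caught up, which is one plausible reading of why the design avoids the data loss seen with plain primary-standby failover.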
New challenges emerged: insufficient DB connections, IDC resource limits (e.g., power cuts in Hangzhou), cross‑IDC traffic overload, and latency. These drove the need for further architectural innovations:
Eliminate DB connection bottlenecks.
Expand beyond single‑city IDC resources.
Ensure continuous business operation.
Implement blue‑green deployment to minimize user impact.
Achieve active‑active across regions.
Adopt unitization (units A, B, C) to isolate core workloads from non-core ones.
Unitization splits data horizontally by user ID and groups services into units: unit A for core services, unit B for non-core services, and unit C for globally replicated data. Because unit C's data is replicated to every region, it can be read and written locally, which reduces cross-region latency. Replication uses one of two channels: DB sync for services that are less sensitive to latency, or a reliable message bus where millisecond-level synchronization is needed.
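A rough sketch of how a request might be routed under this scheme follows. It rests on several assumptions that the source does not confirm: that units A and B are both partitioned by user ID, that the partition is a simple modulo, and that traffic is shifted by reassigning a shard to another region. The names Workload, UnitRouter, and the region labels are hypothetical.

```python
from enum import Enum


class Workload(Enum):
    CORE = "core"
    NON_CORE = "non_core"
    GLOBAL = "global"


# Unit letters follow the text: A core, B non-core, C globally replicated.
UNIT_FOR_WORKLOAD = {
    Workload.CORE: "A",
    Workload.NON_CORE: "B",
    Workload.GLOBAL: "C",
}


class UnitRouter:
    """Decides which unit, in which region, should serve a request."""

    def __init__(self, shard_to_region, num_shards):
        self.shard_to_region = dict(shard_to_region)
        self.num_shards = num_shards

    def route(self, workload, uid, caller_region):
        unit = UNIT_FOR_WORKLOAD[workload]
        if workload is Workload.GLOBAL:
            # Unit C data is replicated everywhere, so serve it locally.
            return unit, caller_region
        # Units A and B are assumed to follow the user's shard.
        return unit, self.shard_to_region[uid % self.num_shards]

    def move_shard(self, shard, region):
        # Traffic control: repoint one shard of users at another region.
        self.shard_to_region[shard] = region


router = UnitRouter({0: "region-1", 1: "region-2"}, num_shards=2)
print(router.route(Workload.CORE, 7, "region-1"))
router.move_shard(1, "region-1")  # e.g. region-2 is being drained
print(router.route(Workload.CORE, 7, "region-1"))
```

Reassigning a shard in the table is all it takes to move that group of users, which illustrates how a unit-based layout can support both traffic shifting and disaster recovery.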
These measures provide flexible traffic control, customized data flow allocation, and rapid disaster recovery, allowing Alipay to maintain high availability even during massive events like Double 11.
21CTO
