
How Alipay Built Seamless High Availability and Disaster Recovery for Millions of Transactions

This article traces Alipay's evolution from a simple, logically single‑datacenter setup to a multi‑datacenter active‑active, unit‑based architecture. It covers the technical challenges of high availability, disaster recovery, failover design, and blue‑green deployment, and how the resulting design keeps service continuous during massive traffic spikes such as Double 11.

21CTO

High availability and disaster recovery are critical for enterprise services, cloud computing, and mobile internet platforms, ensuring correct business processing and uninterrupted service to maintain user confidence, especially during traffic peaks.

The risk of data loss demands a robust disaster recovery (DR) system. At a minimum, such a system protects the data itself (data DR). At the most advanced level, it also keeps application services running without interruption (application DR).

During the global Double 11 shopping festival, Alipay processed 91.217 billion CNY in a single day, with 68% of that on mobile and a peak transaction rate of 140,000 per second. Traffic at this scale tests every part of the platform's IT capabilities, which is why continuous availability and rapid disaster recovery are the ultimate goals for Alipay engineers.

Alipay's architecture evolved through three distinct stages:

Stage 1 (2004‑2011): Early Simplified Architecture

The system used commercial load balancers (LB) to route traffic to an entry gateway, with services exposed via VIPs and a single database per core system. This "physically multi‑datacenter, logically single‑datacenter" setup handled only tens of thousands of transactions daily, suffered from single‑point failures, and required manual failover that often caused downtime.

Stage 2 (2011‑2012): Distributed Soft‑Load‑Balancing

The logical single datacenter was split into multiple sites, and hardware load balancers were replaced with software load balancing. The configuration center was divided into Session and Data modules, which enabled service registration and real‑time status notifications. Data was sharded horizontally by user ID (UID), which required a middleware layer to manage the shard rules so that sharding stayed transparent to applications.
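To make the UID‑based sharding concrete, here is a minimal sketch of what a shard rule managed by such a middleware might look like. The `ShardRule` class, the choice of the last two UID digits as the shard key, and the database names are all assumptions made for illustration; the source does not describe Alipay's actual rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardRule:
    """Hypothetical shard rule: UIDs are spread over `shard_count` logical shards."""

    shard_count: int
    databases: tuple  # physical database names; shard_count must divide evenly

    def shard_of(self, uid: str) -> int:
        # Assumption: the last two digits of the UID select the logical shard.
        return int(uid[-2:]) % self.shard_count

    def database_of(self, uid: str) -> str:
        # Consecutive logical shards are grouped onto one physical database.
        shards_per_db = self.shard_count // len(self.databases)
        return self.databases[self.shard_of(uid) // shards_per_db]


# Illustrative rule: 100 logical shards spread over 4 physical databases.
rule = ShardRule(shard_count=100, databases=("db_00", "db_01", "db_02", "db_03"))
```

Because the application only calls `database_of(uid)`, the mapping from shards to databases can change in the rule without touching application code, which is the transparency the middleware is meant to provide.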

Stage 3 (2012‑2015): Multi‑Datacenter Active‑Active Architecture

Each datacenter became an independent node with its own data shards. Horizontal data partitioning allowed theoretically unlimited scaling. Applications were deployed across 3‑4 datacenters, with backup replicas distributed to achieve multi‑site disaster recovery.

This stage brought four main benefits:

Horizontal data splitting lets resources scale out without a fixed ceiling.

Independent deployment in multiple datacenters keeps a failure at one site from affecting the others.

Soft load balancing keeps service calls inside the caller's own datacenter.

Overall availability and reliability improve significantly.
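The idea of keeping service calls inside one datacenter can be sketched as a registry lookup that prefers healthy instances in the caller's own datacenter and only falls back to other sites when none are available. The `Registry` class and its method names are hypothetical, and the cross‑datacenter fallback is an assumption rather than something the source specifies.

```python
import random
from collections import defaultdict


class Registry:
    """Toy in-memory service registry (illustrative only)."""

    def __init__(self):
        # service name -> list of instance records
        self._instances = defaultdict(list)

    def register(self, service, datacenter, address):
        self._instances[service].append(
            {"dc": datacenter, "addr": address, "up": True}
        )

    def mark_down(self, service, address):
        # Stands in for a real-time status notification from the config center.
        for inst in self._instances[service]:
            if inst["addr"] == address:
                inst["up"] = False

    def pick(self, service, caller_dc, rng=random):
        healthy = [i for i in self._instances[service] if i["up"]]
        if not healthy:
            raise LookupError(f"no healthy instance of {service}")
        # Prefer instances in the caller's datacenter; otherwise use any healthy one.
        local = [i for i in healthy if i["dc"] == caller_dc]
        return rng.choice(local or healthy)["addr"]
```

A caller in `dc1` keeps resolving to `dc1` addresses until all of them are marked down, at which point the lookup degrades to another site instead of failing.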

This architecture solved many single‑point issues, but master‑slave database failover still caused data loss and service interruption. To address this, Alipay introduced a dedicated Failover layer. The primary DB syncs to a standby as before, while a separate Failover DB stays empty and invisible during normal operation. When the primary fails, traffic is switched to the Failover DB, which allows full recovery within five minutes. After recovery, traffic returns to the primary.
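The switching behavior of the Failover layer can be sketched roughly as follows. Dictionaries stand in for databases, the `FailoverRouter` name is hypothetical, and the merge‑back step in `recover` is an assumption: the source only says that traffic returns to the primary, not how records written during the outage are reconciled.

```python
class FailoverRouter:
    """Chooses which database receives writes (illustrative sketch)."""

    def __init__(self):
        self.primary_healthy = True
        self.primary = {}   # stands in for the primary DB
        self.failover = {}  # empty and unused while the primary is healthy

    def write(self, key, value):
        # Normal operation writes to the primary; an outage diverts to failover.
        target = self.primary if self.primary_healthy else self.failover
        target[key] = value

    def fail(self):
        self.primary_healthy = False

    def recover(self):
        # Assumption: records captured during the outage are merged back
        # before traffic returns to the primary.
        self.primary.update(self.failover)
        self.failover.clear()
        self.primary_healthy = True
```

Keeping the Failover DB empty means it never has to catch up with the primary before taking traffic, which is one plausible reason the switch can be fast.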

New challenges emerged: insufficient DB connections, IDC resource limits (e.g., power cuts in Hangzhou), cross‑IDC traffic overload, and latency. These drove the need for further architectural innovations:

Eliminate DB connection bottlenecks.

Expand beyond single‑city IDC resources.

Ensure continuous business operation.

Implement blue‑green deployment to minimize user impact.

Achieve active‑active across regions.

Adopt unitization (units A, B, C) to isolate core and tail workloads.
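As an illustration of the blue‑green goal in the list above, the sketch below shifts a configurable percentage of users to the new (green) group while everyone else stays on the current (blue) group. The hash‑based bucketing and the function names are assumptions chosen for the example, not a description of Alipay's tooling.

```python
import hashlib


def bucket(uid: str) -> int:
    """Map a user ID to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(uid.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_group(uid: str, green_percent: int) -> str:
    """Send roughly `green_percent` percent of users to the green group."""
    if not 0 <= green_percent <= 100:
        raise ValueError("green_percent must be between 0 and 100")
    return "green" if bucket(uid) < green_percent else "blue"
```

Because a user's bucket never changes, raising `green_percent` only moves additional users to green and never bounces a user back to blue, and setting it to 0 rolls everyone back.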

Unitization splits data horizontally by user ID and groups workloads into three kinds of units: unit A for core services, unit B for non‑core services, and unit C for globally replicated data. Unit C enables read‑write locality and reduces cross‑region latency. Data is replicated in one of two ways: database sync for services that are less sensitive to latency, or a reliable message bus when synchronization must complete within milliseconds.
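A rough sketch of how a request might be assigned to a unit, and how a replication channel might be chosen, is shown below. The service names, the 1000 ms threshold, and the use of the last two UID digits to pick a unit A instance are all hypothetical; the source gives only the A/B/C split and the two replication options.

```python
CORE_SERVICES = frozenset({"payment", "account"})      # hypothetical names
GLOBAL_DATA_SERVICES = frozenset({"exchange_rate"})    # hypothetical names


def unit_for(service: str) -> str:
    """Return the unit type (A, B, or C) that should serve a service."""
    if service in GLOBAL_DATA_SERVICES:
        return "C"  # globally replicated data, served locally
    if service in CORE_SERVICES:
        return "A"  # core services, split by user ID
    return "B"      # everything else is treated as non-core


def unit_a_instance(uid: str, instance_count: int) -> int:
    """Assumption: the last two UID digits select the unit A instance."""
    return int(uid[-2:]) % instance_count


def replication_channel(max_lag_ms: int) -> str:
    """Pick a replication path from the lag a service can tolerate."""
    # Assumption: anything needing sub-second sync goes over the message bus.
    return "message_bus" if max_lag_ms < 1000 else "db_sync"
```

For example, under these assumptions a payment request for a UID ending in 37 with four unit A instances would go to instance 1, while an exchange‑rate lookup would be served from the local copy in unit C.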

These measures provide flexible traffic control, customized data flow allocation, and rapid disaster recovery, allowing Alipay to maintain high availability even during massive events like Double 11.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
