
How Alipay Achieved Near‑Zero Downtime with Multi‑Datacenter Failover Architecture

This article traces the evolution of Alipay's high‑availability and disaster‑recovery architecture, from a single‑datacenter design to a multi‑datacenter, unit‑based system with failover and blue‑green deployment. It highlights the challenges, solutions, and operational benefits that keep the service running during massive traffic spikes.

21CTO

Why High Availability and Disaster Recovery Matter

In enterprise services, cloud computing, and mobile internet, highly available distributed systems are essential for reliable platform operation and user confidence, especially during traffic peaks such as Double‑11 (the November 11 shopping festival).

Definition of Disaster‑Recovery Systems

A disaster‑recovery system provides an environment that can withstand various failures—hardware, software, network, power outages, or human errors—ensuring data safety (data DR) and, in advanced cases, uninterrupted application service (application DR).

Alipay’s Architecture Evolution

Pure Stage (2004‑2011)

Early Alipay used a single logical datacenter with commercial load balancers (LB) directing traffic to a VIP‑based gateway. All core services shared one database, so disaster‑recovery capability was weak: a single node failure could interrupt service.

Naive Stage (2011‑2012)

To eliminate single points of failure, Alipay split the logical datacenter across multiple physical sites, replaced the hardware load balancers with software load balancing, and introduced a configuration center for service registration and health monitoring. Data was sharded horizontally by user ID, and a session/data module decoupled long‑lived connections.
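Horizontal sharding by user ID can be sketched as follows. This is a minimal illustration, not Alipay's implementation: the shard count, the CRC32 hash, and the `user_db_NN` naming are all assumptions made for the example.

```python
import zlib

# Assumed shard count for illustration; the article does not state one.
NUM_SHARDS = 8


def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index in [0, num_shards).

    zlib.crc32 is used because it is stable across processes, whereas
    Python's built-in hash() is randomized per process for strings.
    """
    return zlib.crc32(user_id.encode("utf-8")) % num_shards


def database_for_user(user_id: str) -> str:
    """Return the (hypothetical) database name that stores this user."""
    return f"user_db_{shard_for_user(user_id):02d}"


if __name__ == "__main__":
    for uid in ("alice", "bob", "carol"):
        print(uid, "->", database_for_user(uid))
```

Because every request for a given user lands on the same database, each shard sees only a fraction of the total connections and load.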

Mature Stage (2012‑2015)

With a multi‑datacenter active‑active architecture and a Failover layer, Alipay could switch all read/write traffic to a dedicated Failover database within minutes, preserving data consistency and minimizing user impact. Blue‑Green deployment further reduced risk by routing a small percentage of traffic to the new version before the full rollout.
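The failover switch described above might look roughly like the sketch below. The class name, the failure threshold, and the "sticky until switched back" policy are assumptions for illustration; the article does not describe how Alipay's Failover layer decides when to switch.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverRouter:
    """Send all reads and writes to the primary database, or to the
    failover database once the primary has been declared unhealthy."""

    primary: str
    failover: str
    failure_threshold: int = 3  # assumed value, not from the article
    _consecutive_failures: int = field(default=0, init=False)
    _failed_over: bool = field(default=False, init=False)

    def report(self, ok: bool) -> None:
        """Record one health-check result for the primary."""
        if ok:
            self._consecutive_failures = 0
            return
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            # Sticky: stay on the failover database until switch_back().
            self._failed_over = True

    def target(self) -> str:
        """Return the database that should receive traffic right now."""
        return self.failover if self._failed_over else self.primary

    def switch_back(self) -> None:
        """Operator action: return traffic to the primary."""
        self._failed_over = False
        self._consecutive_failures = 0


if __name__ == "__main__":
    router = FailoverRouter(primary="db-main", failover="db-failover")
    for ok in (True, False, False, False):
        router.report(ok)
    print(router.target())
```

Keeping the switch sticky avoids flapping between databases when the primary recovers intermittently; switching back is an explicit step.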

Key Challenges Solved

Eliminated database connection‑count bottlenecks through horizontal sharding.

Overcame IDC resource limits by expanding beyond a single city.

Reduced cross‑IDC latency via localized data access.

Implemented blue‑green release to minimize user disruption.

Achieved high‑availability, multi‑region active‑active deployment.
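The blue‑green release item in the list above can be illustrated with a small percentage‑based router. This is a sketch under assumptions: here "green" is taken to be the group running the new version, and users are bucketed with CRC32; neither detail comes from the article.

```python
import zlib


def release_group(user_id: str, new_version_percent: int) -> str:
    """Choose the service group for a user during a blue-green release.

    Users are placed in one of 100 stable buckets, so a given user stays
    on the same group across requests, and raising the percentage only
    ever moves users from blue to green, never back.
    """
    if not 0 <= new_version_percent <= 100:
        raise ValueError("new_version_percent must be between 0 and 100")
    bucket = zlib.crc32(user_id.encode("utf-8")) % 100
    # Assumption: "green" runs the new version, "blue" the current one.
    return "green" if bucket < new_version_percent else "blue"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    on_green = sum(release_group(u, 5) == "green" for u in users)
    print(f"{on_green} of {len(users)} users routed to the new version")
```

Setting the percentage to 0 rolls the release back, and 100 completes it.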

Unit‑Based Architecture

Alipay introduced logical units (A, B, C) where each unit contains isolated data shards and independent service groups (Blue/Green). Unit C holds globally replicated data to support read‑write separation across regions, enabling fast failover and consistent performance.
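A rough sketch of unit‑level routing follows. The 100‑shard layout keyed on the last two digits of a numeric user ID, the even split between units A and B, and the method names are all assumptions for illustration; only the roles of units A, B, and C come from the text above.

```python
NUM_SHARDS = 100  # assumed: shard = last two digits of a numeric user ID


def shard_of(user_id: str) -> int:
    """Return the shard index, assuming a numeric user ID string."""
    return int(user_id[-2:]) % NUM_SHARDS


class UnitRouter:
    """Map each request to the unit that should serve it."""

    def __init__(self) -> None:
        # Assumed split: shards 00-49 live in unit A, 50-99 in unit B.
        self.owner: dict[int, str] = {
            s: ("A" if s < 50 else "B") for s in range(NUM_SHARDS)
        }

    def unit_for(self, user_id: str, data_kind: str = "user") -> str:
        """User-sharded data goes to the owning unit; globally
        replicated data is served by unit C."""
        if data_kind == "global":
            return "C"
        return self.owner[shard_of(user_id)]

    def fail_over(self, failed: str, takeover: str) -> None:
        """Reassign every shard owned by a failed unit to another unit."""
        for shard, unit in self.owner.items():
            if unit == failed:
                self.owner[shard] = takeover


if __name__ == "__main__":
    router = UnitRouter()
    print(router.unit_for("2088000000000012"))
    router.fail_over(failed="A", takeover="B")
    print(router.unit_for("2088000000000012"))
```

Because ownership is tracked per shard, traffic can be moved one shard at a time or a whole unit at once by editing the same table.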

Summary

More flexible traffic control allows rapid, fine‑grained data switching.

Customizable resource allocation per data unit supports scalable transaction volumes.

Fast disaster‑recovery enables seamless traffic migration between Blue/Green groups, units, datacenters, and even cities.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
