
High Availability and Disaster Recovery Architecture: The Evolution of Alipay’s System Design

This article examines the importance of high‑availability and disaster‑recovery architectures, tracing Alipay’s evolution from a simple load‑balanced setup through multi‑datacenter, failover, and unit‑based designs that address scalability, data consistency, and continuous service delivery challenges.

Architecture Digest

High availability and disaster recovery (DR) are critical for enterprise services, cloud computing, and mobile internet platforms, ensuring uninterrupted service and user confidence, especially during traffic spikes such as China’s "Double 11" shopping festival.

Early Alipay architecture (2004‑2011) relied on commercial load balancers and a single database per core system, which created single points of failure and left little room for DR.

In the second stage (2011‑2012), Alipay split the deployment into logical data centers, introduced software load balancing, and sharded data horizontally by user ID (UID). These changes reduced single‑point bottlenecks and enabled multi‑datacenter active‑active deployment.
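As a rough illustration of UID-based horizontal sharding, the sketch below maps a numeric UID to a shard and then maps the shard to a data center. The shard count of 100, the modulo rule, the contiguous shard ranges, and the names `shard_for_uid`, `datacenter_for_shard`, `dc-a`, and `dc-b` are all assumptions made for this example; the article does not describe Alipay's actual routing rule.

```python
def shard_for_uid(uid: int, num_shards: int = 100) -> int:
    """Return the shard index for a user (assumed rule: UID modulo shard count)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    return uid % num_shards


def datacenter_for_shard(shard: int, num_shards: int, datacenters: list) -> str:
    """Assign contiguous shard ranges to data centers (hypothetical layout)."""
    if not datacenters:
        raise ValueError("at least one data center is required")
    if not 0 <= shard < num_shards:
        raise ValueError("shard out of range")
    # Ceiling division so that every shard lands in some data center.
    shards_per_dc = -(-num_shards // len(datacenters))
    return datacenters[shard // shards_per_dc]


if __name__ == "__main__":
    uid = 2088123456  # made-up UID
    shard = shard_for_uid(uid)
    print(shard, datacenter_for_shard(shard, 100, ["dc-a", "dc-b"]))
```

Because the same UID always maps to the same shard, all reads and writes for one user go to one database, which is what lets each data center serve its own slice of users actively.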

A dedicated Failover layer was added to handle master‑slave switchovers within minutes, preserving data integrity and minimizing service disruption.
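The switchover logic can be sketched as a small router that counts consecutive failed health checks and then promotes the standby. This is a minimal sketch under stated assumptions, not the Failover layer described in the article: the class names, the three-failure threshold, and the rule of promoting only a healthy standby are hypothetical, and real systems also check replication lag before promoting. "Primary" and "standby" correspond to the master and slave mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Database:
    name: str
    healthy: bool = True


class FailoverRouter:
    """Routes traffic to the primary and promotes the standby after repeated failures."""

    def __init__(self, primary: Database, standby: Database, max_failures: int = 3):
        self.primary = primary
        self.standby = standby
        self.max_failures = max_failures  # assumed threshold, not from the article
        self.failures = 0

    def record_health_check(self) -> None:
        """Run one health-check round and switch over if the threshold is reached."""
        if self.primary.healthy:
            self.failures = 0
            return
        self.failures += 1
        # Promote only when the standby is itself healthy; otherwise stay put.
        if self.failures >= self.max_failures and self.standby.healthy:
            self.primary, self.standby = self.standby, self.primary
            self.failures = 0

    def target(self) -> str:
        """Name of the database that currently receives traffic."""
        return self.primary.name


if __name__ == "__main__":
    router = FailoverRouter(Database("db-master"), Database("db-slave"))
    router.primary.healthy = False
    for _ in range(3):
        router.record_health_check()
    print(router.target())
```

Requiring several consecutive failures trades a slightly slower switchover for protection against flapping on a single missed health check.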

From 2012 to 2015, the team tackled DB connection limits, IDC resource constraints, and cross‑datacenter latency, and introduced a unit‑based architecture that separates core and non‑core services into distinct units (A, B, C), each with localized data and independent traffic control.
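One way to picture localized data with independent traffic control is a routing table from UID shards to units that can be rewritten at runtime. The sketch below is an assumption-heavy illustration: `UnitRouter`, the unit names `unit-1` and `unit-2`, the 100-shard modulo rule, and the `evacuate` operation are hypothetical, and it models only UID-partitioned units rather than the distinction between the A, B, and C units in the article.

```python
class UnitRouter:
    """Maps UID shards to units and allows traffic to be moved between units."""

    NUM_SHARDS = 100  # assumed shard count

    def __init__(self, assignments: dict):
        # assignments: unit name -> iterable of shard indices owned by that unit
        self.table = {}
        for unit, shards in assignments.items():
            for shard in shards:
                self.table[shard] = unit

    def unit_for(self, uid: int) -> str:
        """Return the unit that should serve this user."""
        shard = uid % self.NUM_SHARDS
        if shard not in self.table:
            raise LookupError(f"no unit owns shard {shard}")
        return self.table[shard]

    def evacuate(self, failed_unit: str, target_unit: str) -> int:
        """Reassign every shard of failed_unit to target_unit; return how many moved."""
        moved = 0
        for shard, unit in self.table.items():
            if unit == failed_unit:
                self.table[shard] = target_unit
                moved += 1
        return moved


if __name__ == "__main__":
    router = UnitRouter({"unit-1": range(0, 50), "unit-2": range(50, 100)})
    print(router.unit_for(2088123456))
    router.evacuate("unit-2", "unit-1")
    print(router.unit_for(2088123456))
```

Keeping the mapping in one table means a unit can be drained for maintenance or disaster recovery by changing routing alone, provided the target unit has access to an up-to-date copy of the moved users' data.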

Blue‑green deployment was adopted to limit user impact during releases: each unit contains separate Blue and Green groups, and traffic is shifted gradually between them so that a release can be verified on a small share of users first.
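The gradual shift can be sketched as a percentage-based split plus a stepwise rollout that backs out when verification fails. Everything named here is an assumption for illustration: bucketing by `uid % 100`, treating Green as the group receiving the new release, the step list, and the `verify` callback are not taken from the article.

```python
from typing import Callable, Iterable


def group_for_uid(uid: int, green_percent: int) -> str:
    """Send a fixed share of users to the green group, the rest to blue."""
    if not 0 <= green_percent <= 100:
        raise ValueError("green_percent must be between 0 and 100")
    # The same UID always lands in the same bucket, so a user does not
    # bounce between groups while the percentage stays constant.
    return "green" if uid % 100 < green_percent else "blue"


def rollout(steps: Iterable[int], verify: Callable[[int], bool]) -> int:
    """Raise the green share step by step; return to 0 if any step fails verification."""
    current = 0
    for percent in steps:
        current = percent
        if not verify(percent):
            return 0  # roll all traffic back to blue
    return current


if __name__ == "__main__":
    final = rollout([1, 10, 50, 100], verify=lambda percent: True)
    print(final, group_for_uid(2088123405, 10))
```

A production router would more likely bucket on a hash of the UID so that the rollout share is independent of any UID-based data sharding.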

The final unit‑based, multi‑active architecture provides flexible traffic control, scalable resource allocation, and rapid disaster recovery across data centers.

distributed systems, scalability, high availability, disaster recovery, failover, multi-datacenter