
High Availability and Disaster Recovery Architecture: The Evolution of Alipay’s System Design

This article examines the importance of high‑availability and disaster‑recovery architectures, tracing Alipay’s evolution from a simple load‑balanced setup through multi‑datacenter, failover, and unit‑based designs that address scalability, data consistency, and continuous service delivery challenges.

Architecture Digest

May 9, 2018

High availability and disaster recovery (DR) are critical for enterprise services, cloud computing, and mobile internet platforms, ensuring uninterrupted service and user confidence, especially during traffic spikes such as China’s "Double 11" shopping festival.

Early Alipay architecture (2004‑2011) relied on commercial load balancers and a single database per core system, which created single points of failure and limited DR capabilities.

In the second stage (2011‑2012), Alipay split its deployment into logical data centers, introduced soft (software‑based) load balancing, and sharded data horizontally by user ID (UID). These changes reduced single‑point bottlenecks and enabled multi‑datacenter active‑active deployment.
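As a rough illustration of UID-based horizontal sharding, the sketch below maps a numeric UID to a shard and then maps contiguous shard ranges to logical data centers. The shard count of 100, the modulo rule, and the data center names are assumptions made for this example; the source does not describe Alipay's actual sharding function.

```python
from typing import Sequence

NUM_SHARDS = 100  # assumed for illustration; the source gives no shard count


def shard_for_uid(uid: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a non-negative numeric user ID to a shard index (hypothetical modulo rule)."""
    if uid < 0:
        raise ValueError("uid must be non-negative")
    return uid % num_shards


def datacenter_for_shard(
    shard: int, datacenters: Sequence[str], num_shards: int = NUM_SHARDS
) -> str:
    """Assign contiguous shard ranges to logical data centers."""
    if not datacenters:
        raise ValueError("at least one data center is required")
    if not 0 <= shard < num_shards:
        raise ValueError("shard out of range")
    per_dc = -(-num_shards // len(datacenters))  # ceiling division
    return datacenters[shard // per_dc]


if __name__ == "__main__":
    uid = 2088123456  # made-up UID
    shard = shard_for_uid(uid)
    print(shard, datacenter_for_shard(shard, ["idc-a", "idc-b"]))
```

Because every request for a given user resolves to the same shard, no single database has to hold all users, which is the property that removes the single-database bottleneck described above.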

A dedicated Failover layer was added to handle master‑slave switchovers within minutes, preserving data integrity and minimizing service disruption.
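One way to picture a failover layer is a small router that tracks health reports for the master and switches to a standby after several consecutive failures. This is a minimal sketch under that assumption: the class name, the threshold of three failures, and the database names are hypothetical, and the source does not say how Alipay's Failover layer detects failures or promotes a standby.

```python
from typing import Optional


class FailoverRouter:
    """Route to the master until it fails repeatedly, then switch to the standby."""

    def __init__(self, master: str, standby: str, failure_threshold: int = 3) -> None:
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.active: str = master
        self.standby: Optional[str] = standby
        self.failure_threshold = failure_threshold
        self._consecutive_failures = 0

    def report_health(self, healthy: bool) -> str:
        """Record one health-check result and return the node to route to."""
        if healthy:
            # Any success resets the counter, so brief blips do not trigger a switch.
            self._consecutive_failures = 0
            return self.active
        self._consecutive_failures += 1
        if (
            self._consecutive_failures >= self.failure_threshold
            and self.standby is not None
        ):
            # Switch over. A real system would first confirm the standby
            # has all committed data; that check is omitted here.
            self.active, self.standby = self.standby, None
            self._consecutive_failures = 0
        return self.active


if __name__ == "__main__":
    router = FailoverRouter("db-master", "db-failover")
    for result in (False, False, False):
        print(router.report_health(result))
```

Requiring consecutive failures trades a slightly slower switchover for fewer false alarms. The data integrity the article mentions would depend on the omitted catch-up check, not on the routing logic shown here.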

From 2012 to 2015, the team tackled database connection limits, IDC resource constraints, and cross‑datacenter latency. It introduced a unit‑based architecture that separates core and non‑core services into distinct units (A, B, C) with localized data and independent traffic control.
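The idea of units with localized data and independent traffic control can be sketched as a routing table that sends each user to one unit and can be rewritten to drain a unit. Everything concrete here is an assumption for illustration: the unit names, the use of the last two UID digits as the routing key, and the `drain` operation are hypothetical and are not taken from the source.

```python
from bisect import bisect_right
from typing import List, Tuple


class UnitRouter:
    """Route users to units by UID range; ranges can be reassigned at runtime."""

    def __init__(self, rules: List[Tuple[int, str]]) -> None:
        # rules: (first_key, unit_name) pairs sorted by first_key, starting at 0.
        starts = [start for start, _ in rules]
        if not rules or starts[0] != 0 or starts != sorted(starts):
            raise ValueError("rules must be sorted and start at key 0")
        self._starts = starts
        self._units = [unit for _, unit in rules]

    def unit_for(self, uid: int) -> str:
        """Return the unit that owns this user's data and traffic."""
        key = uid % 100  # hypothetical routing key: last two UID digits
        return self._units[bisect_right(self._starts, key) - 1]

    def drain(self, unit: str, target: str) -> None:
        """Send all traffic for `unit` to `target`, e.g. during an outage."""
        self._units = [target if u == unit else u for u in self._units]


if __name__ == "__main__":
    router = UnitRouter([(0, "unit-1"), (50, "unit-2")])
    print(router.unit_for(1234), router.unit_for(1275))
    router.drain("unit-2", "unit-1")
    print(router.unit_for(1275))
```

Keeping the routing decision in one table means traffic can be moved between units by changing rules rather than redeploying services. In practice a drain only helps if the target unit can reach an up-to-date copy of the drained unit's data.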

Blue‑Green deployment was adopted to limit user impact during releases, using separate Blue and Green groups within each unit and gradually shifting traffic for verification.
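Gradual traffic shifting between a Blue and a Green group can be sketched as a deterministic percentage split. The function below is an illustrative assumption, not Alipay's mechanism: it hashes the UID into one of 100 buckets with CRC32 and sends the lowest buckets to Green, so raising the percentage only ever moves users from Blue to Green.

```python
import zlib


def pick_group(uid: int, green_percent: int) -> str:
    """Return "green" for roughly green_percent% of users, otherwise "blue"."""
    if not 0 <= green_percent <= 100:
        raise ValueError("green_percent must be between 0 and 100")
    # CRC32 gives a stable bucket per user that is independent of any
    # UID-digit sharding scheme; the bucket count of 100 is an assumption.
    bucket = zlib.crc32(str(uid).encode("ascii")) % 100
    return "green" if bucket < green_percent else "blue"


if __name__ == "__main__":
    # Hypothetical rollout: verify at each step before widening.
    for percent in (1, 10, 50, 100):
        green = sum(pick_group(uid, percent) == "green" for uid in range(10_000))
        print(f"{percent}% target -> {green} of 10000 users on green")
```

Because the bucket depends only on the UID, a given user sees a consistent group during verification, and setting the percentage back to 0 returns everyone to Blue.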

The final unit‑based, multi‑active architecture provides flexible traffic control, scalable resource allocation, and rapid disaster recovery across data centers.



