How Alipay Handles 540K TPS: Inside LDC’s Unit‑Based Architecture and CAP Strategies
This article examines how Alipay handles peak traffic during Double 11, covering the LDC (Logical Data Center) unit‑based design, the RZone‑GZone‑CZone hierarchy, traffic steering, disaster‑recovery mechanisms, and how OceanBase and Paxos shape the CAP trade‑offs behind ultra‑high‑availability payments.
Background
Since the first Double 11 in 2009, Ant Financial’s payment peak has grown from 20,000 transactions per minute (roughly 333 per second) to 544,000 transactions per second in 2019, an increase of more than a thousandfold. To sustain this growth, the company introduced a logical data center (LDC) architecture based on extensive sharding and unitization.
LDC and Unitization
LDC (Logical Data Center) abstracts physical data centers into a logically unified system. Unitization means each user group is served by an isolated unit, eliminating the single‑point bottleneck of traditional databases.
Key benefits of unitization:
Scalable capacity by adding more units.
Isolation of workloads reduces cross‑unit interference.
Each unit can be deployed independently across data centers.
System Architecture Evolution
Early monolithic applications ran on a single server, leading to severe capacity limits. Horizontal scaling introduced multiple application instances sharing a single database, which improved availability but created database bottlenecks.
Introducing master‑slave replication alleviated read pressure but left write performance constrained, prompting the shift to sharding (horizontal partitioning) and eventually to unit‑based deployment.
Alipay’s Unit‑Based Architecture (CRG)
Alipay classifies units into three zones:
RZone (Region Zone): Handles user‑specific data after sharding; each zone can serve a fixed user range.
GZone (Global Zone): Stores globally shared data (e.g., system configuration) and is deployed as a single instance per region for consistency.
CZone (City Zone): Optimizes data with a “write‑read delay” pattern, storing data locally for fast reads while writes are eventually synced from GZone.
RZone and CZone together form the core of Alipay’s LDC design, while GZone provides the shared backbone.
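One way to picture the CZone pattern is a local replica that periodically pulls changes from a shared source, so reads stay local and may briefly lag behind writes. The sketch below is only an illustration under stated assumptions, not Alipay's implementation: the class names `GZoneStore` and `CZoneReplica`, the append‑only change log, and the pull‑based `sync` method are all hypothetical.

```python
class GZoneStore:
    """Hypothetical shared store: every write is appended to a change log."""

    def __init__(self):
        self._log = []  # ordered list of (key, value) changes

    def write(self, key, value):
        self._log.append((key, value))

    def changes_since(self, offset):
        """Return all changes recorded at or after the given log offset."""
        return self._log[offset:]


class CZoneReplica:
    """Hypothetical city-local copy: serves reads locally, syncs on demand."""

    def __init__(self, source):
        self._source = source
        self._data = {}
        self._offset = 0  # how much of the source log has been applied

    def sync(self):
        """Apply changes not yet seen; return how many were applied."""
        changes = self._source.changes_since(self._offset)
        for key, value in changes:
            self._data[key] = value
        self._offset += len(changes)
        return len(changes)

    def read(self, key):
        """Local read; may be stale until the next sync."""
        return self._data.get(key)
```

In this sketch a value written to the shared store is invisible to `CZoneReplica.read` until `sync` runs, which is acceptable only for data that is naturally read some time after it is written.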
Traffic Steering and Disaster Recovery
Traffic is first routed by a global server load balancer (GSLB) based on client IP, directing requests to the appropriate IDC. The request then passes through a reverse‑proxy layer (Spanner) that consults routing tables to forward the request to the correct RZone.
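The routing‑table lookup in the reverse‑proxy step can be sketched as a mapping from a user shard key to an RZone. This is a minimal sketch, not the real Spanner logic: the four‑way split, the zone names, and the choice of the last two digits of the user ID as the shard key are assumptions made for illustration.

```python
# Hypothetical routing table: shard-key range -> RZone name.
ROUTING_TABLE = [
    (range(0, 25), "RZ0"),
    (range(25, 50), "RZ1"),
    (range(50, 75), "RZ2"),
    (range(75, 100), "RZ3"),
]


def shard_key(user_id: str) -> int:
    """Assumption: the shard key is the last two digits of the user ID."""
    if len(user_id) < 2 or not user_id[-2:].isdigit():
        raise ValueError(f"cannot derive shard key from {user_id!r}")
    return int(user_id[-2:])


def route(user_id: str) -> str:
    """Return the RZone that should serve this user."""
    key = shard_key(user_id)
    for key_range, zone in ROUTING_TABLE:
        if key in key_range:
            return zone
    raise LookupError(f"no RZone configured for shard key {key}")
```

With this table, `route("2088000000000012")` resolves to `RZ0` because the shard key 12 falls in the first range. Keeping the table as data rather than code is what lets a failover change routing without redeploying applications.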
When a unit fails, traffic is re‑routed to a standby unit within the same data center (same‑room failover), to another data center in the same city (same‑city failover), or to a remote data center (cross‑region failover). The process involves two steps: first reassigning database partition ownership, then updating the user‑to‑RZone mapping.
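The reassignment step can be sketched as moving everything owned by failed zones onto the surviving ones. The function below is a hedged illustration: `fail_over` is a hypothetical helper, the round‑robin placement is an assumption, and the labels a to d and zone names simply follow the mapping listed in this section.

```python
def fail_over(ownership, failed):
    """Return a new zone -> labels mapping with failed zones emptied.

    ownership: dict such as {"RZ0": ["a"], "RZ1": ["b"], ...}
    failed:    set of zone names that can no longer serve traffic
    Orphaned labels are spread round-robin over healthy zones (an assumption).
    """
    healthy = sorted(zone for zone in ownership if zone not in failed)
    if not healthy:
        raise RuntimeError("no healthy zone left to take over")

    # Copy the healthy zones so the caller's mapping is left untouched.
    result = {zone: list(ownership[zone]) for zone in healthy}

    # Collect what the failed zones owned, in a stable order.
    orphaned = [label for zone in sorted(failed)
                for label in ownership.get(zone, [])]
    for i, label in enumerate(orphaned):
        result[healthy[i % len(healthy)]].append(label)

    # Failed zones stay in the table but own nothing.
    for zone in failed:
        result[zone] = []
    return result
```

Calling `fail_over({"RZ0": ["a"], "RZ1": ["b"], "RZ2": ["c"], "RZ3": ["d"]}, {"RZ0", "RZ1"})` leaves `RZ2` owning a and c, and `RZ3` owning b and d.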
Before failover, the mapping might look like:
RZ0* --> a
RZ1* --> b
RZ2* --> c
RZ3* --> d
After failover, the mapping might become:
RZ0* --> /
RZ1* --> /
RZ2* --> a
RZ2* --> c
RZ3* --> b
RZ3* --> d
CAP Analysis of LDC
The CAP theorem states that a distributed system can guarantee at most two of Consistency, Availability, and Partition tolerance; in practice, when a network partition occurs the system must trade consistency against availability. Alipay’s LDC aims to achieve high availability and partition tolerance (AP) while providing eventual consistency.
Key observations:
Partition tolerance is addressed by allowing operations to succeed as long as a majority of the N nodes (⌊N/2⌋ + 1) is reachable.
Availability is maintained by designing transactions that do not require all nodes to participate.
Consistency is achieved eventually through the underlying OceanBase database, which uses Paxos consensus.
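The majority rule in these observations comes down to simple arithmetic, sketched below. The helper names `quorum_size`, `has_quorum`, and `sides_with_quorum` are hypothetical; the only assumption is that a partition splits N nodes into disjoint sides.

```python
def quorum_size(n: int) -> int:
    """Smallest number of nodes that forms a strict majority of n."""
    if n < 1:
        raise ValueError("a cluster needs at least one node")
    return n // 2 + 1


def has_quorum(reachable: int, n: int) -> bool:
    """True if the reachable nodes are enough to form a majority of n."""
    return reachable >= quorum_size(n)


def sides_with_quorum(side_sizes):
    """Given the sizes of disjoint partition sides, return those with a majority."""
    total = sum(side_sizes)
    return [size for size in side_sizes if has_quorum(size, total)]
```

Because two disjoint sides cannot both hold more than half of the nodes, `sides_with_quorum` never returns more than one entry. A 3/2 split of five nodes leaves only the side of three able to proceed, and an even 2/2 split of four nodes leaves neither side able to proceed.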
OceanBase and Paxos
OceanBase (OB) is Ant’s self‑developed distributed database. It employs Paxos to achieve consensus among replicas, ensuring that only one value is committed during a network partition.
During a partition, a write proposal must be accepted by a quorum of ⌊N/2⌋ + 1 nodes; proposals that cannot reach a quorum are discarded, preventing split‑brain scenarios. After the partition heals, OB synchronizes the replicas so that they converge to a consistent state.
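The behavior described above can be imitated with a toy replica group that applies only the majority rule. This is not Paxos and not OceanBase code: there are no proposal numbers or prepare/accept phases, a single coordinator hands out log positions, and the `ReplicaGroup` class with its `propose` and `heal` methods is hypothetical.

```python
class ReplicaGroup:
    """Toy model: a write commits only if a majority of replicas accept it."""

    def __init__(self, names):
        # Each replica keeps a log mapping position -> committed value.
        self.logs = {name: {} for name in names}
        self.next_index = 0

    @property
    def quorum(self):
        return len(self.logs) // 2 + 1

    def propose(self, value, reachable):
        """Try to commit value using only the reachable replicas."""
        reachable = set(reachable)
        acceptors = [name for name in self.logs if name in reachable]
        if len(acceptors) < self.quorum:
            return False  # no majority: the proposal is discarded
        index = self.next_index
        for name in acceptors:
            self.logs[name][index] = value
        self.next_index += 1
        return True

    def heal(self):
        """After the partition ends, bring every replica up to date."""
        merged = {}
        for log in self.logs.values():
            merged.update(log)  # each position holds exactly one value
        for name in self.logs:
            self.logs[name] = dict(merged)
```

In a five‑replica group split 3/2, a proposal sent to the three‑replica side commits while one sent to the two‑replica side is rejected, so the two sides never commit conflicting values. Calling `heal` afterwards copies the committed entries to the replicas that missed them.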
Conclusions
The success of Double 11 payments stems from:
Sharding users into RZones, enabling linear capacity growth.
Using OceanBase’s Paxos‑based consensus to avoid split‑brain failures during network partitions or disaster recovery.
Deploying CZones for fast local reads of data with a natural write‑read delay.
Combining global, regional, and city zones to balance consistency and availability.
These techniques, together with operational practices such as traffic shaping and multi‑region deployment, allow Alipay to sustain hundreds of thousands of transactions per second.
IT Architects Alliance