Understanding Ant Financial’s LDC Architecture: Partitioning, CAP Analysis, and Multi‑Active Disaster Recovery
This article explains how Ant Financial’s logical data center (LDC) architecture uses unitization, database sharding, and CAP‑aware design, including the RZone, GZone, and CZone unit types, to sustain hundreds of thousands of payment transactions per second during Double 11 while providing multi‑active disaster recovery and high availability.
Since the first Double 11 in 2009, Ant Financial’s peak payment throughput has grown from about 20,000 transactions per minute to 544,000 per second in 2019, forcing the system past the limits of its earlier architectures.
The core breakthrough is the Logical Data Center (LDC), a unit‑based architecture that treats distributed systems as logically unified despite physical dispersion.
Key Questions
What is the most crucial design behind Ant Financial’s massive payment throughput?
What is LDC and how does it achieve multi‑active and disaster‑recovery capabilities?
How do CAP, PAXOS, and partition tolerance relate to this design?
The article presents a straightforward, jargon‑free explanation of these concepts.
LDC and Unitization
LDC (Logical Data Center) contrasts with the traditional IDC (Internet Data Center): regardless of how the physical data centers are distributed, the whole system operates as one coordinated logical unit.
Unitization means splitting a large internet system into a collection of independent units (or shards) that serve disjoint user groups, so the overall system scales by adding more units.
For example, a single e‑commerce platform may handle at most 100k TPS; by splitting users into multiple units, each unit can handle 100k TPS, and the combined system can achieve N × 100k TPS.
Each unit isolates its data and services, enabling independent deployment across data centers.
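The capacity arithmetic above can be sketched in a few lines. This is a minimal illustration, not Ant Financial’s implementation: the modulo assignment and the function names are assumptions made for the example, and the 100k TPS figure is the one used in the text.

```python
def unit_for_user(user_id: int, unit_count: int) -> int:
    """Assign each user to exactly one unit, so units serve disjoint user groups.

    Modulo assignment is an assumption for illustration only.
    """
    return user_id % unit_count


def total_capacity(unit_count: int, per_unit_tps: int = 100_000) -> int:
    """Aggregate capacity when every unit is independent: N x per-unit TPS."""
    return unit_count * per_unit_tps


if __name__ == "__main__":
    print(unit_for_user(1234, 4))   # the unit index that owns user 1234
    print(total_capacity(5))        # combined capacity of five units
```

The linear scaling only holds while units share nothing on the request path; any shared component caps the total again.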
System Architecture Evolution
Early architectures placed all functions in a single monolithic application, leading to single‑point failures and limited capacity.
To address this, engineers introduced horizontal scaling, distributing traffic across multiple machines.
However, the database became the new bottleneck, prompting the adoption of master‑slave clusters and eventually sharding (partitioning) both databases and tables.
Sharding splits data by user ID (horizontal) or by business function (vertical), reducing the load on any single database instance.
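Horizontal sharding by user ID can be sketched as follows. The database count, tables per database, and the `pay_db_` / `orders_` names are hypothetical; the source does not give the actual layout.

```python
from typing import Tuple

DB_COUNT = 4         # hypothetical number of physical databases
TABLES_PER_DB = 25   # hypothetical number of tables in each database


def locate_shard(user_id: int) -> Tuple[str, str]:
    """Return the (database, table) pair that stores a given user's rows."""
    # Reduce the user ID to one of DB_COUNT * TABLES_PER_DB logical slots.
    slot = user_id % (DB_COUNT * TABLES_PER_DB)
    db_index = slot // TABLES_PER_DB      # which database holds the slot
    table_index = slot % TABLES_PER_DB    # which table inside that database
    return f"pay_db_{db_index}", f"orders_{table_index:02d}"


if __name__ == "__main__":
    print(locate_shard(123456))
```

Vertical sharding would instead pick the database from the business function (for example, orders versus accounts) before any per-user split is applied.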
Routing logic moves from the application layer to the gateway layer, allowing requests to be directed directly to the appropriate unit based on user ID.
Example routing flow:
RZ0* --> a
RZ1* --> b
RZ2* --> c
RZ3* --> d
Traffic is first resolved by a global load balancer (GSLB) that maps the client IP to the nearest IDC; the request is then forwarded to the appropriate RZone (Region Zone) for processing.
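The gateway lookup from user ID to RZone to IDC can be sketched as below. The RZone-to-IDC table is taken from the routing flow above; the rule that the last two digits of the user ID select the RZone in blocks of 25 is an assumption for illustration, and the sketch covers only the step after the entry IDC has been chosen.

```python
from typing import Tuple

# RZone prefix -> IDC, as listed in the routing flow.
RZONE_TO_IDC = {"RZ0": "a", "RZ1": "b", "RZ2": "c", "RZ3": "d"}


def rzone_for_user(user_id: int) -> str:
    """Pick the RZone that owns a user (assumed rule: last two digits, 25 per zone)."""
    slot = user_id % 100
    return f"RZ{slot // 25}"


def route(user_id: int) -> Tuple[str, str]:
    """Resolve a request to (RZone, IDC) at the gateway layer."""
    rzone = rzone_for_user(user_id)
    return rzone, RZONE_TO_IDC[rzone]


if __name__ == "__main__":
    print(route(10000012))
    print(route(88))
```

Doing this lookup at the gateway means a request that arrives at the wrong IDC is forwarded once, rather than bouncing between application servers.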
CRG Architecture (RZone, GZone, CZone)
Ant Financial classifies units into three zones:
RZone (Region Zone): Handles partitioned data for a specific user segment; each RZone can independently serve its users.
GZone (Global Zone): Stores globally shared data (e.g., system configuration) and is deployed as a single active instance per region, with standby copies for disaster recovery.
CZone (City Zone): Serves data that tolerates a delay between a write and the reads that follow it; it keeps a local copy of GZone data to serve fast reads, while writes still go through GZone.
This three‑zone model balances consistency, availability, and partition tolerance.
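The CZone read and write paths can be illustrated with a toy model. The class and method names are hypothetical, and the explicit `sync()` call stands in for whatever replication mechanism the real system uses, which the source does not describe.

```python
class GZone:
    """Single authoritative store for globally shared data."""

    def __init__(self) -> None:
        self._data = {}

    def write(self, key, value) -> None:
        self._data[key] = value

    def snapshot(self) -> dict:
        return dict(self._data)


class CZone:
    """City-local copy: reads are served locally, writes go through GZone."""

    def __init__(self, gzone: GZone) -> None:
        self._gzone = gzone
        self._local = gzone.snapshot()

    def write(self, key, value) -> None:
        self._gzone.write(key, value)          # the write path always goes to GZone

    def read(self, key):
        return self._local.get(key)            # fast local read, possibly stale

    def sync(self) -> None:
        self._local = self._gzone.snapshot()   # stand-in for asynchronous replication


if __name__ == "__main__":
    gzone = GZone()
    czone = CZone(gzone)
    czone.write("fee_rate", "0.6%")
    print(czone.read("fee_rate"))   # stale until the local copy catches up
    czone.sync()
    print(czone.read("fee_rate"))
```

The window between `write` and `sync` is the write-to-read delay: data placed in a CZone has to tolerate being read stale during that window.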
Disaster Recovery Levels
Ant’s LDC supports three disaster‑recovery tiers:
Intra‑rack/unit failover
Intra‑city (same‑city) failover
Inter‑city (cross‑region) failover
During a failure, traffic is re‑routed by updating the mapping of user ID ranges to healthy RZones, and the database partition mappings are switched accordingly.
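The re-mapping step can be sketched as a pure function over the routing table. The pre-failure layout (four RZones owning 25 ID slots each) and the takeover plan are assumptions inferred from the zone names used in this article, not documented Ant Financial configuration.

```python
from typing import Dict, Set


def fail_over(mapping: Dict[str, str], failed: Set[str],
              takeover: Dict[str, str]) -> Dict[str, str]:
    """Return a new ID-range -> RZone table with failed zones replaced."""
    return {
        id_range: takeover[zone] if zone in failed else zone
        for id_range, zone in mapping.items()
    }


def format_rule(id_range: str, zone: str) -> str:
    """Render one rule, splitting traffic evenly across a zone's A and B replicas."""
    return f"[{id_range}] --> {zone}A(50%),{zone}B(50%)"


if __name__ == "__main__":
    # Assumed layout before the failure.
    before = {"00-24": "RZ0", "25-49": "RZ1", "50-74": "RZ2", "75-99": "RZ3"}
    # Assumed scenario: the data center hosting RZ0 and RZ1 becomes unavailable.
    after = fail_over(before, failed={"RZ0", "RZ1"},
                      takeover={"RZ0": "RZ2", "RZ1": "RZ3"})
    for id_range, zone in after.items():
        print(format_rule(id_range, zone))
```

Returning a new table instead of mutating the old one makes it straightforward to publish the change atomically and to roll back once the failed zones recover.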
Example of re‑mapping after a failure:
[00-24] --> RZ2A(50%),RZ2B(50%)
[25-49] --> RZ3A(50%),RZ3B(50%)
[50-74] --> RZ2A(50%),RZ2B(50%)
[75-99] --> RZ3A(50%),RZ3B(50%)
CAP Analysis of LDC and OceanBase
The article revisits the CAP theorem, explaining consistency, availability, and partition tolerance, and how they apply to different architectural stages.
Early monolithic or single‑database systems are CP (consistent and partition‑tolerant) but lack availability.
Horizontal scaling with stateless application servers yields AP (available and partition‑tolerant) but sacrifices consistency at the database layer.
Ant’s LDC, backed by OceanBase, aims for AP with eventual consistency. OceanBase uses the PAXOS consensus algorithm, which requires a quorum of ⌊N/2⌋ + 1 of the N replicas to accept each write, so the system tolerates the loss or isolation of a minority of nodes while remaining available.
During a network partition, only the majority partition can accept writes; the minority partition’s writes are rejected, preventing split‑brain scenarios.
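The majority rule is simple enough to check directly. This sketch covers only the quorum arithmetic, not the PAXOS protocol itself; the function names are illustrative.

```python
def quorum(replica_count: int) -> int:
    """Smallest number of replicas that forms a strict majority: floor(N/2) + 1."""
    return replica_count // 2 + 1


def can_accept_writes(reachable: int, replica_count: int) -> bool:
    """A partition may commit writes only if it can reach a majority of replicas."""
    return reachable >= quorum(replica_count)


if __name__ == "__main__":
    # Five replicas split 3 / 2 by a network partition.
    print(can_accept_writes(3, 5))   # majority side
    print(can_accept_writes(2, 5))   # minority side
```

Because two disjoint groups cannot both hold more than half of the replicas, at most one side of any partition can commit writes, which is what rules out split-brain.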
Conclusion
The massive Double 11 payment throughput is achieved through:
User‑based sharding (RZone) that isolates traffic per user segment.
PAXOS‑based consensus in OceanBase to avoid split‑brain and ensure eventual consistency.
CZone’s local‑read optimization for data that tolerates a delay between writes and reads.
Robust multi‑active disaster‑recovery mechanisms across three data centers.
These design choices, combined with operational practices like traffic shaping and pre‑planned failover, enable Ant Financial to sustain hundreds of thousands of payment transactions per second at peak.
For further reading, the article lists several references on cloud system administration, MySQL semi‑sync replication, BASE theory, Keepalived, PAXOS, and OceanBase technical details.
--- © 2024 Ant Financial Architecture Blog. All rights reserved.
