Understanding Ant Financial’s LDC Architecture: Unitization, CAP Analysis, and High‑TPS Design
This article explains how Ant Financial’s massive Double‑11 payment traffic is handled through logical data centers (LDC), unit‑based architecture (RZone, GZone, CZone), traffic routing, disaster‑recovery strategies, and a CAP analysis that highlights the role of OceanBase’s Paxos‑based consensus in achieving high availability and eventual consistency.
Since the first Double‑11 in 2008, Ant Financial’s payment TPS has grown from 20,000 transactions per minute to over 540,000 per second, forcing the system to break through traditional scaling limits.
The key breakthrough is the Logical Data Center (LDC) concept, which treats a distributed set of machines as a single logical unit regardless of physical location. LDC enables massive horizontal scaling by partitioning data per user (RZone), sharing immutable data globally (GZone), and providing fast local reads for data with a write‑read delay (CZone).
Unitization means each user group is assigned to an exclusive RZone, allowing independent scaling; multiple RZones can be deployed across different IDC rooms, providing both capacity and fault isolation. CZone stores data that does not need immediate consistency, allowing reads to be served locally after a short delay, dramatically reducing cross‑region traffic.
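As a rough illustration of the unitization idea above, the sketch below assigns users to RZones through a shard key derived from the user ID. The shard count, the modulo rule, and the four RZone names are assumptions made for this example; the article does not specify the actual sharding rule.

```python
# Hypothetical sketch: map users to RZones via a shard key.
# SHARD_COUNT, the modulo rule, and the RZone layout are illustrative assumptions.

SHARD_COUNT = 100  # assumed number of user shards

# Assumed layout: each RZone exclusively owns a contiguous range of shards.
RZONE_SHARDS = {
    "RZ0": range(0, 25),
    "RZ1": range(25, 50),
    "RZ2": range(50, 75),
    "RZ3": range(75, 100),
}


def shard_of(user_id: int) -> int:
    """Derive the shard from a numeric user ID (here: simple modulo)."""
    return user_id % SHARD_COUNT


def rzone_of(user_id: int) -> str:
    """Return the single RZone that owns this user's shard."""
    shard = shard_of(user_id)
    for rzone, shards in RZONE_SHARDS.items():
        if shard in shards:
            return rzone
    raise ValueError(f"shard {shard} is not owned by any RZone")


if __name__ == "__main__":
    print(rzone_of(12345))  # shard 45 falls in RZ1's range
```

Because every shard has exactly one owner, capacity can be added by splitting shard ranges across more RZones without changing how a user ID is turned into a shard.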
Traffic routing is performed by a custom Global Server Load Balancer (GSLB) and a reverse‑proxy gateway called Spanner. Requests are first directed to the IDC closest to the user, then to the appropriate RZone based on the user’s ID. If a request involves another user’s data, it may be routed to a different IDC or RZone, but the design strives to keep most traffic within a single region.
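The two-step routing described above might be sketched as follows. This is not Spanner's actual logic: the IDC names, the RZone placement table, and the user-ID rule are all hypothetical, and the choice of entry IDC is treated as already made upstream.

```python
from dataclasses import dataclass

# Assumed placement of RZones in IDCs; all names here are hypothetical.
RZONE_TO_IDC = {
    "RZ0": "idc-east",
    "RZ1": "idc-east",
    "RZ2": "idc-south",
    "RZ3": "idc-south",
}


def rzone_for_user(user_id: int) -> str:
    """Assumed rule: 100 shards by user ID, 25 consecutive shards per RZone."""
    return f"RZ{(user_id % 100) // 25}"


@dataclass
class Route:
    idc: str
    rzone: str
    crossed_idc: bool  # True when the gateway had to forward to another IDC


def route(user_id: int, entry_idc: str) -> Route:
    """Step 1 (upstream) picked entry_idc; step 2 picks the RZone by user ID."""
    rzone = rzone_for_user(user_id)
    target_idc = RZONE_TO_IDC[rzone]
    return Route(idc=target_idc, rzone=rzone, crossed_idc=(target_idc != entry_idc))


if __name__ == "__main__":
    print(route(12345, "idc-east"))  # stays in the entry IDC
    print(route(12380, "idc-east"))  # forwarded to the IDC hosting RZ3
```

The `crossed_idc` flag makes the cost visible: the fewer requests that set it, the closer the system is to the goal of keeping traffic within one region.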
Disaster recovery is organized into three levels: intra‑machine‑room, intra‑city, and inter‑city. When a machine room fails, traffic and data‑partition mappings are re‑assigned to healthy RZones, and the system switches traffic accordingly. The process is pre‑planned and automated to avoid service disruption.
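A minimal sketch of the re-assignment step, assuming a simple table from RZone to the data partitions it serves. The round-robin redistribution policy and the function name `fail_over` are assumptions for illustration; the article only states that mappings are re-assigned to healthy RZones.

```python
from typing import Dict, List


def fail_over(mapping: Dict[str, List[str]], failed: str) -> Dict[str, List[str]]:
    """Return a new mapping in which the failed RZone's partitions are
    spread round-robin over the remaining healthy RZones (assumed policy)."""
    healthy = sorted(zone for zone in mapping if zone != failed)
    if not healthy:
        raise RuntimeError("no healthy RZone left to take over")
    # Copy the lists so the caller's mapping is left untouched.
    new_mapping = {zone: list(mapping[zone]) for zone in healthy}
    for i, partition in enumerate(mapping.get(failed, [])):
        new_mapping[healthy[i % len(healthy)]].append(partition)
    return new_mapping


if __name__ == "__main__":
    before = {"RZ0": ["a"], "RZ1": ["b"], "RZ2": ["c"], "RZ3": ["d"]}
    print(fail_over(before, "RZ1"))
```

Building a new mapping instead of mutating the old one fits the pre-planned approach described above: the replacement table can be prepared and checked ahead of time, then swapped in when traffic is switched.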
CAP analysis shows that the overall system is AP (high Availability and Partition tolerance), with eventual consistency (C) provided by the underlying database. The database layer uses OceanBase, a Paxos‑based distributed database that reaches consensus once a majority of replicas, (N/2)+1 out of N, agree. As long as a majority of replicas can still reach each other during a network partition, the system remains available, and the remaining replicas eventually catch up.
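The majority rule can be checked with a few lines of arithmetic. The sketch below shows only the general majority-quorum property of Paxos-style protocols, not OceanBase's implementation; the function names are made up for this example.

```python
def majority(n: int) -> int:
    """Smallest number of replicas that forms a majority of n: (n // 2) + 1."""
    return n // 2 + 1


def can_commit(reachable: int, n: int) -> bool:
    """A group of replicas can commit only if it contains a majority."""
    return reachable >= majority(n)


def tolerated_failures(n: int) -> int:
    """How many replicas may be lost while a majority still exists."""
    return n - majority(n)


if __name__ == "__main__":
    n = 5
    print(majority(n))            # 3
    print(tolerated_failures(n))  # 2
    # A 5-replica cluster split 3/2 by a partition:
    print(can_commit(3, n))       # True: the larger side keeps serving
    print(can_commit(2, n))       # False: the smaller side cannot commit
```

Since two disjoint groups can never both hold a majority, at most one side of a partition accepts writes, which is what prevents divergent histories.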
Key code snippets illustrate the mapping of RZones to data partitions and the conditional logic used to determine the system’s CAP classification:
RZ0* --> a
RZ1* --> b
RZ2* --> c
RZ3* --> d

if (no partition possibility || partition does not affect availability or consistency || partitioning is accounted for (P)) {
    if (availability is preserved under P) return "AP";
    else if (consistency is preserved under P) return "CP";
} else {
    if (has availability && has consistency) return "AC";
}

In summary, Ant Financial’s LDC architecture combines unit‑based data partitioning, intelligent traffic routing, robust disaster‑recovery mechanisms, and a Paxos‑driven database to handle massive TPS while maintaining high availability and eventual consistency.
Top Architect