
Inside Ant Financial’s LDC Architecture: Scaling Double‑11 Payments with OceanBase and CAP Theory

This article explains how Ant Financial’s logical data center (LDC) and unitized architecture, combined with OceanBase’s Paxos‑based consensus, enable the massive TPS growth for Double‑11 payments while addressing sharding, CAP trade‑offs, traffic diversion, and multi‑site disaster recovery.

Programmer DD

LDC and Unitization

Since the first Double‑11 in 2009, Ant Financial has pushed its payment systems further every year. Peak payment volume was about 20,000 payments per minute in 2010, 256,000 per second in 2017, 480,000 per second in 2018, and 544,000 per second in 2019, roughly 1,360 times the peak of the first Double‑11.

The core breakthrough behind this scale is the logical data center (LDC), a unit‑based design that abstracts physical distribution and treats the whole data center as a logically unified system.

Key Questions

What is the most decisive design behind Alipay’s massive payment throughput?

What is LDC and how does it achieve multi‑site active‑active and disaster‑recovery?

What is CAP, what does partition tolerance (P) actually mean, and what problem does Paxos solve?

What is split‑brain, and how does it relate to CAP?

Can OceanBase escape the CAP constraints?

Unitization Basics

Unitization means splitting a large internet system into independent units, each serving a distinct user segment. By deploying many units, the total TPS can be multiplied (e.g., each unit handles 100 k TPS, N units achieve N × 100 k TPS).

In practice, each unit runs its own application instances and its own database shard. Application instances connect only to the databases inside their own unit, so the number of connections no longer grows as the Cartesian product of all application instances and all databases.
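To make the connection argument concrete, here is a minimal sketch of the arithmetic. The function names and the numbers (1,000 application instances, 100 database shards, 10 units) are illustrative assumptions, not figures from Alipay.

```python
def connections_without_units(app_instances: int, db_shards: int) -> int:
    """Every application instance holds a connection to every database shard."""
    return app_instances * db_shards


def connections_with_units(app_instances: int, db_shards: int, units: int) -> int:
    """Instances and shards are split evenly across units, and an instance
    only connects to the shards inside its own unit."""
    if app_instances % units or db_shards % units:
        raise ValueError("this sketch assumes an even split across units")
    per_unit = (app_instances // units) * (db_shards // units)
    return per_unit * units


if __name__ == "__main__":
    # Hypothetical deployment sizes, chosen only for illustration.
    print(connections_without_units(1000, 100))   # 100000
    print(connections_with_units(1000, 100, 10))  # 10000
```

Under these assumptions the total connection count drops by a factor equal to the number of units, which is why adding units does not overload any single database with connections.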

System Architecture Evolution

Early architectures placed all functions in a single monolithic application, leading to single‑point failures and limited capacity.

Horizontal scaling introduced multiple application servers behind a load balancer, improving availability but still suffering from database bottlenecks.

Master‑slave database clusters alleviated read pressure but left write bottlenecks; sharding (horizontal table partitioning) and vertical partitioning (service‑oriented micro‑services) further increased capacity.

Alipay’s CRG Architecture

Alipay classifies units into three zones:

RZone (Region Zone): Handles user‑specific data via sharding; each RZone owns its database partitions.

GZone (Global Zone): Stores globally shared data (e.g., system configuration) that cannot be sharded. Only one logical instance exists; it is deployed in more than one data center purely for disaster recovery.

CZone (City Zone): Stores data that exhibits a “write‑read time gap” (e.g., member profiles). CZone provides local read access in each city, while writes go through GZone and are asynchronously replicated to the CZones.

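The “write‑read time gap” means that most data is not read until some time after it is written. As long as that gap is longer than the replication delay, a local read in CZone already sees the latest write, so local reads stay fast without returning stale data.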
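The condition behind the write‑read time gap can be sketched as a simple comparison. This is only an illustration: the function name and the timing values are assumptions, and a real system would measure replication delay rather than treat it as a constant.

```python
def local_read_is_fresh(write_time: float, read_time: float,
                        replication_delay: float) -> bool:
    """Return True if a write made at write_time has already reached the
    local (CZone-style) replica by read_time.

    All times are in seconds; replication_delay is an assumed upper bound
    on how long asynchronous replication takes.
    """
    return read_time - write_time >= replication_delay


if __name__ == "__main__":
    # A profile updated at t=0 and read an hour later: the replica caught up.
    print(local_read_is_fresh(0.0, 3600.0, replication_delay=0.5))  # True
    # The same profile read 100 ms after the write: the replica may lag.
    print(local_read_is_fresh(0.0, 0.1, replication_delay=0.5))     # False
```

Data that is frequently read immediately after being written fails this condition, which is why such data is not a good fit for a local read replica.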

Traffic Diversion and Multi‑Site Active‑Active

Traffic is first routed by global server load balancing (GSLB), based on client IP, to the nearest IDC. The request then reaches a Spanner proxy in that IDC, which forwards it to the RZone that owns the user. If the target RZone resides in another IDC, the request is forwarded there. Because every IDC serves live traffic for its own RZones, the deployment is multi‑site active‑active.
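The routing decision described above can be sketched as a lookup from user ID to RZone. Everything concrete here is a hypothetical stand-in: the zone names, the IDC names, the two-digit shard key taken from the end of the user ID, and the `route` function are assumptions for illustration, not Alipay's actual rules.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RZone:
    name: str
    idc: str
    shard_range: range  # shard keys owned by this zone


# Hypothetical layout: 100 shard keys split across two zones in two IDCs.
RZONES: List[RZone] = [
    RZone("RZ00", "idc-east", range(0, 50)),
    RZone("RZ01", "idc-south", range(50, 100)),
]


def shard_key(user_id: str) -> int:
    """Assumed sharding rule: the last two digits of a numeric user ID."""
    return int(user_id[-2:])


def route(user_id: str, entry_idc: str) -> Tuple[RZone, bool]:
    """Return the owning RZone and whether the request must leave entry_idc."""
    key = shard_key(user_id)
    for zone in RZONES:
        if key in zone.shard_range:
            return zone, zone.idc != entry_idc
    raise LookupError(f"no RZone owns shard key {key}")


if __name__ == "__main__":
    zone, cross_idc = route("2088000000000077", entry_idc="idc-east")
    print(zone.name, cross_idc)  # RZ01 True
```

The boolean in the result marks the cross-IDC case: the user entered through the nearest IDC, but the owning RZone lives elsewhere, so the proxy forwards the request.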

During disaster recovery, traffic diversion (cut‑over) reassigns user‑ID to RZone mappings and database partition ownership to healthy units, ensuring continued service.
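A cut‑over can be pictured as rewriting the shard ownership table. The sketch below is a simplified assumption: it covers only the mapping change, spreads the failed zone's shards round-robin over the healthy zones, and uses hypothetical zone names. Moving database partition ownership, which the text also mentions, is not modeled.

```python
from itertools import cycle
from typing import Dict, List


def cut_over(shard_owner: Dict[int, str], failed_zone: str,
             healthy_zones: List[str]) -> Dict[int, str]:
    """Return a new shard -> zone mapping with every shard of failed_zone
    reassigned round-robin to the healthy zones. The input is not modified."""
    if not healthy_zones:
        raise ValueError("no healthy zone available to take over")
    if failed_zone in healthy_zones:
        raise ValueError("failed zone cannot also be a takeover target")
    targets = cycle(healthy_zones)
    new_owner = dict(shard_owner)
    for shard in sorted(shard_owner):
        if shard_owner[shard] == failed_zone:
            new_owner[shard] = next(targets)
    return new_owner


if __name__ == "__main__":
    before = {0: "RZ00", 1: "RZ00", 2: "RZ01", 3: "RZ02"}
    after = cut_over(before, failed_zone="RZ00", healthy_zones=["RZ01", "RZ02"])
    print(after)  # {0: 'RZ01', 1: 'RZ02', 2: 'RZ01', 3: 'RZ02'}
```

Returning a new mapping instead of mutating the old one makes it easy to publish the change atomically and to roll back once the failed zone recovers.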

CAP Analysis

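CAP states that a distributed system can satisfy at most two of Consistency (C), Availability (A), and Partition tolerance (P). Consistency means every read observes the latest committed write; availability means every request to a healthy node receives a response; partition tolerance means the system keeps operating when the network splits the nodes into groups that cannot communicate. Because partitions cannot be ruled out in practice, the real choice during a partition is between C and A.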

A typical horizontally scaled service backed by a single database is CP (consistent, but not highly available under partition). Adding read replicas yields AP (high availability and partition tolerance, but weaker consistency).

OceanBase and CAP

OceanBase (OB) is Ant Financial’s self‑developed distributed database. It uses Paxos consensus: a transaction commits only after a majority of replicas, ⌊N/2⌋ + 1 out of N, acknowledge it. As a result, the majority side of a partition stays available, so OB behaves as AP during a partition and reaches eventual consistency (C) after the partition heals.
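The majority rule is simple arithmetic, sketched below. The helper names are illustrative and are not part of any OceanBase API.

```python
def quorum_size(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority still exists."""
    return replicas - quorum_size(replicas)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, quorum_size(n), tolerated_failures(n))
    # 3 2 1
    # 4 3 1
    # 5 3 2
```

Note that four replicas tolerate no more failures than three, which is why majority-based deployments usually use an odd number of replicas.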

During a network partition, only the side that holds a quorum can commit writes; nodes outside the quorum remain read‑only, which prevents split‑brain. After the partition heals, OB synchronizes the lagging nodes so that all replicas become consistent again.
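Why a majority quorum rules out split‑brain can be shown with a small sketch: when the replicas are split into disjoint groups, at most one group can contain more than half of them. The function and node names below are hypothetical.

```python
from typing import List


def writable_groups(groups: List[List[str]]) -> List[List[str]]:
    """Given disjoint groups of replicas separated by a network partition,
    return the groups that still hold a strict majority of all replicas."""
    total = sum(len(group) for group in groups)
    quorum = total // 2 + 1
    return [group for group in groups if len(group) >= quorum]


if __name__ == "__main__":
    # Five replicas split 3 / 2: only the larger side may commit writes.
    print(writable_groups([["a", "b", "c"], ["d", "e"]]))  # [['a', 'b', 'c']]
    # Four replicas split 2 / 2: neither side has a majority.
    print(writable_groups([["a", "b"], ["c", "d"]]))        # []
```

Since two disjoint groups cannot both exceed half of the total, the result never contains more than one group, so two sides can never accept conflicting writes at the same time.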

Conclusion

The massive Double‑11 payment capability stems from:

RZone‑based sharding that isolates user groups into independent units.

OB’s Paxos‑driven consensus that prevents split‑brain during partitions or disaster recovery.

CZone’s local‑read design that exploits the write‑read time gap for high‑speed access.

GZone’s globally shared data kept minimal to reduce cross‑zone latency.

Combined with operational techniques (traffic shaping, pre‑warming, multi‑IDC deployment), these designs enable Alipay to sustain its Double‑11 payment peaks and continue scaling.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
