How Ant Financial Scales Payments: Data Patterns and Disaster‑Recovery Strategies
This article examines how Ant Financial redesigned its payment system’s data layer after retiring mainframes, detailing RTO/RPO goals, CAP trade‑offs, vertical and horizontal sharding, blacklist‑based accounting DR, failover mechanisms for transactions, and the role of OceanBase in achieving strong consistency and low‑latency recovery.
Background
In May 2013 Alipay retired its last mainframe, moving from a traditional IOE (Integrated Oracle Engine) stack to a distributed architecture that must deliver financial‑grade performance and reliability on commodity hardware.
Disaster‑Recovery Metrics
RTO (Recovery Time Objective) : maximum tolerable downtime from failure to restored service.
RPO (Recovery Point Objective) : maximum tolerable data loss; RPO = 0 means no data loss is acceptable.
CAP Trade‑offs
Because consistency, availability, and partition tolerance cannot be simultaneously guaranteed, Ant Financial adopts different trade‑offs per business scenario and groups workloads into a few recurring patterns.
Simplified Payment System Model
The model follows an SOA design with four core subsystems, each backed by its own database, plus a global configuration store:
Accounting – balances and ledger‑type state.
Transaction – independent order/flow records.
User – profile information.
Operations support – internal workflow and admin data.
Data Partitioning Strategies
Vertical (module‑level) splitting separates monolithic databases into business‑specific stores, enabling horizontal scaling but introducing distributed‑transaction complexity.
Horizontal sharding distributes rows across logical tables and physical databases. A common technique is modulo‑based sharding, e.g. shard_id = account_id % 10; which maps to ten logical tables and ten physical DB instances. Middleware hides routing logic so the application continues to issue ordinary SQL.
Accounting System – Blacklist‑Based Disaster Recovery
Accounting requires strong consistency (RPO = 0). When a shard fails, the system cannot rely on asynchronous primary‑backup replication because of potential lag. Instead, a blacklist is built that contains all accounts that have been modified on the failed primary within a short window. The steps are:
Detect primary shard failure.
Identify recently updated accounts (e.g., using a change‑log or timestamp table).
Mark those accounts in a blacklist table.
During recovery, block any new write operations on blacklisted accounts.
Allow all other accounts to continue using the backup replica for reads and writes.
The blacklist typically contains only dozens to a few hundred accounts, reducing the impact from an entire shard to a tiny subset of users.
OceanBase – Paxos‑Based Strong Consistency
OceanBase is a distributed database that implements the Paxos consensus protocol. It provides:
RPO = 0 (no data loss) for single‑node failures.
RTO < 30 seconds for automatic failover.
Transparent disaster‑recovery semantics to the application layer.
After its adoption, the accounting subsystem migrated from the blacklist approach to a single OceanBase cluster, simplifying the code base.
Transaction System – Failover Shard
Transaction data is also horizontally sharded. In addition to the primary shard, a “Failover” shard is provisioned with identical schema but no data sync. Each transaction ID contains an “elastic bit”:
// Example transaction ID layout (simplified)
| 31‑24 | 23‑16 | 15‑8 | 7‑0 |
| shard | seq | type| elastic_bit |When the elastic bit = 1 the record is stored in the primary shard; when it = 2 the record is stored in the Failover shard. Recovery procedure:
Detect primary shard outage.
Issue a control command that flips the elastic bit for the affected shard to 2.
New transactions for that shard are written to the Failover shard.
Existing in‑flight transactions are either aborted or recreated against the Failover shard.
Non‑critical operations (e.g., status updates on already‑paid orders) are paused until the primary is restored.
After primary‑backup synchronization, flip the elastic bit back to 1 and resume normal routing.
The Failover mechanism isolates the critical path (transaction creation and payment) to a few seconds while allowing less urgent work to wait.
Configuration Data
Configuration tables are read‑heavy, write‑light, and tolerate eventual consistency. Typical deployment uses a primary for writes and multiple read‑replicas or distributed caches (e.g., Redis, Memcached) for low‑latency reads.
User System
User profile data follows a similar pattern to configuration data:
One write‑master database.
Asynchronous replication to several read replicas.
In‑memory cache for hot lookups (username, email, phone).
If the write master fails, new registrations and profile updates are unavailable, but read‑only operations (login, profile view) continue from replicas.
Operations Support System
This subsystem stores “global‑state” data with a balanced read/write ratio. Because the data cannot be cleanly sharded, it must either:
Use a strongly consistent database such as OceanBase, or
Accept longer RTO/RPO during a failover (e.g., wait for primary‑backup switch).
Data Type Classification
Stateful (balanced read/write) : each write depends on the previous correct state (e.g., account balances). Hardest to satisfy both RTO and RPO.
Flow (write‑once, independent) : records are independent and can be horizontally scaled (e.g., transaction logs). Strong consistency required per record.
Configuration (read‑dominant) : writes are infrequent; reads dominate and can tolerate slight staleness.
Design Recommendations
When designing a new service, first classify each entity into one of the three data types. Then apply the appropriate pattern:
Flow data → horizontal sharding + simple primary‑replica or Failover shard.
Stateful data → keep shards small, consider strong‑consistency stores (OceanBase) or, if using traditional RDBMS, employ blacklist‑based DR to limit impact.
Configuration data → read‑write separation, cache layer, eventual consistency.
Identify the minimal set of global‑state tables and isolate them behind a strongly consistent database to avoid cascading failures.
Conclusion
Ant Financial’s experience shows that abstracting “account‑like” workloads, using a blacklist‑based disaster‑recovery scheme for legacy databases, and finally migrating to a Paxos‑based distributed database (OceanBase) can meet stringent RTO/RPO requirements while simplifying application logic. The three‑type data classification provides a practical framework for scaling and protecting financial‑grade services.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
