Databases 17 min read

How Ant Financial Scales Payments: Data Patterns and Disaster‑Recovery Strategies

This article examines how Ant Financial redesigned its payment system’s data layer after retiring mainframes, detailing RTO/RPO goals, CAP trade‑offs, vertical and horizontal sharding, blacklist‑based accounting DR, failover mechanisms for transactions, and the role of OceanBase in achieving strong consistency and low‑latency recovery.

ITPUB

Jun 1, 2020

How Ant Financial Scales Payments: Data Patterns and Disaster‑Recovery Strategies

Background

In May 2013 Alipay retired its last mainframe, moving from a traditional IOE (Integrated Oracle Engine) stack to a distributed architecture that must deliver financial‑grade performance and reliability on commodity hardware.

Disaster‑Recovery Metrics

RTO (Recovery Time Objective) : maximum tolerable downtime from failure to restored service.

RPO (Recovery Point Objective) : maximum tolerable data loss; RPO = 0 means no data loss is acceptable.

CAP Trade‑offs

Because consistency, availability, and partition tolerance cannot be simultaneously guaranteed, Ant Financial adopts different trade‑offs per business scenario and groups workloads into a few recurring patterns.

Simplified Payment System Model

The model follows an SOA design with four core subsystems, each backed by its own database, plus a global configuration store:

Accounting – balances and ledger‑type state.

Transaction – independent order/flow records.

User – profile information.

Operations support – internal workflow and admin data.

Data Partitioning Strategies

Vertical (module‑level) splitting separates monolithic databases into business‑specific stores, enabling horizontal scaling but introducing distributed‑transaction complexity.

Horizontal sharding distributes rows across logical tables and physical databases. A common technique is modulo‑based sharding, e.g. shard_id = account_id % 10; which maps to ten logical tables and ten physical DB instances. Middleware hides routing logic so the application continues to issue ordinary SQL.

Accounting System – Blacklist‑Based Disaster Recovery

Accounting requires strong consistency (RPO = 0). When a shard fails, the system cannot rely on asynchronous primary‑backup replication because of potential lag. Instead, a blacklist is built that contains all accounts that have been modified on the failed primary within a short window. The steps are:

Detect primary shard failure.

Identify recently updated accounts (e.g., using a change‑log or timestamp table).

Mark those accounts in a blacklist table.

During recovery, block any new write operations on blacklisted accounts.

Allow all other accounts to continue using the backup replica for reads and writes.

The blacklist typically contains only dozens to a few hundred accounts, reducing the impact from an entire shard to a tiny subset of users.

OceanBase – Paxos‑Based Strong Consistency

OceanBase is a distributed database that implements the Paxos consensus protocol. It provides:

RPO = 0 (no data loss) for single‑node failures.

RTO < 30 seconds for automatic failover.

Transparent disaster‑recovery semantics to the application layer.

After its adoption, the accounting subsystem migrated from the blacklist approach to a single OceanBase cluster, simplifying the code base.

Transaction System – Failover Shard

Transaction data is also horizontally sharded. In addition to the primary shard, a “Failover” shard is provisioned with identical schema but no data sync. Each transaction ID contains an “elastic bit”:

// Example transaction ID layout (simplified)
| 31‑24 | 23‑16 | 15‑8 | 7‑0 |
|  shard |  seq  |  type| elastic_bit |

When the elastic bit = 1 the record is stored in the primary shard; when it = 2 the record is stored in the Failover shard. Recovery procedure:

Detect primary shard outage.

Issue a control command that flips the elastic bit for the affected shard to 2.

New transactions for that shard are written to the Failover shard.

Existing in‑flight transactions are either aborted or recreated against the Failover shard.

Non‑critical operations (e.g., status updates on already‑paid orders) are paused until the primary is restored.

After primary‑backup synchronization, flip the elastic bit back to 1 and resume normal routing.

The Failover mechanism isolates the critical path (transaction creation and payment) to a few seconds while allowing less urgent work to wait.

Configuration Data

Configuration tables are read‑heavy, write‑light, and tolerate eventual consistency. Typical deployment uses a primary for writes and multiple read‑replicas or distributed caches (e.g., Redis, Memcached) for low‑latency reads.

User System

User profile data follows a similar pattern to configuration data:

One write‑master database.

Asynchronous replication to several read replicas.

In‑memory cache for hot lookups (username, email, phone).

If the write master fails, new registrations and profile updates are unavailable, but read‑only operations (login, profile view) continue from replicas.

Operations Support System

This subsystem stores “global‑state” data with a balanced read/write ratio. Because the data cannot be cleanly sharded, it must either:

Use a strongly consistent database such as OceanBase, or

Accept longer RTO/RPO during a failover (e.g., wait for primary‑backup switch).

Data Type Classification

Stateful (balanced read/write) : each write depends on the previous correct state (e.g., account balances). Hardest to satisfy both RTO and RPO.

Flow (write‑once, independent) : records are independent and can be horizontally scaled (e.g., transaction logs). Strong consistency required per record.

Configuration (read‑dominant) : writes are infrequent; reads dominate and can tolerate slight staleness.

Design Recommendations

When designing a new service, first classify each entity into one of the three data types. Then apply the appropriate pattern:

Flow data → horizontal sharding + simple primary‑replica or Failover shard.

Stateful data → keep shards small, consider strong‑consistency stores (OceanBase) or, if using traditional RDBMS, employ blacklist‑based DR to limit impact.

Configuration data → read‑write separation, cache layer, eventual consistency.

Identify the minimal set of global‑state tables and isolate them behind a strongly consistent database to avoid cascading failures.

Conclusion

Ant Financial’s experience shows that abstracting “account‑like” workloads, using a blacklist‑based disaster‑recovery scheme for legacy databases, and finally migrating to a Paxos‑based distributed database (OceanBase) can meet stringent RTO/RPO requirements while simplifying application logic. The three‑type data classification provides a practical framework for scaling and protecting financial‑grade services.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

data modeling Disaster Recovery database sharding financial technology

Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.