Technical Architecture of Alipay and Ant Huabei for Large-Scale Promotional Events
The article explains how Alipay's multi-layered cloud architecture, logical data center design, distributed data strategies, and flexible transaction framework enable high availability, horizontal scalability, and rapid deployment for massive promotional traffic such as Double‑11, illustrated with the Ant Huabei case study.
Successful large-scale promotional events require not only system and architecture optimizations (traffic control, caching, dependency management, and performance tuning) but also long-term technical accumulation and refinement. This article first outlines Alipay's overall architecture and then uses the Ant Huabei service as an example of how a new business prepares for a major promotion from scratch.
Because the topics are extensive enough to form a series, this article only provides a high‑level overview, with deeper dives planned for future specialized sharing.
Architecture
Alipay's architecture must address the special requirements of internet finance, including higher business continuity, strong scalability, and rapid support for new services. The current architecture consists of three layers:
Operations Platform (IaaS) : Provides scalable basic resources such as network, storage, databases, virtualization, and IDC, ensuring the stability of the underlying platform.
Technical Platform (PaaS) : Offers scalable, highly available distributed transaction processing and service computing capabilities, abstracts middleware environments, and hides the complexity of underlying resources.
Business Platform (SaaS) : Delivers always‑available payment services and an open, secure development platform for payment applications.
Architecture Features
Logical Data Center Architecture
During Double-11, transaction volume doubles each year, pushing system capacity, servers, networks, databases, and data centers to their limits. To cope, Alipay introduced a logical data-center architecture that applies the idea of horizontal data sharding from the access layer downward, dividing the system into independent units with the following characteristics:
Each unit is closed to the outside, including storage access between systems.
Real-time data within a unit is isolated from other units, while data that tolerates some delay, such as member or configuration data, can be shared across units.
Inter‑unit communication is centrally controlled, preferring asynchronous messaging; synchronous messages use a unit‑proxy solution.
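The unit model above can be sketched as a routing function that maps every request to exactly one unit. This is a minimal illustration, not Alipay's actual routing code: the shard key (the last two digits of a numeric user ID), the three-unit layout, and all names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str
    shard_lo: int  # first shard owned by this unit (inclusive)
    shard_hi: int  # last shard owned by this unit (inclusive)


# Hypothetical layout: 100 shards split over three units.
UNITS = [Unit("unit-a", 0, 33), Unit("unit-b", 34, 66), Unit("unit-c", 67, 99)]


def shard_of(user_id: str) -> int:
    """Assumed shard key: the last two digits of a numeric user ID."""
    return int(user_id[-2:])


def route(user_id: str) -> Unit:
    """Pick the single unit that owns this user's shard."""
    shard = shard_of(user_id)
    for unit in UNITS:
        if unit.shard_lo <= shard <= unit.shard_hi:
            return unit
    raise LookupError(f"no unit owns shard {shard}")
```

Because the decision is made once at the access layer, every later call for the same user can stay inside the chosen unit, which is what keeps a unit closed to the outside.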
This architecture solves several key problems:
By minimizing cross‑unit interactions and using asynchronous communication, geographic deployment becomes possible, greatly improving horizontal scalability without reliance on a single‑city IDC.
It enables an N+1 disaster-recovery strategy across regions, reducing DR costs while ensuring that the disaster-recovery facilities are genuinely usable.
The system eliminates single points of failure, achieving high overall availability; multiple units in the same or different cities can serve as mutual backups, allowing near‑100% continuous availability through rapid failover.
Business‑level traffic entry and exit points become unified, controllable, and routable, enhancing overall system manageability. Features such as online stress testing, traffic control, and gray‑release, previously difficult, are now easily implemented.
The same‑city core framework was completed in 2013 and successfully withstood Double‑11, proving the architecture's practicality.
In 2015, Alipay realized a geographically distributed “active‑active” architecture based on logical data centers, where each logical data center (LDC) is fully operational and can take over traffic instantly in case of failure, offering better business continuity than the traditional “two‑site‑three‑center” model.
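A takeover of this kind can be pictured as reassigning the failed unit's shards to the units that are still running. The sketch below assumes a simple shard-to-unit table and round-robin redistribution; both the table and the function name are hypothetical, and a real failover would also have to deal with data replication state.

```python
from typing import Dict, List


def fail_over(assignment: Dict[str, List[int]], failed: str) -> Dict[str, List[int]]:
    """Return a new assignment with the failed unit's shards spread
    round-robin over the surviving units. The input is not modified."""
    survivors = sorted(unit for unit in assignment if unit != failed)
    if failed not in assignment or not survivors:
        raise ValueError("unknown unit or no surviving unit to take over")
    result = {unit: list(assignment[unit]) for unit in survivors}
    for i, shard in enumerate(assignment[failed]):
        result[survivors[i % len(survivors)]].append(shard)
    return result
```

For example, with three units owning shards `[0, 1]`, `[2, 3]`, and `[4, 5]`, losing the middle unit hands shard 2 to the first unit and shard 3 to the third.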
Beyond disaster recovery, the logical data‑center design supports blue‑green (or gray) releases. Each LDC is split into two logical sub‑centers, A and B, which are functionally identical. Under normal operation, requests are randomly routed to A or B. When blue‑green mode is activated, routing isolates A from B, allowing independent deployment and testing.
Blue‑green release steps:
Before release, set blue traffic to 0% and deploy the new version to all blue applications in two groups, in no particular order.
Gradually increase blue traffic from 1% while monitoring; if stable, ramp up to 100%.
Set green traffic to 0% and deploy the green version similarly.
Return to normal operation, with blue and green units each handling 50% of live traffic.
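The traffic ramp in these steps can be sketched as a percentage-based router plus a small ramp loop. This is an illustration under assumptions: the article says normal routing is random, whereas the sketch hashes the user ID so that a given user stays on one side during the ramp; the step values and the rollback-to-zero behaviour on a failed health check are also assumptions.

```python
import zlib
from typing import Callable, Iterable


def pick_group(user_id: str, blue_percent: int) -> str:
    """Route a user to 'blue' or 'green' given the share of traffic for blue."""
    if not 0 <= blue_percent <= 100:
        raise ValueError("blue_percent must be between 0 and 100")
    # Stable bucket in [0, 99]; the same user always lands in the same bucket.
    bucket = zlib.crc32(user_id.encode("utf-8")) % 100
    return "blue" if bucket < blue_percent else "green"


def ramp_blue(steps: Iterable[int], is_healthy: Callable[[int], bool]) -> int:
    """Walk the blue share through `steps`, checking health at each one.
    Returns the final share, or 0 if any check fails (assumed rollback)."""
    share = 0
    for share in steps:
        if not is_healthy(share):
            return 0
    return share
```

A call such as `ramp_blue([1, 5, 20, 50, 100], check)` mirrors the "start at 1%, monitor, then ramp to 100%" step, with `check` standing in for whatever monitoring the release relies on.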
Distributed Data Architecture
During Double-11 in 2015, Alipay's throughput peaked at 85,900 transactions per second, making it the world's largest online transaction processing system. Because Alipay is highly sensitive to per-transaction cost, its data architecture emphasizes low cost, linear scalability, and distributed design.
The architecture has evolved from centralized mainframes to a distributed solution built on PC servers, aiming for vendor independence and standardization.
Scalability strategies are divided into three dimensions:
Vertical splitting by business type.
Horizontal sharding based on customer requests.
Read‑write separation and data replication for read‑heavy workloads.
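Taken together, the three dimensions determine which database a request should reach. The sketch below is only an illustration: the data source naming scheme, the shard key (last two digits of the user ID), and round-robin replica selection are assumptions, not details from the article.

```python
import itertools


class DataSourceRouter:
    """Chooses a data source name from business type, user shard, and read/write intent."""

    def __init__(self, shards: int, replicas: int):
        self.shards = shards
        self.replicas = replicas
        self._reads = itertools.count()  # round-robin counter for replicas

    def pick(self, business: str, user_id: str, is_write: bool) -> str:
        # Vertical split: each business type has its own database family.
        # Horizontal shard: derived from the user ID (assumed key).
        shard = int(user_id[-2:]) % self.shards
        if is_write:
            role = "primary"  # all writes go to the shard's primary
        else:
            role = f"replica{next(self._reads) % self.replicas}"
        return f"{business}_{shard:02d}_{role}"
```

With four shards and two replicas, a write for a user ID ending in `07` would resolve to `trade_03_primary`, while successive reads alternate between `trade_03_replica0` and `trade_03_replica1`.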
The transaction system consists of three major database clusters:
Primary transaction database cluster: Handles creation and state changes of each transaction; changes are reliably replicated to two other clusters (consumer‑record and merchant‑query). Data is horizontally sharded, with each node having a standby and failover node for sub‑second switchover.
Consumer‑record database cluster: Improves user experience for consumers.
Merchant‑query database cluster: Improves experience for merchants.
To keep these data nodes transparent to upper‑level applications, Alipay built a middleware product that provides elastic expansion of transaction data.
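One common way for such middleware to offer elastic expansion is to hash users onto a fixed number of logical shards and keep a separate, changeable table from logical shard to physical node. The article does not describe the product's internals, so the class below is a hypothetical sketch of that general technique rather than a description of Alipay's middleware.

```python
from typing import Dict, List


class ShardMap:
    """Maps users to logical shards, and logical shards to physical nodes."""

    def __init__(self, logical_shards: int, nodes: List[str]):
        self.logical_shards = logical_shards
        # Initial placement: spread logical shards evenly over the nodes.
        self.owner: Dict[int, str] = {
            shard: nodes[shard % len(nodes)] for shard in range(logical_shards)
        }

    def node_for(self, user_id: str) -> str:
        """Applications call this; they never see physical placement rules."""
        return self.owner[int(user_id[-2:]) % self.logical_shards]

    def move(self, shard: int, new_node: str) -> None:
        """Expansion: hand one logical shard to a new node (after its data is copied)."""
        self.owner[shard] = new_node
```

Because the user-to-shard mapping never changes, adding a node only requires moving some logical shards, and applications keep calling `node_for` unchanged.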
Data Reliability
In a distributed data architecture, maintaining ACID properties while achieving high availability and scalability is challenging. Alipay designed a flexible transaction framework that follows a two‑phase commit protocol but adds many optimizations to preserve ACID and ensure eventual consistency, referred to as the “flexible transaction” strategy.
Implementation highlights:
A business activity consists of a primary service and several subsidiary services.
The primary service initiates and completes the whole activity.
Subsidiary services provide TCC‑style operations.
The activity manager records operations and, upon commit, invokes all confirm actions; upon cancel, it invokes all cancel actions.
Compared with classic 2PC:
No separate Prepare phase, reducing protocol overhead.
Higher fault tolerance and simpler recovery.
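The roles described above can be sketched in a few lines: an activity manager records each participant's confirm and cancel actions while the primary service calls the try operations as part of normal business processing. All class and function names here are hypothetical, and durable logging and recovery, which a production framework would need, are left out.

```python
from typing import Callable, List, Tuple

Action = Callable[[], None]


class ActivityManager:
    """Records TCC participants and drives confirm or cancel for all of them."""

    def __init__(self) -> None:
        self._log: List[Tuple[Action, Action]] = []

    def enlist(self, try_op: Action, confirm_op: Action, cancel_op: Action) -> None:
        try_op()  # the try step runs inline, as part of the business call
        self._log.append((confirm_op, cancel_op))

    def commit(self) -> None:
        for confirm, _ in self._log:
            confirm()
        self._log.clear()

    def cancel(self) -> None:
        for _, cancel in reversed(self._log):
            cancel()
        self._log.clear()


class Account:
    """Example subsidiary service: try freezes funds, confirm deducts, cancel releases."""

    def __init__(self, balance: int) -> None:
        self.balance = balance
        self.frozen = 0

    def try_debit(self, amount: int) -> None:
        if self.balance - self.frozen < amount:
            raise ValueError("insufficient funds")
        self.frozen += amount

    def confirm_debit(self, amount: int) -> None:
        self.frozen -= amount
        self.balance -= amount

    def cancel_debit(self, amount: int) -> None:
        self.frozen -= amount


def run_activity(steps: List[Tuple[Action, Action, Action]]) -> bool:
    """Primary service: start the activity, then commit or cancel the whole thing."""
    manager = ActivityManager()
    try:
        for try_op, confirm_op, cancel_op in steps:
            manager.enlist(try_op, confirm_op, cancel_op)
    except Exception:
        manager.cancel()  # only participants whose try succeeded are cancelled
        return False
    manager.commit()
    return True
```

If any try operation fails, only the participants that already reserved resources are cancelled, so no account is left with frozen funds.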
Key component – asynchronous reliable messaging strategy:
Important design points:
If a failure occurs in steps 2-4, the business system decides whether to roll back or compensate. If a failure occurs in steps 6-7, the message center must query the producer. If step 8 fails, the message center retries. Confirmation messages are encapsulated by the message center and are invisible to the application.
This mechanism guarantees message integrity and, consequently, final data consistency across systems using asynchronous reliable messaging.
Some business pre‑checks require the message center to provide conditional query mechanisms.
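The recovery behaviour described here (query the producer when a confirmation is missing, retry when delivery fails) can be sketched as follows. This does not reproduce the numbered steps of the original flow; the class, its method names, and the two callbacks are assumptions chosen to illustrate the idea.

```python
from typing import Callable, Dict


class MessageCenter:
    """Stores messages before the producer commits, then delivers them reliably."""

    def __init__(
        self,
        query_producer: Callable[[str], bool],
        deliver: Callable[[str, str], bool],
    ) -> None:
        self.query_producer = query_producer  # did the producer's local transaction commit?
        self.deliver = deliver                # push to subscriber; True means acknowledged
        self.pending: Dict[str, str] = {}     # stored, not yet confirmed
        self.ready: Dict[str, str] = {}       # confirmed, awaiting acknowledgement

    def store(self, msg_id: str, body: str) -> None:
        self.pending[msg_id] = body

    def confirm(self, msg_id: str, committed: bool) -> None:
        body = self.pending.pop(msg_id, None)
        if body is not None and committed:
            self.ready[msg_id] = body  # rolled-back messages are simply dropped

    def recover_pending(self) -> None:
        """For messages whose confirmation never arrived, ask the producer."""
        for msg_id in list(self.pending):
            self.confirm(msg_id, self.query_producer(msg_id))

    def dispatch(self, max_attempts: int = 3) -> None:
        """Deliver ready messages, retrying until acknowledged or attempts run out."""
        for msg_id, body in list(self.ready.items()):
            for _ in range(max_attempts):
                if self.deliver(msg_id, body):
                    del self.ready[msg_id]
                    break
```

Retrying means a subscriber may see the same message more than once, so a design like this relies on consumers handling duplicates idempotently.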
Ant Huabei
Ant Huabei is a new “buy now, pay later” service that achieved a 99.99% success rate and an average payment latency of 0.035 seconds during Double‑11, matching major bank channels.
Within a year, its throughput grew from 10 TPS at launch to a peak of 21,000 TPS on Double-11, fully supported by Ant Financial's cloud architecture.
In December 2014, the team migrated the system to the financial cloud, integrating channel, business, core, and data layers to provide a unified user experience.
By April 2015, Ant Huabei adopted the cloud’s unit‑based (LDC) construction, enabling geographic distribution, high scalability, and traffic control. Deep integration with the cloud’s accounting system provided failover capabilities, ensuring both same‑city and cross‑region disaster recovery without affecting users.
Risk control is performed in real time: as soon as a buyer places an order, multiple fraud-detection and credit-risk models run in parallel and complete within 20 ms, deciding whether the transaction is safe before the buyer reaches the checkout.
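Running several models in parallel under a fixed time budget can be sketched with a thread pool and a deadline. The 20 ms budget comes from the text; the model signature, the shared pool, and the choice to treat a late or failing model as "not safe" are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable, Dict, Iterable

RiskModel = Callable[[Dict], bool]  # returns True when the order looks safe

# Shared pool so a slow model does not block the caller after the deadline.
_POOL = ThreadPoolExecutor(max_workers=8)


def is_safe(order: Dict, models: Iterable[RiskModel], timeout_s: float = 0.020) -> bool:
    """Run all models concurrently; the order is safe only if every model
    answers in time, without error, and approves it."""
    futures = [_POOL.submit(model, order) for model in models]
    done, not_done = wait(futures, timeout=timeout_s)
    if not_done:
        return False  # assumed policy: a missing verdict counts as unsafe
    return all(f.exception() is None and f.result() for f in done)
```

The overall latency is bounded by the deadline rather than by the sum of the model runtimes, which is the point of evaluating the models in parallel.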
To guarantee sufficient credit during Double-11, Ant Financial built an institutional asset center that packages loan assets into a securitized pool and issues tradable securities to raise funds. This asset-securitization platform processes more than one hundred million transactions per hour and supports tens of billions of yuan in asset transfers, helping over a million small and medium-sized enterprises obtain financing.
Summary
After years of building high‑availability architecture and promotion preparation, Ant Financial’s technical team follows a three‑pillar approach: “Strategy” (overall architectural design), “Tools” (underlying middleware and components), and “Talent” (experienced engineers).
While many share architectural ideas (“Strategy”), real success depends on solid “Tools” and seasoned “Talent” that have survived countless production incidents.
In today’s fast‑moving market, teams must quickly build platform capabilities and focus on business development, leveraging cloud‑based shared services and the decade‑long foundation of Alipay’s components and expertise.
Source: InfoQ Original article: http://www.infoq.com/cn/articles/technical-architecture-of-alipay-and-ant-check-later
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
