Alipay Double‑11 System Stability Practices: Distributed Architecture, Elastic Scaling, Service Mesh, Full‑Chain Load Testing, Intelligent Monitoring, and OceanBase
The presentation traces Alipay's evolution through three stability phases: capacity, elastic cloud‑native architecture, and green computing. It covers unit‑based deployment, elastic scaling, ServiceMesh, full‑chain load testing, intelligent monitoring, and the OceanBase distributed database, and describes how these techniques supported 99.99% availability during the 2021 Double‑11 peak.
On April 27, the first Global Information System Stability Summit was held in Beijing, where Ant Group was recognized for its exemplary system stability practices during the Double‑11 shopping festival.
Shi Shiqun, Deputy General Manager of Ant Group's Digital Technology Division, delivered a speech titled “Alipay System Double‑11 Stability Experience Sharing,” describing the evolution of Alipay’s architecture from 2004 to 2021.
The evolution is divided into three stages:
Stage 1 focused on capacity, using the Logical Data Center (LDC) architecture, elastic capabilities, and OceanBase to achieve theoretically unlimited scaling, validated by full‑chain load testing.
Stage 2 emphasized architectural stability and efficiency through cloud‑native designs such as ServiceMesh and intelligent monitoring, enabling rapid incident response.
Stage 3 targeted green computing, achieving zero‑cost capacity growth while saving 640,000 kWh of electricity and cutting carbon emissions by 394 tons during the 2021 Double‑11 event.
Key technical components include:
Unit‑based Deployment: The Logical Data Center (LDC) architecture partitions IDC resources into self‑contained units (RZone, GZone, CZone) to eliminate single‑point bottlenecks and provide geographic disaster recovery.
Elastic Architecture: Elastic units can be scaled out to the cloud during traffic spikes and pulled back on premises after the event, cutting resource costs by over 50%.
Service Mesh: Deployed as sidecar proxies, ServiceMesh handles traffic control, rate limiting, and circuit breaking without application code changes. It carries 100% of Alipay’s core payment flow, at a scale of millions of containers and tens of millions of QPS.
Full‑Chain Online Load Testing: Uses end‑to‑end user behavior models, production‑like test environments, and comprehensive performance diagnostics to achieve >99% simulation accuracy and zero major incidents.
Intelligent Monitoring: Ant’s proprietary time‑series database CeresDB enables fault detection within one minute, root‑cause analysis within five minutes, and recovery within ten minutes.
Green Computing: Hybrid offline‑online deployment, cloud‑native scheduling, and AI‑driven capacity planning reduce energy consumption and carbon emissions.
OceanBase Distributed Database: A native HTAP database offering unlimited scaling, automatic disaster recovery within 30 seconds, and strong consistency for high‑availability financial services.
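To make the unit‑based deployment idea concrete, the following is a minimal Python sketch of how a request might be routed to a unit and how traffic could be shifted when a unit fails. It is an illustration under stated assumptions, not Alipay's implementation: the talk does not describe the routing rule, so the shard key (last two digits of the user ID), the shard count, the unit names, and the reading of RZone as user‑sharded, GZone as globally shared, and CZone as per‑city are all assumptions, as are the function names.

```python
from dataclasses import dataclass

# Illustrative assumptions, not details from the talk:
# 100 user shards (00-99) spread evenly across four RZone units.
RZONE_UNITS = ["RZ00", "RZ01", "RZ02", "RZ03"]
SHARD_COUNT = 100


@dataclass(frozen=True)
class Route:
    zone_type: str  # "RZone", "GZone", or "CZone"
    unit: str       # hypothetical unit name


def shard_of(user_id: str) -> int:
    """Derive a shard number from the last two digits of a numeric user ID."""
    return int(user_id[-2:]) % SHARD_COUNT


def build_shard_map(units):
    """Assign contiguous shard ranges to units as evenly as possible."""
    return {s: units[s * len(units) // SHARD_COUNT] for s in range(SHARD_COUNT)}


def route(user_id, data_scope, shard_map, local_city="HZ"):
    """Pick a unit for a request.

    data_scope is a hypothetical label: "global" for data shared by all
    users, "city" for data replicated per city, anything else for
    per-user data that lives in exactly one RZone unit.
    """
    if data_scope == "global":
        return Route("GZone", "GZ00")
    if data_scope == "city":
        return Route("CZone", f"CZ-{local_city}")
    return Route("RZone", shard_map[shard_of(user_id)])


def fail_over(shard_map, failed_unit, backup_unit):
    """Return a new shard map that sends the failed unit's shards to a backup."""
    return {s: (backup_unit if u == failed_unit else u) for s, u in shard_map.items()}


shard_map = build_shard_map(RZONE_UNITS)
before = route("2088000000000137", "user", shard_map)
after = route("2088000000000137", "user", fail_over(shard_map, "RZ01", "RZ03"))
print(before.unit, "->", after.unit)
```

Because each user's traffic resolves to one self‑contained unit, adding capacity in this sketch means adding units and rebalancing the shard map, and disaster recovery reduces to rewriting the map entries of the failed unit.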
Ant Group has packaged these capabilities into the SOFAStack cloud‑native product suite, making them available to external customers to accelerate digital transformation across industries.
The speaker concluded by thanking the audience.
AntTech