How Alibaba Conquered Double 11: Scaling to 175k TPS with High‑Availability Architecture
Alibaba’s eight‑year Double 11 journey illustrates how the company tackled exponential business growth by inventing high‑availability middleware, precise capacity planning, unit‑based deployment, online stress testing, hybrid‑cloud elasticity, and intelligent runtime control to balance throughput, cost, and user experience during the midnight peak.
Background
Over eight Double 11 events, Alibaba's e‑commerce platform grew exponentially: transaction volume rose from 0.59 billion RMB in 2009 to 120.7 billion RMB in 2016, and peak QPS rose from 400 to 175,000. This created a world‑class “zero‑point” (midnight) challenge: delivering maximum throughput and the best possible user experience at the midnight peak within a limited cost budget.
Technical Challenges
Key challenges included horizontal scalability, precise capacity planning, rapid cost growth, fine‑grained runtime control, and stability governance. The lack of external references forced Alibaba to innovate its own high‑availability middleware across several architecture generations.
Architecture Evolution
From a centralized 2007‑2008 design, the system migrated to a layered distributed architecture with shared services, caching, and storage clusters. Over time, the architecture incorporated multi‑datacenter active‑active deployment, cross‑region unitization, and a custom distributed traffic engine capable of generating tens of millions of QPS.
Capacity Planning & Online Stress Testing
Capacity planning starts from a measured performance baseline. The measurement method evolved from offline load tests to online traffic siphoning, which gradually concentrates live traffic from many servers onto a few so that the load‑response curve of a single server can be observed. Full‑chain online stress tests expose bottlenecks that appear only under realistic, production‑scale traffic, and they surface thousands of issues each year.
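The siphoning idea can be illustrated with a minimal sketch. It assumes the load‑response curve has already been recorded as (per-server QPS, latency) pairs. The function names, the latency limit, the 30% headroom, and the sample numbers are all hypothetical and do not come from Alibaba's tooling.

```python
import math


def single_server_capacity(samples, max_latency_ms):
    """Highest per-server QPS reached before latency first breaches the limit.

    samples: iterable of (qps_per_server, latency_ms) pairs recorded while
    live traffic is concentrated onto fewer and fewer servers.
    """
    capacity = None
    for qps, latency_ms in sorted(samples):
        if latency_ms > max_latency_ms:
            break  # the curve has passed its knee; stop at the first breach
        capacity = qps
    if capacity is None:
        raise ValueError("no sample within the latency limit")
    return capacity


def servers_needed(target_peak_qps, per_server_qps, headroom=0.3):
    """Servers required for the target peak while keeping spare capacity.

    headroom is the fraction of each server's measured capacity left unused.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_qps = per_server_qps * (1.0 - headroom)
    return math.ceil(target_peak_qps / usable_qps)


# Hypothetical measurements for illustration only.
samples = [(200, 18), (400, 22), (600, 31), (800, 55), (1000, 140)]
capacity = single_server_capacity(samples, max_latency_ms=60)
fleet = servers_needed(175_000, capacity, headroom=0.3)
```

With these made-up samples the measured capacity is 800 QPS per server, so a 175,000 QPS peak with 30% headroom would need 313 servers. A real plan would repeat this per application along the call chain rather than for one service.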
Unit‑Based Deployment & Hybrid Cloud
Alibaba partitioned the system into “units” that host complete buyer‑side services, allowing independent scaling, rapid failover, and seamless data synchronization. Hybrid‑cloud elasticity lets Alibaba provision massive resources for Double 11 and release them afterward, dramatically reducing cost.
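A minimal sketch of how requests might be pinned to units follows. It assumes routing is keyed on a numeric buyer ID through a fixed slot table, which the summary does not spell out. The class, the slot count, and the unit names are hypothetical.

```python
class UnitRouter:
    """Maps buyer IDs to deployment units through a fixed slot table."""

    def __init__(self, units, slots=100):
        if not units:
            raise ValueError("at least one unit is required")
        self.units = list(units)
        self.slots = slots
        self.healthy = set(self.units)
        # Each slot has a home unit, assigned round-robin at start-up.
        self.slot_table = [self.units[i % len(self.units)] for i in range(slots)]

    def mark_down(self, unit):
        """Take a unit out of rotation, for example after a failed health check."""
        self.healthy.discard(unit)

    def mark_up(self, unit):
        """Return a known unit to rotation."""
        if unit in self.units:
            self.healthy.add(unit)

    def route(self, buyer_id):
        """Return the unit that should serve this buyer."""
        home = self.slot_table[buyer_id % self.slots]
        if home in self.healthy:
            return home
        # Failover: spread the affected buyers deterministically
        # across the units that are still healthy.
        candidates = sorted(self.healthy)
        if not candidates:
            raise RuntimeError("no healthy unit available")
        return candidates[buyer_id % len(candidates)]


router = UnitRouter(["unit-a", "unit-b", "unit-c"])  # hypothetical unit names
before = router.route(12345)
router.mark_down("unit-a")
after = router.route(12345)
```

Keeping a buyer on one home unit means the whole buyer-side call chain stays local to that unit. Only buyers whose home unit is down get rerouted, and the sketch leaves out the data synchronization that would make such a failover safe.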
Runtime Control, Flow Scheduling & Stability Governance
Dynamic throttling, downstream degradation, and traffic routing based on real‑time health metrics protect the cluster from overload. Middleware tracks call chains, aggregates stability metrics, and guides automated degradation decisions.
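The combination of throttling and degradation can be sketched as below. A token bucket stands in for whatever throttling algorithm the middleware actually uses, and the error-rate threshold, function names, and return labels are assumptions made for illustration.

```python
class TokenBucket:
    """Admits up to rate_per_sec requests per second, with a bounded burst."""

    def __init__(self, rate_per_sec, burst, now=0.0):
        self.rate = float(rate_per_sec)
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = now

    def allow(self, now):
        """Refill for the elapsed time, then try to spend one token."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def should_degrade(errors, total, threshold=0.5, min_requests=20):
    """Degrade a dependency once its recent error rate reaches the threshold."""
    if total < min_requests:
        return False  # too few calls to judge the dependency's health
    return errors / total >= threshold


def handle(now, bucket, dep_errors, dep_total):
    """Classify one request as 'rejected', 'degraded', or 'full'."""
    if not bucket.allow(now):
        return "rejected"  # throttled before it can overload the cluster
    if should_degrade(dep_errors, dep_total):
        return "degraded"  # skip the unhealthy dependency, serve a fallback
    return "full"
```

For example, with `TokenBucket(rate_per_sec=2, burst=2)` the first two calls to `handle` at time 0.0 are admitted and the third is rejected. Passing the clock in as `now` keeps the sketch deterministic. A production limiter would read a monotonic clock and take the error counts from call-chain metrics.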
Future Directions
Future work aims for finer‑grained, data‑driven, and intelligent operations: resource allocation down to each machine and CPU core, predictive traffic modeling, and autonomous decision‑making, with the goal of deterministic capacity and resource usage.
ITFLY8 Architecture Home
