Capacity Planning, Full‑Link Stress Testing, and Traffic Control for Alibaba's Double‑11 Mega‑Event
The article explains how Alibaba prepares for the massive traffic spikes of the Double‑11 shopping festival: systematic capacity planning, a four‑stage capacity assessment, several single‑machine stress‑test techniques, and a full‑link stress‑testing platform. It also describes a flexible traffic‑control framework that prevents overload and avalanche effects.
Double‑11 began in 2009, and the 2013 edition marked a turning point: Alibaba adopted full‑link stress testing that year to cope with unprecedented traffic peaks. The scale involved is illustrated by the figures the article cites: 101.2 billion CNY in sales within 24 hours and a peak of 172 k transactions created per second.
Capacity planning became essential to determine how many machines each business system needed during large‑scale promotions, answering when to add or remove resources while balancing stability and cost.
The planning process is divided into four stages:
Business traffic estimation – using historical data and prediction algorithms to forecast future request volumes.
System capacity evaluation – calculating the number of machines required for each subsystem.
Capacity fine‑tuning – employing full‑link stress tests to simulate real‑world user behavior and adjust capacity thresholds.
Traffic control – configuring rate‑limiting and protection measures to keep the system within safe operating limits.
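The first two stages can be illustrated with a little arithmetic. The sketch below is not Alibaba's actual prediction algorithm: it assumes a naive forecast based on average year-over-year growth, and a sizing rule that keeps each machine at a chosen fraction of its measured capacity. The function names, the sample history, and the 75% target utilization are all hypothetical.

```python
import math

def forecast_peak_qps(history):
    """Project the next peak from the average year-over-year growth ratio.

    `history` is a list of past peak QPS values, oldest first (assumed data).
    """
    ratios = [later / earlier for earlier, later in zip(history, history[1:])]
    avg_growth = sum(ratios) / len(ratios)
    return history[-1] * avg_growth

def machines_needed(peak_qps, per_machine_qps, target_utilization=0.75):
    """Machines required so each one runs at `target_utilization` of the
    per-machine capacity measured by a single-machine stress test."""
    usable_qps = per_machine_qps * target_utilization
    return math.ceil(peak_qps / usable_qps)

# Hypothetical past peaks for one subsystem: growth ratios 2.0 and 1.5.
history = [40_000, 80_000, 120_000]
predicted_peak = forecast_peak_qps(history)          # 120000 * 1.75 = 210000.0
fleet_size = machines_needed(predicted_peak, 1500)   # ceil(210000 / 1125) = 187
```

The estimate is only a starting point; in the process described above, the third stage (full‑link stress testing) is what corrects it against real behavior.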
To measure per‑machine service capability, Alibaba runs single‑machine stress tests directly in production. Four methods are used: simulated requests, request replication, request forwarding, and load‑balancer weight adjustment. Each has its own trade‑offs in realism and in the risk of contaminating production data.
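Of the four methods, load‑balancer weight adjustment is the easiest to reason about numerically: raising one machine's weight shifts a larger share of real traffic onto it. The sketch below assumes a plain weighted load balancer in which a host's share equals its weight divided by the sum of all weights; the host names and numbers are hypothetical.

```python
def traffic_share(weights, target):
    """Fraction of cluster traffic routed to `target` under weighted balancing."""
    return weights[target] / sum(weights.values())

def qps_on_target(cluster_qps, weights, target):
    """QPS the target host receives for a given total cluster QPS."""
    return cluster_qps * traffic_share(weights, target)

# Five hypothetical hosts with equal weights share 5000 QPS evenly.
weights = {f"host-{i}": 1 for i in range(5)}
baseline_qps = qps_on_target(5000, weights, "host-0")   # 1000.0

# Raising host-0's weight to 4 gives it 4/8 of the traffic.
weights["host-0"] = 4
stressed_qps = qps_on_target(5000, weights, "host-0")   # 2500.0
```

Because the extra load is genuine user traffic, this approach is realistic and writes no synthetic data, but the load that can be applied is bounded by the traffic the cluster actually receives.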
A dedicated stress‑test platform automates scheduling, execution, and real‑time monitoring, stopping tests when system load exceeds predefined thresholds and generating detailed reports.
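The stop‑on‑threshold behavior can be sketched as a ramp loop that raises load step by step and aborts as soon as a guard metric is breached. Everything here is an assumption for illustration: the `Sample` and `Report` types, the CPU and p99 limits, and `fake_probe`, which stands in for a real measurement with a simple linear model.

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    qps: int
    cpu: float      # CPU utilization, 0.0 to 1.0
    p99_ms: float   # 99th percentile latency in milliseconds

@dataclass
class Report:
    samples: list = field(default_factory=list)
    stopped_reason: str = "completed all steps"

    @property
    def max_safe_qps(self):
        """Highest load level that stayed within every threshold."""
        return max((s.qps for s in self.samples), default=0)

def run_ramp(probe, qps_steps, cpu_limit=0.75, p99_limit_ms=200.0):
    """Apply each load step in order; stop at the first threshold breach."""
    report = Report()
    for qps in qps_steps:
        sample = probe(qps)
        if sample.cpu > cpu_limit:
            report.stopped_reason = f"cpu {sample.cpu:.2f} above limit at {qps} qps"
            break
        if sample.p99_ms > p99_limit_ms:
            report.stopped_reason = f"p99 {sample.p99_ms:.0f} ms above limit at {qps} qps"
            break
        report.samples.append(sample)
    return report

def fake_probe(qps):
    """Hypothetical machine model: CPU and latency grow linearly with load."""
    return Sample(qps=qps, cpu=qps / 2000, p99_ms=50 + qps / 20)

report = run_ramp(fake_probe, [500, 1000, 1500, 2000])
# The 2000 qps step breaches the CPU limit, so max_safe_qps is 1500.
```

The resulting `max_safe_qps` is the kind of per‑machine figure that feeds the capacity evaluation stage.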
Full‑link stress testing replicates the entire Double‑11 scenario in production, generating over 10 million requests per second via a control node and thousands of worker nodes, while isolating test data in a shadow environment to avoid polluting live data.
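One way to picture the shadow isolation is a marker that travels with each test request and redirects its writes to a parallel set of tables. The sketch below is a minimal in‑memory illustration under that assumption; the `x-stress-test` header, the `shadow_` prefix, and the `Store` class are hypothetical and not taken from Alibaba's platform.

```python
SHADOW_PREFIX = "shadow_"          # hypothetical naming convention
STRESS_HEADER = "x-stress-test"    # hypothetical request marker

class Store:
    """Toy in-memory datastore with one list of rows per table name."""

    def __init__(self):
        self.tables = {}

    def insert(self, table, row, is_stress_test=False):
        # Test traffic is written to the shadow twin of the table.
        name = SHADOW_PREFIX + table if is_stress_test else table
        self.tables.setdefault(name, []).append(row)
        return name

def handle_order(store, request):
    """Create an order, routing the write by the request's stress-test marker."""
    is_test = request.get("headers", {}).get(STRESS_HEADER) == "1"
    return store.insert("orders", {"item": request["item"]}, is_stress_test=is_test)

store = Store()
handle_order(store, {"item": "book"})                                   # live table
handle_order(store, {"item": "book", "headers": {STRESS_HEADER: "1"}})  # shadow table
```

For this to work end to end, the marker has to be propagated through every downstream call, cache, and message queue, otherwise a single unmarked hop would write test data into live storage.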
Since its introduction, full‑link testing has uncovered hundreds of issues each year, dramatically improving site stability and becoming a mandatory step for all major Alibaba promotions.
When traffic exceeds capacity, the system employs a flexible flow‑control framework that monitors runtime status, call relationships, and control policies, allowing actions such as request dropping, downstream degradation, blacklisting, or queuing to prevent avalanche failures.
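Three of the actions named above (blacklisting, queuing, and request dropping) can be sketched with a token bucket in front of a bounded queue. This is an illustrative stand‑in rather than the framework the article describes: the class name, parameters, and decision labels are assumptions, and the clock is passed in explicitly so the behavior is deterministic.

```python
from collections import deque

class FlowController:
    """Token-bucket admission with a blacklist and a bounded waiting queue."""

    def __init__(self, rate_per_sec, burst, queue_size, blacklist=()):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = 0.0
        self.queue = deque()
        self.queue_size = queue_size
        self.blacklist = set(blacklist)

    def _refill(self, now):
        # Add tokens for the elapsed time, capped at the burst capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def admit(self, caller, now):
        """Return 'rejected', 'passed', 'queued', or 'dropped' for one request."""
        if caller in self.blacklist:
            return "rejected"
        self._refill(now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return "passed"
        if len(self.queue) < self.queue_size:
            self.queue.append(caller)
            return "queued"
        return "dropped"

    def drain(self, now):
        """Release queued callers, oldest first, while tokens are available."""
        self._refill(now)
        released = []
        while self.queue and self.tokens >= 1.0:
            self.tokens -= 1.0
            released.append(self.queue.popleft())
        return released

fc = FlowController(rate_per_sec=2, burst=2, queue_size=1, blacklist={"crawler"})
decisions = [fc.admit(c, now=0.0) for c in ["crawler", "a", "b", "c", "d"]]
# decisions == ['rejected', 'passed', 'passed', 'queued', 'dropped']
```

Downstream degradation is not modeled here; it would need the call‑relationship information mentioned above to decide which dependent services to switch off first.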
The combined use of precise capacity planning, full‑link stress testing, and adaptive traffic control ensures Alibaba’s large‑scale events remain highly available and performant.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
