
How Alibaba Scaled Double 11: Inside the Cloud Architecture Evolution

Over nine years, Alibaba transformed its Double 11 e‑commerce platform from a centralized system to a highly elastic, cloud‑native architecture, employing distributed design, multi‑active regions, unified scheduling, containerization with Pouch, and hybrid deployment to dramatically cut costs and boost peak throughput.

Alibaba Cloud Developer

Over nine years, Alibaba's Double 11 sales grew 280× in transaction volume and more than 800× in peak traffic. That growth turned the event into a massive scalability and stability challenge and forced large improvements in both the management of system complexity and cost efficiency.

Key challenges included internet‑scale traffic, enterprise‑level service complexity, financial‑grade transaction integrity, and peak loads that were dozens of times higher than normal traffic.

Cloud Architecture Evolution

Starting in 2008, Alibaba moved from a centralized architecture to a distributed, scalable design, which required building a large body of middleware. By 2013, a multi‑active, region‑level deployment allowed complete transaction units to run in data centers across the country, which addressed the scalability limits.

Full‑link pressure testing was introduced in 2013 to simulate Double 11 traffic on the production environment, exposing and fixing bottlenecks before the event.
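
The article does not describe how test traffic is kept apart from real orders. One common approach is to tag synthetic requests and send their writes to shadow tables; the sketch below assumes that approach. The `is_pressure_test` flag, the `shadow_` prefix, and the in-memory `db` dictionary are all hypothetical and stand in for whatever the real system uses.

```python
SHADOW_PREFIX = "shadow_"  # hypothetical naming convention, not from the article


def table_for(request, table):
    """Route writes from tagged test traffic to a shadow table so that
    synthetic orders never mix with real ones."""
    if request.get("is_pressure_test"):
        return SHADOW_PREFIX + table
    return table


def handle_order(request, db):
    """Same code path for real and synthetic traffic; only the target table differs."""
    target = table_for(request, "orders")
    db.setdefault(target, []).append(request["order_id"])
    return target


db = {}  # in-memory stand-in for a production database
handle_order({"order_id": 1}, db)                            # real order
handle_order({"order_id": 2, "is_pressure_test": True}, db)  # synthetic order
```

Because both kinds of request run through the same handler, a bottleneck found under synthetic load is a bottleneck in the production code path, which is the point of testing on the live environment.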

To reduce hardware and manpower costs, a cloud‑native architecture was adopted, enabling elastic resource reuse across clusters. Resources were divided into online service clusters, compute task clusters, and ECS clusters, each with independent scheduling and resource management.

Unified Scheduling System

The Sigma scheduler, under development since 2011, is organized as a three‑layer "brain": AliKernel (kernel enhancements for flexible CPU and priority allocation), SigmaSlave (per‑node CPU handling for latency‑sensitive tasks), and SigmaMaster (global resource allocation and algorithmic optimization). The system was rewritten in Go in 2016 and made compatible with the Kubernetes API in 2017.
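
The split between a global allocator and per‑node state can be pictured with a small sketch. This is not Sigma's actual algorithm: the `Node` class, the `place` function, the best‑fit heuristic, and the CPU numbers are assumptions chosen only to illustrate a global placement decision over per‑node capacity.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical per-node view, standing in for what a node agent reports."""
    name: str
    cpu_total: int
    cpu_used: int = 0
    containers: list = field(default_factory=list)

    @property
    def cpu_free(self) -> int:
        return self.cpu_total - self.cpu_used


def place(nodes, container, cpu_request):
    """Global placement step (assumed best-fit): pick the node with the
    least free CPU that still fits the request, to limit fragmentation."""
    candidates = [n for n in nodes if n.cpu_free >= cpu_request]
    if not candidates:
        return None  # no capacity anywhere; the caller must queue or scale out
    best = min(candidates, key=lambda n: n.cpu_free)
    best.cpu_used += cpu_request
    best.containers.append(container)
    return best.name


nodes = [Node("node-a", 32, cpu_used=20), Node("node-b", 32, cpu_used=4)]
first = place(nodes, "trade-1", 8)   # fits on both; node-a has less free CPU
second = place(nodes, "trade-2", 8)  # node-a now has 4 free, so node-b is chosen
third = place(nodes, "batch-1", 64)  # larger than any single node
```

Best‑fit is only one possible policy; a production scheduler would also weigh affinity, priority, and interference between workloads.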

Unified scheduling and centralized management improved resource utilization by over 5% across the massive scale of Double 11.

Key Technologies of Mixed Deployment

Mixed deployment (混部) combines online services and compute tasks on the same physical machines, achieving higher overall utilization. Critical techniques include:

CPU hyper‑thread isolation with the Noise Clean kernel feature.

Enhanced CFS scheduling with Task Preempt, which gives online tasks higher priority.

Cache isolation via CAT (Cache Allocation Technology), which partitions the last‑level cache (LLC).

Memory isolation using CGroup, OOM priority, and bandwidth control.

Elastic memory that allows offline tasks to over‑commit and release memory on demand.

Network QoS with tiered bandwidth guarantees (gold, silver, bronze).
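
As an illustration of the last item, the sketch below hands out link bandwidth tier by tier. Only the tier names come from the text above; the strict‑priority rule, the `allocate_bandwidth` function, and the Mbps figures are assumptions, and a real QoS system with guarantees would also reserve a minimum for each tier.

```python
TIER_ORDER = ["gold", "silver", "bronze"]  # tier names from the article; strict priority is assumed


def allocate_bandwidth(link_mbps, demands):
    """Grant bandwidth tier by tier; lower tiers get only what is left.

    demands maps tier name -> requested Mbps.
    Returns tier name -> granted Mbps.
    """
    remaining = link_mbps
    grants = {}
    for tier in TIER_ORDER:
        want = demands.get(tier, 0)
        grants[tier] = min(want, remaining)
        remaining -= grants[tier]
    return grants


# A 10 Gbps link that is oversubscribed by 3 Gbps: bronze absorbs the shortfall.
grants = allocate_bandwidth(10_000, {"gold": 6_000, "silver": 3_000, "bronze": 4_000})
```

In this example gold and silver receive everything they ask for and bronze is cut to the 1,000 Mbps that remains, which matches the intent of protecting online traffic first.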

Online cluster management further applies profiling, affinity, priority policies, and automatic shrink/expand to balance stability‑first strategies during Double 11 peaks and utilization‑first strategies during normal operation.

Compute Task Scheduling & ODPS

Elastic memory time‑sharing, dynamic memory over‑commit, and both lossless and lossy degradation strategies enable compute tasks to coexist with online services, raising average CPU utilization from 10% to over 40% and saving more than 30% of server capacity.
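
The article does not define the two degradation modes. The sketch below assumes one plausible reading: lossless degradation throttles offline tasks so they keep their progress, while lossy degradation evicts them so the work must be rerun. The `degrade_offline` function, the thresholds, and the `cpu_share` field are hypothetical.

```python
def degrade_offline(online_cpu_pct, offline_tasks, throttle_at=60, evict_at=80):
    """Decide what to do with co-located offline tasks given online CPU load.

    Thresholds are hypothetical. Returns (action, surviving task list).
    """
    if online_cpu_pct >= evict_at:
        # Lossy: offline work is killed and must be rerun elsewhere.
        return "evict", []
    if online_cpu_pct >= throttle_at:
        # Lossless: tasks keep their state but get half the CPU share.
        throttled = [{**t, "cpu_share": t["cpu_share"] // 2} for t in offline_tasks]
        return "throttle", throttled
    # Online load is low: offline tasks run untouched.
    return "run", list(offline_tasks)


tasks = [{"name": "etl", "cpu_share": 8}]
quiet = degrade_offline(30, tasks)
busy = degrade_offline(70, tasks)
peak = degrade_offline(90, tasks)
```

Trying the lossless step first keeps compute work from being wasted, and eviction is reserved for the point where online services would otherwise suffer.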

Pouch Container and Containerization Progress

Pouch, Alibaba's internal container runtime, has been built on LXC since 2011. It added Docker image support in 2015 and was open‑sourced in 2017. It offers strong isolation, login capability, and a P2P image distribution mechanism, and it supports runtimes such as RunC, RunV, and RunLXC.

Pouch now powers millions of containers across Alibaba's business units, achieving full containerization of online services and expanding to compute workloads.

Storage‑Compute Separation

To avoid performance penalties from stateful tasks, Alibaba introduced a storage‑compute separation layer. A caching bridge decouples data movement between the online and compute clusters, and distributed storage is integrated through the industry‑standard CSI (Container Storage Interface).

Future Cloud Architecture Roadmap

The hybrid cloud elastic architecture enables minute‑level scaling, rapid provisioning of transaction units, and seconds‑level health checks, reducing resource holding time and non‑online server time by over 30% and cutting Double 11 infrastructure cost by 50%.

Future work focuses on further efficiency gains through AI‑driven decision making, smarter resource prediction, and continued expansion of unified scheduling and mixed deployment at larger scales.
