
How Alibaba Scales Resource Operations for Massive Events like Double 11

In this talk, Alibaba engineer Yang Yi explains the evolution of resource operation and DevOps at Alibaba, covering the shift from manual tasks to containerized, automated platforms, the challenges of large‑scale scheduling, cost reduction strategies for events such as Double 11, and the move toward intelligent, ops‑less infrastructure.

Efficient Ops

Lecturer | Yang Yi

Editor | Huang Xiaoxuan

Lecturer Biography

Yang Yi joined Alibaba in 2010 as a monitoring system operations developer, moved to the operations team in 2012 to take responsibility for core transaction system operations, and in 2016 began leading the resource operation team of the Systems Software Division, which handles resource management for Alibaba's online services. Yang has extensive experience in large-scale event operations (e.g., Double 11), automation, DevOps, and resource operation.

Preface

Hello, I am Yang Yi. Today I will share insights on resource operation, covering four main topics: the evolution of resource operation, large‑scale operation platforms, cost reduction, and intelligent operation.

Evolution of Resource Operation

Alibaba’s DevOps journey began with manual, keyboard‑driven operations. As business scale grew, tools were introduced to automate repetitive tasks, leading to a DevOps transformation.

Over the past few years, Alibaba’s online operations have reached an initial level of automation and are now exploring intelligent operations. Containerization, especially Docker, has become a milestone in this transformation.

The future direction is "Opsless" – moving from traditional operations to a NoOps state where manual operations become unnecessary.

Operational granularity has evolved from single‑object management (configuration files, packages) to container‑level, then Pod‑level, and eventually to a "Box" level for Alibaba’s massive and diverse workloads.

Two breakthrough directions for Alibaba operations:

1. Deep integration with business to enable intelligent monitoring and fault localization.

2. Diving into scheduling and kernel-level resource control.

Resource operation aims to manage ultra‑large data‑center resources to lower overall cost across Alibaba’s ecosystem.

Challenges in Resource Operation

Building unified scheduling capabilities for diverse e‑commerce services (Taobao, Tmall, etc.).

Creating a centralized, flat resource supply‑demand model as a middle‑platform service.

Improving resource utilization. Alibaba's online clusters typically run at about 10% CPU utilization, which leaves substantial room for optimization.

Scalable Operation Platform

Alibaba has over 20,000 engineers, each responsible for a system, resulting in tens of thousands of systems.

The platform improves resource operation efficiency through:

Resource scheduling system (Sigma scheduler) that delivers resources with a few clicks.

Budget and quota management, where each team submits annual resource plans for centralized allocation.

Elastic scaling to handle massive capacity bursts.

Large‑scale execution, e.g., expanding tens of thousands of containers during Double 11.

The architecture consists of a budgeting layer, capacity planning, elastic scaling, and a decision center that processes data for large‑scale actions such as container termination.
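As a rough illustration of the capacity arithmetic behind elastic scaling, the sketch below estimates how many containers a service needs at peak and how many to add or release. The function names, the per-container throughput figure, and the headroom parameter are assumptions made for this example; the talk does not describe the formula the decision center actually uses.

```python
import math


def containers_needed(peak_qps: float, qps_per_container: float,
                      headroom: float = 0.25) -> int:
    """Containers required to serve peak_qps plus a safety margin.

    headroom is a hypothetical safety factor (0.25 = 25% extra capacity).
    """
    if qps_per_container <= 0:
        raise ValueError("qps_per_container must be positive")
    return math.ceil(peak_qps * (1 + headroom) / qps_per_container)


def scaling_delta(current: int, peak_qps: float, qps_per_container: float,
                  headroom: float = 0.25) -> int:
    """Positive: containers to add. Negative: containers that can be released."""
    return containers_needed(peak_qps, qps_per_container, headroom) - current
```

With a forecast peak of 100,000 QPS, 500 QPS per container, and 25% headroom, the service needs 250 containers, so a fleet of 200 would be expanded by 50.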

All internal systems have resource caps (CPU cores, disk space, etc.), forming a closed‑loop quota ecosystem.
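The closed-loop idea can be sketched as a simple admission check: a request is granted only if it keeps the team under its cap on every dimension. The class names, fields, and the two dimensions shown here (CPU cores and disk) are hypothetical and chosen only to mirror the caps mentioned above; this is not Alibaba's quota system.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    cpu_cores: int
    disk_gb: int


@dataclass
class Team:
    name: str
    cap: Quota   # approved ceiling from the annual plan (assumed model)
    used: Quota  # resources currently allocated


def try_allocate(team: Team, cpu_cores: int, disk_gb: int) -> bool:
    """Grant the request only if it fits under the cap on every dimension."""
    fits = (team.used.cpu_cores + cpu_cores <= team.cap.cpu_cores
            and team.used.disk_gb + disk_gb <= team.cap.disk_gb)
    if fits:
        # Record usage so the next request is checked against the new total.
        team.used.cpu_cores += cpu_cores
        team.used.disk_gb += disk_gb
    return fits
```

Because usage is updated on every grant, a team that has consumed its cap is rejected until it releases resources or the cap is raised through the budgeting process.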

Reducing Resource Cost

Alibaba faces huge budget pressure for infrastructure, especially during Double 11 when peak transaction capacity must be provisioned.

Two main cost challenges:

Massive peak demand during Double 11, requiring hundreds of thousands of machines.

Offline batch jobs that consume resources after the peak, leading to idle capacity.

Solutions include leveraging Alibaba Cloud to quickly provision resources (often within 10‑20 days) and mixing online and offline workloads on the same physical machines to raise overall CPU utilization.

Since 2015, Alibaba has been deploying hybrid workloads: online services run on physical machines, while offline tasks can also use those machines when idle, achieving high CPU utilization while keeping offline impact on online services below 10%.
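To see why co-location raises utilization, consider a back-of-the-envelope model in which offline jobs soak up some fraction of the CPU that online services leave idle. The function and the `offline_fill` parameter are assumptions for illustration; the 10% starting point is the typical online utilization quoted earlier in the talk.

```python
def colocated_utilization(online_util: float, offline_fill: float) -> float:
    """Overall CPU utilization when offline jobs use part of the idle CPU.

    online_util:  fraction of CPU used by online services (0.0 to 1.0)
    offline_fill: fraction of the remaining idle CPU given to offline jobs
                  (hypothetical parameter, 0.0 to 1.0)
    """
    if not (0.0 <= online_util <= 1.0 and 0.0 <= offline_fill <= 1.0):
        raise ValueError("both arguments must be between 0 and 1")
    idle = 1.0 - online_util
    return online_util + idle * offline_fill
```

Under this model, a machine running online services at 10% that hands half of its idle CPU to offline jobs ends up at roughly 55% overall utilization.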

Intelligent Operation

Mixed‑deployment technology allows rapid suspension of low‑priority offline tasks during online traffic spikes.

Online and offline schedulers cooperate; online traffic takes precedence, and resource isolation ensures online services receive guaranteed resources.
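The priority rule described here can be sketched as follows: when online services need more CPU, running offline tasks are suspended, least important first, until enough cores are freed. The task model, the priority convention (lower number means less important), and the function name are all assumptions for this example, not the behavior of Alibaba's schedulers.

```python
from dataclasses import dataclass


@dataclass
class OfflineTask:
    name: str
    priority: int       # assumption: lower number = less important
    cpu_cores: float
    suspended: bool = False


def suspend_for_online(tasks: list[OfflineTask], cores_needed: float) -> list[str]:
    """Suspend running offline tasks, least important first, until
    cores_needed has been freed. Returns the names of suspended tasks."""
    freed = 0.0
    suspended: list[str] = []
    for task in sorted(tasks, key=lambda t: t.priority):
        if freed >= cores_needed:
            break
        if task.suspended:
            continue  # already paused, frees nothing new
        task.suspended = True
        freed += task.cpu_cores
        suspended.append(task.name)
    return suspended
```

A real system would also resume tasks once the spike passes and would rely on kernel-level isolation rather than bookkeeping alone; the sketch covers only the ordering decision.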

Another innovation is Cpushare, an application‑profiling technique that slices CPU time into fine‑grained units, allowing allocation of fractional cores (e.g., 0.1 CPU) to low‑usage services, improving overall utilization.
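The talk does not say how Cpushare is implemented. Purely as an illustration of what a fractional core means in practice, one widely used Linux mechanism is CFS bandwidth control, where a cgroup may run for a quota of microseconds in each scheduling period (`cpu.cfs_quota_us` and `cpu.cfs_period_us` in cgroup v1). The helper below is a hypothetical sketch that converts a fractional-core limit into such a quota.

```python
def cfs_quota_us(cpu_fraction: float, period_us: int = 100_000) -> int:
    """CPU time in microseconds allowed per period for a fractional-core limit.

    period_us defaults to 100 ms, the usual default for CFS bandwidth control.
    """
    if cpu_fraction <= 0:
        raise ValueError("cpu_fraction must be positive")
    if period_us <= 0:
        raise ValueError("period_us must be positive")
    return round(cpu_fraction * period_us)
```

Under this scheme, a service limited to 0.1 CPU may run for 10,000 microseconds out of every 100,000, and a limit of 2.5 CPUs corresponds to a quota of 250,000 microseconds spread across cores.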

Containers can be given resource ceilings and floors, enabling dynamic scaling based on demand while preserving isolation.
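A floor and a ceiling amount to clamping the demand-driven allocation into a fixed range, which can be sketched in a few lines. The function name and the use of CPU cores as the unit are assumptions for this example.

```python
def target_cpu(demand: float, floor: float, ceiling: float) -> float:
    """Clamp a container's CPU allocation between its floor and ceiling."""
    if floor > ceiling:
        raise ValueError("floor must not exceed ceiling")
    return max(floor, min(demand, ceiling))
```

A container with a floor of 1 core and a ceiling of 4 cores keeps 1 core when nearly idle, follows demand in between, and is capped at 4 cores under heavy load, so neighbors on the same machine are protected.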

These technologies enable Alibaba to meet Double 11 demand with minimal additional hardware procurement.

Q&A

Q1: How do you handle budget changes when many departments submit annual plans?

A: We aggregate all Alibaba resources, set a daily price per server, and adjust allocations based on BU approvals. Our KPI focuses on increasing revenue through efficient resource reuse rather than raising unit prices.

Q2: Is Alibaba Cloud’s resource pool shared with internal services during peak periods?

A: Internal services use Alibaba Cloud’s reserved capacity, not public‑cloud customers’ resources, ensuring consistent performance.

Q3: What happens when sudden resource pressure occurs despite the low 10% utilization rate?

A: We maintain a centralized buffer (a “checkbook”) that can be drawn upon during spikes, similar to a bank’s liquidity during a run.

Q4: Who is responsible for OS, middleware, and database maintenance in the resource pool?

A: After containerization, most routine operations are handed to developers. Dedicated teams still manage middleware and distributed storage, but the overall operational load is greatly reduced.
