How Mixed Offline/Online Scheduling Boosted Alibaba’s Data Center Utilization by 30%

The article explains how rapid internet growth has expanded data centers, why traditional operations fall short, presents a simple utilization formula, shows Alibaba’s mixed offline‑online scheduling experiment that raised server usage from 10% to over 40%, and announces an open dataset for academic research.

Alibaba Cloud Developer

Sep 6, 2017

Over the past 20 years, and especially during the last decade of mobile internet and the "Internet+" wave, internet technology has permeated every industry and aspect of daily life, driving a massive increase in the scale of both services and data. Data centers have expanded dramatically as a result, and traditional operations practices can no longer keep up with that scale, which has prompted the emergence of automated cluster management systems.

These systems share a common goal: improving machine utilization in data centers. Even a small increase in average utilization can yield substantial cost savings. Suppose N servers run at average utilization R1. If the same total work (N*R1) is served at a higher utilization R2, only N-X servers are needed, so the number of servers saved, X, follows from:

N*R1 = (N-X)*R2

=> X*R2 = N*R2 - N*R1

=> X = N*(R2-R1)/R2

If a data center has 100,000 servers and utilization rises from 28% to 40%, the formula gives X = 100,000 * (40-28)/40 = 30,000 servers, saving roughly 600 million RMB (assuming 20,000 RMB per server).
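
The arithmetic above can be checked with a short script. This is only a minimal sketch: the function name servers_saved and the variable names are illustrative and do not come from the original article; the 20,000-per-server cost is the assumption the article itself states.

```python
def servers_saved(n: int, r1: float, r2: float) -> float:
    """Servers freed when the same total work N*R1 is served at utilization R2.

    Derived from N*R1 = (N - X)*R2, which gives X = N*(R2 - R1)/R2.
    Utilizations are fractions in (0, 1], with r1 <= r2.
    """
    if not 0 < r1 <= r2 <= 1:
        raise ValueError("expected 0 < r1 <= r2 <= 1")
    return n * (r2 - r1) / r2


if __name__ == "__main__":
    saved = servers_saved(100_000, 0.28, 0.40)
    cost_per_server = 20_000  # per-server cost assumed in the article
    print(f"servers saved: {saved:,.0f}")
    print(f"approximate savings: {saved * cost_per_server:,.0f}")
```

Because the saving is divided by R2 rather than R1, the same absolute gain in utilization frees fewer servers the higher the target utilization already is.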

However, studies by Gartner and McKinsey show global server utilization is only 6%‑12%; even with virtualization it reaches only 7%‑17%, highlighting the inefficiency of traditional operations. Fine‑grained resource scheduling and virtualization (VMs or containers) can increase utilization, but high‑density deployments introduce resource contention that raises latency for online services, which directly impacts user churn and revenue.

To explore whether mixing latency‑insensitive batch jobs with latency‑sensitive online services could improve overall utilization without harming online performance, Alibaba began experiments in 2015. Previously, Alibaba operated separate schedulers: Fuxi (offline, process‑based) and Sigma (online, container‑based). The mixed deployment placed batch tasks on the same machines as online services, allowing idle online resources to be used by offline jobs.

After two years of trials, architectural adjustments, and work on resource isolation, the mixed deployment moved into large‑scale production, serving core e‑commerce and big‑data (ODPS) workloads. After mixing, average utilization of the online machines rose from around 10% to over 40%, while online SLOs were still met.
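
The basic idea of filling idle capacity on online machines with batch work can be illustrated with a toy model. The sketch below is not Alibaba's scheduler and does not reflect how Fuxi or Sigma work internally. The Machine class, the greedy first-fit policy, and the utilization cap are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    online_load: float        # fraction of CPU used by online services
    batch_load: float = 0.0   # fraction of CPU used by co-located batch tasks

    @property
    def utilization(self) -> float:
        return self.online_load + self.batch_load


def place_batch_tasks(machines, tasks, cap=0.5):
    """Greedy first-fit placement (hypothetical policy).

    Each batch task goes on the first machine whose total load stays at or
    below `cap`, leaving headroom for the online service. Returns the CPU
    demands of the tasks that could not be placed.
    """
    unplaced = []
    for demand in sorted(tasks, reverse=True):
        for m in machines:
            # Small tolerance so float rounding does not reject an exact fit.
            if m.utilization + demand <= cap + 1e-9:
                m.batch_load += demand
                break
        else:
            unplaced.append(demand)
    return unplaced


def average_utilization(machines) -> float:
    return sum(m.utilization for m in machines) / len(machines)


if __name__ == "__main__":
    cluster = [Machine(online_load=0.10) for _ in range(4)]
    before = average_utilization(cluster)
    leftover = place_batch_tasks(cluster, tasks=[0.2] * 8, cap=0.5)
    after = average_utilization(cluster)
    print(f"average utilization: {before:.2f} -> {after:.2f}, unplaced: {len(leftover)}")
```

A static cap is the simplest possible safeguard. A production system would also need resource isolation and some way to throttle or evict batch work when online load spikes, which this sketch leaves out.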

Recognizing the gap between academic research (often limited in scale and data realism) and production environments, Alibaba now releases a subset of this mixed‑deployment cluster data to the research community: 1,000 servers monitored for 12 hours, with format details and download links on GitHub (https://github.com/alibaba/clusterdata). Researchers are invited to use this real‑world dataset to develop better scheduling and cluster management methods.

For questions or suggestions, please contact [email protected].
