
How Alibaba’s Co‑Location (Mixed‑Deployment) Cuts Costs and Boosts Utilization

Alibaba’s mixed‑deployment (Co‑location) technology combines online services and batch compute tasks on shared physical resources, using priority‑based scheduling, resource isolation, and dynamic memory management to dramatically improve CPU utilization, cut infrastructure costs, and maintain service level objectives during peak traffic events.

Alibaba Cloud Developer

Feb 12, 2018


Mixed‑Deployment (Co‑location) Overview

Mixed‑deployment, or Co‑location, mixes different types of workloads on the same physical resources, scheduling online services and batch compute tasks together while ensuring service‑level objectives (SLOs) and significantly reducing costs.

Background

Major traffic peaks such as Alibaba’s Double‑11 shopping festival require massive compute resources, yet the capacity provisioned for those peaks sits idle most of the time. Average server CPU utilization worldwide is only 6‑12%, and even with virtualization it reaches merely 7‑17%. Alibaba’s online services average about 10% utilization.

Conversely, big‑data processing frameworks (Hadoop, Spark, Flink, TensorFlow, etc.) generate high‑CPU workloads that peak at night, often exceeding 50‑60% CPU usage. These workloads are typically isolated in separate clusters.

Motivation for Mixing Clusters

Just as tidal traffic patterns cause directional congestion, online services experience low load at night and high load during the day, while batch jobs show the opposite pattern. By allowing low‑priority batch tasks to run on idle online‑service resources, overall utilization improves dramatically.
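The tidal pattern can be illustrated with a small sketch. The hourly load profiles below are hypothetical numbers chosen only to show the shape of the argument; they are not Alibaba measurements.

```python
# Hypothetical hourly CPU load, as a fraction of one cluster's capacity.
# Hours 0-7 and 22-23 are "night"; hours 8-21 are "day".
online = [0.05] * 8 + [0.30] * 14 + [0.05] * 2   # busy by day, quiet at night
batch = [0.50] * 8 + [0.10] * 14 + [0.50] * 2    # busy at night, quiet by day


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Load on a single shared cluster: the two profiles stacked hour by hour.
combined = [o + b for o, b in zip(online, batch)]

print("online-only cluster average:", mean(online))
print("batch-only cluster average: ", mean(batch))
print("shared cluster average:     ", mean(combined))
print("shared cluster peak:        ", max(combined))
```

With these assumed profiles, the shared cluster averages about 46% utilization and never exceeds its capacity, whereas two separate clusters of the same size would average roughly 20% and 27% respectively.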

Key Characteristics of Mixed‑Deployment

Priority Separation: Low‑priority batch tasks can be pre‑empted without affecting high‑priority online services.

Resource Complementarity: Online services are busy during the day and lightly loaded at night, while batch jobs peak during those off‑peak hours, so the two workloads can share the same machines.
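Priority separation can be sketched as a toy placement routine. Everything here (the `Task` and `Node` classes, the `ONLINE`/`BATCH` constants, and the eviction order) is a hypothetical illustration, not the logic of Alibaba's schedulers.

```python
from dataclasses import dataclass, field

# Assumed convention: a lower number means a higher priority.
ONLINE, BATCH = 0, 1


@dataclass
class Task:
    name: str
    priority: int
    cores: int


@dataclass
class Node:
    capacity: int
    running: list = field(default_factory=list)

    def free(self):
        """Cores not currently assigned to any task."""
        return self.capacity - sum(t.cores for t in self.running)

    def place(self, task):
        """Place a task, evicting strictly lower-priority tasks if needed.

        Returns the list of evicted tasks, or None if the task cannot fit
        even after evicting every lower-priority task.
        """
        candidates = [t for t in self.running if t.priority > task.priority]
        free = self.free()
        victims = []
        for t in candidates:
            if free >= task.cores:
                break
            victims.append(t)
            free += t.cores
        if free < task.cores:
            return None  # nothing is evicted when placement fails
        for t in victims:
            self.running.remove(t)
        self.running.append(task)
        return victims


node = Node(capacity=8)
node.place(Task("batch-1", BATCH, 4))
node.place(Task("batch-2", BATCH, 4))
evicted = node.place(Task("web", ONLINE, 6))
print([t.name for t in evicted])  # ['batch-1', 'batch-2']
```

A batch task can never evict an online task in this model: if it does not fit in the idle cores, `place` returns `None` and the node is left unchanged.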

Cost Savings Example

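Consider a data center with N servers. If the total amount of work stays constant, raising average utilization from R1 to R2 frees X = N * (R2 - R1) / R2 servers, because the same work must fit on the remaining N - X machines:

N * R1 = (N - X) * R2
=> X * R2 = N * R2 - N * R1
=> X = N * (R2 - R1) / R2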

For 100,000 servers, raising utilization from 28% to 40% saves roughly 30,000 machines, equating to about 600 million RMB in cost.
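The arithmetic can be checked with a few lines of Python. The function name `servers_saved` is hypothetical, and the per-server cost is not stated in the article; it is only what the two quoted figures imply when divided.

```python
def servers_saved(n_servers, r1, r2):
    """Servers freed when average utilization rises from r1 to r2.

    Implements X = N * (R2 - R1) / R2, assuming total work stays constant.
    """
    if not 0 < r1 <= r2 <= 1:
        raise ValueError("expected 0 < r1 <= r2 <= 1")
    return n_servers * (r2 - r1) / r2


saved = servers_saved(100_000, 0.28, 0.40)
print(round(saved))  # 30000

# Unit cost implied by the article's figures (600 million RMB / 30,000 servers).
implied_cost_per_server = 600_000_000 / 30_000
print(implied_cost_per_server)  # 20000.0
```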

Historical Timeline

2014: Technical feasibility studies and design.

2015: Testing environment setup; identified scheduling, isolation, storage, and memory challenges.

2016: Small‑scale production validation with ~200 nodes.

2017: Full‑scale production; ~20% of Double‑11 traffic ran on mixed‑deployment clusters.

Architecture of Mixed‑Deployment Scheduling

Two independent schedulers run side‑by‑side:

Sigma: Manages online service containers, compatible with Kubernetes APIs and Alibaba’s OCI‑compatible Pouch containers.

Fuxi: Handles massive data‑processing jobs, supporting MapReduce‑style pipelines, high parallelism, and fault tolerance.

A coordination layer, referred to as layer zero, sits beneath both schedulers and mediates resource allocation between Sigma and Fuxi.
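The article does not describe how the coordination layer arbitrates, so the following is only a toy model of one plausible policy: online requests are always honored first, and batch work receives whatever is idle. The `Coordinator` class and its method names are hypothetical.

```python
class Coordinator:
    """Toy arbiter between an online scheduler and a batch scheduler."""

    def __init__(self, total_cores):
        self.total = total_cores
        self.online = 0  # cores currently granted to online services
        self.batch = 0   # cores currently granted to batch jobs

    def request_online(self, cores):
        """Grant cores to online services, shrinking batch if necessary.

        Returns the number of cores reclaimed from batch jobs.
        """
        if self.online + cores > self.total:
            raise ValueError("online demand exceeds total capacity")
        self.online += cores
        reclaimed = max(0, self.online + self.batch - self.total)
        self.batch -= reclaimed
        return reclaimed

    def release_online(self, cores):
        """Return cores that online services no longer need."""
        self.online = max(0, self.online - cores)

    def request_batch(self, cores):
        """Grant batch jobs at most the currently idle cores."""
        granted = min(cores, self.total - self.online - self.batch)
        self.batch += granted
        return granted


coord = Coordinator(total_cores=32)
coord.request_online(8)
print(coord.request_batch(40))   # 24: batch fills the idle cores
print(coord.request_online(16))  # 16: batch is squeezed as online load rises
```

Keeping the arbitration in one place means neither scheduler needs to know about the other's internal state, which is one reason to run two independent schedulers over a shared layer.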

Resource Isolation Mechanisms

CPU Scheduling Optimization: CGroup priority settings let high‑priority tasks pre‑empt low‑priority ones, and interference between sibling hyper‑threads is avoided.

L3 Cache Isolation: Uses Intel CAT (Cache Allocation Technology) to limit the L3 cache available to low‑priority tasks.

Memory Bandwidth Isolation: Monitors memory bandwidth and adjusts CFS bandwidth control, which throttles the CPU time of low‑priority tasks, to favor high‑priority workloads.

Memory Protection: Memory is reclaimed separately per CGroup, and OOM killing targets low‑priority batch tasks first.

IO Isolation: File‑level bandwidth caps, metadata throttling, and tiered bandwidth sharing (gold, silver, bronze).

Network Flow Control: Host‑level bandwidth isolation with TC (Linux traffic control) and container‑level bandwidth sharing.
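As one example of these policies, the memory‑protection rule (kill low‑priority batch tasks first) can be modeled in a few lines. This is a toy selection policy under assumed names (`Group`, `pick_oom_victim`) and an assumed tie‑break on memory usage; it is not the Linux kernel's OOM killer or Alibaba's implementation.

```python
from typing import NamedTuple, Optional, Sequence


class Group(NamedTuple):
    name: str
    priority: int  # assumed convention: higher number = lower priority
    rss_mb: int    # resident memory currently used by the group


def pick_oom_victim(groups: Sequence[Group]) -> Optional[Group]:
    """Choose the group to kill under memory pressure.

    The lowest-priority group is chosen first; among groups of equal
    priority, the one using the most memory is chosen (an assumption).
    """
    if not groups:
        return None
    return max(groups, key=lambda g: (g.priority, g.rss_mb))


groups = [
    Group("web", priority=0, rss_mb=6000),
    Group("batch-a", priority=1, rss_mb=2000),
    Group("batch-b", priority=1, rss_mb=3500),
]
print(pick_oom_victim(groups).name)  # batch-b
```

The online service is spared even though it uses the most memory, because priority is compared before memory usage.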

Future Directions

Mixed‑deployment will evolve toward finer‑grained scheduling, support for GPUs and FPGAs, scaling to million‑core clusters, and deeper integration of machine‑learning‑driven resource profiling. The goal is to make mixed‑deployment a universal scheduling capability across all resource types.
