Cloud Computing 17 min read

How Partitioned Synchronization Scales Alibaba’s Massive Cloud Clusters

At USENIX ATC2021, Alibaba Cloud’s Fuxi 2.0 team presented a best‑paper‑award research showing how a partitioned‑synchronization (ParSync) scheduling architecture dramatically reduces conflicts and latency in ultra‑large production clusters, balancing efficiency, quality, and fairness without adding resources.

Alibaba Cloud Developer
Alibaba Cloud Developer
Alibaba Cloud Developer
How Partitioned Synchronization Scales Alibaba’s Massive Cloud Clusters

Introduction

At the 2021 USENIX ATC conference, Alibaba Cloud’s Feitian Fuxi team and the Chinese University of Hong Kong presented the paper Scaling Large Production Clusters with Partitioned Synchronization . The paper was accepted and selected as one of the three Best Papers, marking the first time a Chinese company received this honor.

Paper Background

AI and big‑data workloads drive rapid growth in compute demand, pushing cloud clusters beyond ten‑thousand nodes. A typical production cluster may contain up to 100,000 machines and execute billions of short‑lived tasks daily. The scheduler, which matches multidimensional resource requests to machines, becomes a scalability bottleneck as cluster size and concurrency increase.

Current Situation Analysis

In Alibaba’s production environment, daily job counts range from 3.34 M to 4.36 M, with each job comprising many tasks (3.1 B–4.4 B per day). Over 87 % of tasks finish within 10 seconds, creating a massive scheduling load.

Traditional single‑master architectures cannot handle such scale because heartbeat latency grows and scheduling complexity explodes. Static partitioning of a large cluster into smaller ones introduces resource fragmentation and operational overhead.

Scheduling Goals and Challenges

Scheduling efficiency (low latency)

Scheduling quality (resource constraints, data locality, hardware preferences)

Fairness and priority among tenants

High resource utilization

These goals often conflict: improving quality may increase latency, and strict fairness can reduce utilization.

Theoretical Overview

The authors surveyed existing multi‑scheduler models and selected the Omega shared‑state approach for further analysis. They modeled scheduling conflicts and derived the expected conflict:

In the formula, Y_i is the expected conflict for a scheduler slot, N is the number of schedulers, K is each scheduler’s processing capacity, and S is the number of schedulable slots. Increasing S or N reduces conflict probability, but adding more schedulers also raises the raw number of conflicts; however, each scheduler’s load decreases proportionally, ultimately lowering overall conflict.

The analysis proves that with more than one scheduler, conflicts cannot be completely eliminated.

Impact of Different Factors

Experimental graphs illustrate how task pressure (R), synchronization delay (G), machine score variance (V), and partition count affect conflicts.

Key observations: higher task arrival rates require more slots to keep conflict stable; larger synchronization delays have a similar effect; machine performance variance also increases conflict; the number of partitions has negligible impact.

Proposed Solution: Partitioned Synchronization (ParSync)

To eliminate conflicts, the authors first discuss a pessimistic lock strategy, which guarantees no conflict but wastes resources, and a lock‑contention approach, which suffers from high overhead under massive concurrency.

ParSync divides the cluster into P partitions (P > N) and lets each scheduler update a different partition in a round‑robin fashion. This reduces synchronization latency for the active partition and avoids conflicts because no two schedulers sync the same partition simultaneously.

Three scheduling policies are built on ParSync:

Latency‑first : prioritize the most recently synchronized partition.

Quality‑first : prioritize machines with the highest score.

Adaptive : start with quality‑first and switch to latency‑first when waiting time exceeds a threshold.

Implementation and Experiments

The authors implemented the three policies and evaluated them in a “wind‑tunnel” testbed that simulates 20 % of a real cluster (1:500 scale). The environment includes 20 schedulers, 2 resource managers, 200 k slots, and a scheduler capacity of 40 k tasks/s. Workloads are applied in three phases with varying pressure (50 %, 80 %, 95 % of capacity).

Results show that ParSync’s latency‑first policy keeps scheduling latency low across all phases, while quality‑first maintains high scheduling quality comparable to the baseline StateSync (Omega). The adaptive policy offers a balanced trade‑off between latency and quality.

Conclusion

The paper first surveys common multi‑scheduler designs and identifies conflict as a major issue in shared‑state models. By analytically and experimentally demonstrating that increasing resources or scheduler count reduces conflict, the authors propose ParSync as a practical solution that requires no additional hardware. ParSync achieves comparable quality to existing approaches while significantly improving latency, making it suitable for ultra‑large production clusters.

Appendix

Production cluster resource scheduling architecture diagram.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

cloud computingResource Managementlarge-scale systemsCluster Scheduling
Alibaba Cloud Developer
Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.