How Partitioned Synchronization Scales Alibaba’s Massive Cloud Clusters
At USENIX ATC 2021, Alibaba Cloud’s Fuxi 2.0 team presented Best Paper award‑winning research showing how a partitioned‑synchronization (ParSync) scheduling architecture reduces conflicts and latency in ultra‑large production clusters, balancing efficiency, quality, and fairness without adding resources.
Introduction
At the 2021 USENIX ATC conference, Alibaba Cloud’s Feitian Fuxi team and the Chinese University of Hong Kong presented the paper “Scaling Large Production Clusters with Partitioned Synchronization”. The paper was selected as one of the three Best Papers, marking the first time a Chinese company received this honor.
Paper Background
AI and big‑data workloads drive rapid growth in compute demand, pushing cloud clusters beyond ten‑thousand nodes. A typical production cluster may contain up to 100,000 machines and execute billions of short‑lived tasks daily. The scheduler, which matches multidimensional resource requests to machines, becomes a scalability bottleneck as cluster size and concurrency increase.
Current Situation Analysis
In Alibaba’s production environment, daily job counts range from 3.34 M to 4.36 M, with each job comprising many tasks (3.1 B–4.4 B per day). Over 87 % of tasks finish within 10 seconds, creating a massive scheduling load.
Traditional single‑master architectures cannot handle such scale because heartbeat latency grows and scheduling complexity explodes. Static partitioning of a large cluster into smaller ones introduces resource fragmentation and operational overhead.
Scheduling Goals and Challenges
Scheduling efficiency (low latency)
Scheduling quality (resource constraints, data locality, hardware preferences)
Fairness and priority among tenants
High resource utilization
These goals often conflict: improving quality may increase latency, and strict fairness can reduce utilization.
Theoretical Overview
The authors surveyed existing multi‑scheduler models and selected the Omega shared‑state approach for further analysis. They modeled scheduling conflicts and derived the expected conflict:
[Formula: expected conflict Y_i in terms of N, K, and S]
In the formula, Y_i is the expected conflict for a slot, N is the number of schedulers, K is each scheduler’s processing capacity, and S is the number of schedulable slots. Increasing S lowers the conflict probability. Adding schedulers (a larger N) raises the raw number of conflicts, but because each scheduler’s load decreases proportionally, overall conflict is ultimately lower.
The analysis proves that with more than one scheduler, conflicts cannot be completely eliminated.
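The conflict model can be explored numerically. The sketch below is an illustration under an assumed model, not the paper's derivation: within one synchronization interval, each of the N schedulers places K tasks on K distinct slots chosen uniformly at random from S, unaware of the other schedulers' choices, and every claim on a slot beyond the first counts as one conflict. Under that assumption the number of claims on a slot is Binomial(N, K/S), which gives an expected NK/S - 1 + (1 - K/S)^N conflicts per slot: zero for N = 1 and positive for N > 1. The function names are hypothetical.

```python
import random


def expected_conflicts_per_slot(n_schedulers, k_tasks, s_slots):
    """Closed form under the ASSUMED model: claims per slot ~ Binomial(N, K/S),
    conflicts per slot = max(claims - 1, 0)."""
    p = k_tasks / s_slots
    return n_schedulers * p - 1 + (1 - p) ** n_schedulers


def simulate_conflicts_per_slot(n_schedulers, k_tasks, s_slots, rounds=200, seed=0):
    """Monte Carlo estimate of the same quantity, averaged over `rounds` intervals."""
    rng = random.Random(seed)
    conflicts = 0
    for _ in range(rounds):
        taken = set()  # slots already claimed in this interval
        for _ in range(n_schedulers):
            # Each scheduler picks K distinct slots from its own (stale) view.
            for slot in rng.sample(range(s_slots), k_tasks):
                if slot in taken:
                    conflicts += 1  # another scheduler got there first
                else:
                    taken.add(slot)
    return conflicts / (rounds * s_slots)


if __name__ == "__main__":
    for slots in (1_000, 10_000):
        model = expected_conflicts_per_slot(4, 50, slots)
        sim = simulate_conflicts_per_slot(4, 50, slots)
        print(f"S={slots}: model={model:.5f} simulated={sim:.5f}")
```

Comparing the two functions for the same arguments gives a quick sanity check of the closed form, and raising `s_slots` shows the per-slot conflict rate falling.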
Impact of Different Factors
Experimental graphs illustrate how task pressure (R), synchronization delay (G), machine score variance (V), and partition count affect conflicts.
Key observations: higher task arrival rates require more slots to keep conflict stable; larger synchronization delays have a similar effect; machine performance variance also increases conflict; the number of partitions has negligible impact.
Proposed Solution: Partitioned Synchronization (ParSync)
As ways to avoid conflicts, the authors first discuss a pessimistic locking strategy, which guarantees no conflicts but wastes resources, and a lock‑contention approach, which suffers from high overhead under massive concurrency.
ParSync divides the cluster into P partitions (P > N) and lets each scheduler update a different partition in a round‑robin fashion. This reduces synchronization latency for the active partition and avoids conflicts because no two schedulers sync the same partition simultaneously.
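The round-robin rotation can be sketched as follows. This is a minimal illustration, not the Fuxi implementation: the offset rule `(scheduler_id + round_no) % num_partitions` is an assumption chosen because it is the simplest rotation that gives every scheduler a different partition in each round, and the function names are hypothetical.

```python
def partition_to_sync(scheduler_id, round_no, num_partitions):
    """Partition that a scheduler refreshes in a given round (assumed offset rule)."""
    return (scheduler_id + round_no) % num_partitions


def sync_schedule(num_schedulers, num_partitions, num_rounds):
    """Return one row per round; row[i] is the partition scheduler i syncs."""
    if num_partitions <= num_schedulers:
        # The article states P > N, so the sketch enforces the same condition.
        raise ValueError("need more partitions than schedulers (P > N)")
    return [
        [partition_to_sync(s, r, num_partitions) for s in range(num_schedulers)]
        for r in range(num_rounds)
    ]


if __name__ == "__main__":
    for round_no, row in enumerate(sync_schedule(3, 5, 5)):
        print(f"round {round_no}: scheduler -> partition {row}")
```

With 3 schedulers and 5 partitions, round 0 assigns partitions `[0, 1, 2]` and round 1 assigns `[1, 2, 3]`. After P rounds every scheduler has refreshed every partition once.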
Three scheduling policies are built on ParSync:
Latency‑first: prioritize the most recently synchronized partition.
Quality‑first: prioritize machines with the highest score.
Adaptive: start with quality‑first and switch to latency‑first when waiting time exceeds a threshold.
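The three policies can be sketched as slot-selection rules. This is a simplified illustration under stated assumptions, not the paper's implementation: a `Slot` carries only its partition and a machine score, `last_sync` maps each partition to the time of its latest synchronization (larger means fresher), and ties are broken by the other criterion. All names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    partition: int  # partition the slot's machine belongs to
    score: float    # machine score for the task (higher is better)


def latency_first(free_slots, last_sync):
    """Prefer slots in the most recently synchronized partition."""
    return max(free_slots, key=lambda s: (last_sync[s.partition], s.score))


def quality_first(free_slots, last_sync):
    """Prefer the highest-scoring slot, regardless of how fresh its partition is."""
    return max(free_slots, key=lambda s: (s.score, last_sync[s.partition]))


def adaptive(free_slots, last_sync, waited, threshold):
    """Use quality-first until the task has waited longer than the threshold."""
    if waited > threshold:
        return latency_first(free_slots, last_sync)
    return quality_first(free_slots, last_sync)


if __name__ == "__main__":
    slots = [Slot(0, 0.9), Slot(1, 0.5), Slot(1, 0.7)]
    last_sync = {0: 10.0, 1: 12.0}  # partition 1 was refreshed more recently
    print(latency_first(slots, last_sync))
    print(quality_first(slots, last_sync))
    print(adaptive(slots, last_sync, waited=2.0, threshold=1.0))
```

The trade-off is visible in the example data: the best-scoring slot sits in the staler partition, where a scheduler's view is more likely to be out of date, so choosing it risks a conflict and a retry, while the fresher partition offers a lower score.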
Implementation and Experiments
The authors implemented the three policies and evaluated them in a “wind‑tunnel” testbed that simulates 20 % of a real cluster (1:500 scale). The environment includes 20 schedulers, 2 resource managers, 200 k slots, and a scheduler capacity of 40 k tasks/s. Workloads are applied in three phases with varying pressure (50 %, 80 %, 95 % of capacity).
Results show that ParSync’s latency‑first policy keeps scheduling latency low across all phases, while quality‑first maintains high scheduling quality comparable to the baseline StateSync (Omega). The adaptive policy offers a balanced trade‑off between latency and quality.
Conclusion
The paper first surveys common multi‑scheduler designs and identifies conflict as a major issue in shared‑state models. Its analysis and experiments show that conflict can be reduced by adding resources or schedulers; ParSync is proposed as a practical alternative that requires no additional hardware. It achieves scheduling quality comparable to existing approaches while significantly improving latency, making it suitable for ultra‑large production clusters.
Appendix
Production cluster resource scheduling architecture diagram.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.