Alibaba Zeus Resource Scheduling System: Architecture, Virtualization, and Operational Practices
The article examines Alibaba's Zeus resource scheduling platform, detailing its background, problem analysis, container‑based virtualization, distributed architecture, strategies for improving resource utilization such as overselling and hybrid deployment, as well as stability measures and automation for large‑scale operations.
This article provides a comprehensive overview of Alibaba's Zeus resource scheduling system, covering its background, problem analysis, required knowledge, engineering practice, objectives, and lessons learned, focusing on practical details rather than groundbreaking architectural innovations.
The background analysis identifies several sources of resource waste in large‑scale data centers: chaotic resource management, mismatches between requested and actual usage, idle capacity during off‑peak traffic periods, hardware failures, and inefficiencies caused by layered system dependencies.
Zeus is introduced as a unified scheduler that abstracts physical server resources, handles application allocation requests, reduces costs through overselling and hybrid deployment, and enhances system stability by monitoring and isolating faulty hardware.
Virtualization is achieved with LXC containers, which partition and isolate CPU, memory, disk, and network resources. Containers allow API‑driven allocation and provide consistent data for forecasting, although the article acknowledges their limitations, such as isolation challenges and gaps in tooling.
The technical architecture follows a distributed two‑layer model with an idle‑first allocation policy, subject to constraints such as load balancing and overselling. It is implemented in Go (Golang) and supports both online services and offline jobs through mixed deployment, in coordination with the surrounding systems.
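The idle‑first policy with constraints can be illustrated with a small sketch. This is not Zeus's actual code: the `Host` type, the `PickHost` function, the per‑host instance cap used as a stand‑in for the load‑balancing constraint, and the CPU‑only resource model are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Host is a hypothetical view of one physical server as seen by the scheduler.
type Host struct {
	Name     string
	TotalCPU int            // physical cores
	UsedCPU  int            // cores already promised to containers
	Oversell float64        // multiplier applied to TotalCPU, e.g. 1.5
	Apps     map[string]int // instances per application on this host
}

// capacity returns the schedulable cores after applying the oversell coefficient.
func (h *Host) capacity() int { return int(float64(h.TotalCPU) * h.Oversell) }

// idle returns the schedulable cores that are not yet allocated.
func (h *Host) idle() int { return h.capacity() - h.UsedCPU }

// ErrNoHost is returned when no host passes the constraint filter.
var ErrNoHost = errors.New("no host satisfies the request")

// PickHost filters hosts by the constraints (enough idle CPU, at most
// maxPerHost instances of the same app per host), then chooses the host
// with the most idle CPU and records the allocation on it.
func PickHost(hosts []*Host, app string, cpu, maxPerHost int) (*Host, error) {
	var candidates []*Host
	for _, h := range hosts {
		if h.idle() >= cpu && h.Apps[app] < maxPerHost {
			candidates = append(candidates, h)
		}
	}
	if len(candidates) == 0 {
		return nil, ErrNoHost
	}
	// Idle-first: the host with the most idle capacity sorts to the front.
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].idle() > candidates[j].idle()
	})
	best := candidates[0]
	best.UsedCPU += cpu
	if best.Apps == nil {
		best.Apps = make(map[string]int)
	}
	best.Apps[app]++
	return best, nil
}

func main() {
	hosts := []*Host{
		{Name: "h1", TotalCPU: 32, UsedCPU: 30, Oversell: 1.0},
		{Name: "h2", TotalCPU: 32, UsedCPU: 8, Oversell: 1.0},
	}
	h, err := PickHost(hosts, "web", 4, 2)
	if err != nil {
		fmt.Println("allocation failed:", err)
		return
	}
	fmt.Println("placed on", h.Name)
}
```

Filtering before ranking keeps the constraints hard (a host that violates one is never chosen), while the idle‑first ordering only decides among hosts that already qualify.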
Resource‑utilization techniques include dynamic overselling coefficients to balance waste and competition, hybrid deployment that merges online services with offline jobs to improve host utilization, and a balanced approach between whole‑host packing and scattered allocation to reduce fragmentation.
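One way to picture a dynamic overselling coefficient is as a ratio between the utilization the operator is willing to tolerate and the utilization actually observed. The sketch below is a hypothetical rule, not the formula used by Zeus: the function names, the `targetUtil / peakUtil` ratio, and the clamping bounds are assumptions chosen to show the trade‑off between waste (coefficient too low) and competition (coefficient too high).

```go
package main

import "fmt"

// OversellCoefficient returns a multiplier for a host's physical capacity.
// peakUtil is the observed peak usage as a fraction of what containers
// requested (0.25 means they used at most 25% of their request).
// targetUtil is the real host utilization the operator tolerates.
// The result is clamped to [1, maxCoeff]. This rule is an illustrative
// assumption, not the source system's formula.
func OversellCoefficient(peakUtil, targetUtil, maxCoeff float64) float64 {
	if peakUtil <= 0 {
		// No usage data: fall back to the most aggressive allowed value.
		return maxCoeff
	}
	c := targetUtil / peakUtil
	if c < 1 {
		return 1 // never sell less than the physical capacity
	}
	if c > maxCoeff {
		return maxCoeff // cap the coefficient to limit resource competition
	}
	return c
}

// SchedulableCores applies a coefficient to a physical core count.
func SchedulableCores(physical int, coeff float64) int {
	return int(float64(physical) * coeff)
}

func main() {
	coeff := OversellCoefficient(0.25, 0.5, 3)
	fmt.Println("coefficient:", coeff)
	fmt.Println("schedulable cores on a 32-core host:", SchedulableCores(32, coeff))
}
```

Recomputing the coefficient from recent monitoring data is what would make it dynamic: as observed peak usage rises, the ratio shrinks and the scheduler sells less capacity.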
Stability is ensured through real‑time fault detection, automatic remediation, blacklist mechanisms, and various scheduling strategies illustrated in the accompanying diagram, with special provisions for large‑scale promotional events such as Double‑11.
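A blacklist mechanism of the kind mentioned above can be sketched as a concurrency‑safe set of host names that the scheduler consults before placing work. The `Blacklist` type and its methods are hypothetical names for illustration; the source does not describe how Zeus stores or expires blacklist entries.

```go
package main

import (
	"fmt"
	"sync"
)

// Blacklist is a hypothetical, concurrency-safe record of hosts that
// fault detection has flagged, keyed by host name with a reason.
type Blacklist struct {
	mu      sync.RWMutex
	reasons map[string]string
}

// NewBlacklist returns an empty blacklist.
func NewBlacklist() *Blacklist {
	return &Blacklist{reasons: make(map[string]string)}
}

// Add marks a host as unschedulable and records why.
func (b *Blacklist) Add(host, reason string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.reasons[host] = reason
}

// Remove clears a host, for example after remediation succeeds.
func (b *Blacklist) Remove(host string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.reasons, host)
}

// Contains reports whether a host is currently blacklisted.
func (b *Blacklist) Contains(host string) bool {
	b.mu.RLock()
	defer b.mu.RUnlock()
	_, ok := b.reasons[host]
	return ok
}

// Filter returns the hosts that are not blacklisted, preserving order.
func (b *Blacklist) Filter(hosts []string) []string {
	b.mu.RLock()
	defer b.mu.RUnlock()
	var healthy []string
	for _, h := range hosts {
		if _, bad := b.reasons[h]; !bad {
			healthy = append(healthy, h)
		}
	}
	return healthy
}

func main() {
	bl := NewBlacklist()
	bl.Add("h2", "disk I/O errors")
	fmt.Println(bl.Filter([]string{"h1", "h2", "h3"}))
	bl.Remove("h2") // host repaired, eligible again
	fmt.Println(bl.Filter([]string{"h1", "h2", "h3"}))
}
```

Keeping the blacklist separate from the allocation policy means fault detection can flag or clear a host at any time without the scheduler needing to know why the host is unavailable.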
Operations automation covers automated fault handling, reduced manual intervention, higher scaling success rates, and multi‑dimensional tools that simplify resource control for operators, which together deliver significant cost savings and reliability improvements.
The content is sourced from the WeChat public account “蝙蝠遐想”.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
