Meituan's Cloud‑Native Cluster Scheduling System: Design, Challenges, and Future Directions
Meituan’s cloud‑native cluster scheduling system, built on a customized Kubernetes engine, unifies multi‑cluster management, improves CPU utilization, reduces costs, and enhances stability by balancing throughput, complexity, and reliability while addressing large‑scale deployment, fault‑tolerance, and dynamic resource allocation challenges.
This article presents Meituan's practice in solving large‑scale cluster management and designing an efficient cluster scheduler, focusing on cloud‑native technologies such as Kubernetes. It outlines the problems, challenges, and strategies Meituan adopted when deploying cloud‑native solutions.
Introduction
Cluster schedulers are critical in data‑center operations. As cluster size and application count grow, developers face increasing complexity. The article aims to answer how to manage massive clusters, design a high‑quality scheduler, ensure stability, reduce cost, and improve efficiency.
Cluster Scheduler Overview
A cluster scheduler (or data‑center resource scheduler) allocates resources and schedules tasks. Well‑known systems include OpenStack, YARN, Mesos, Kubernetes, Google Borg, Microsoft Apollo, Baidu Matrix, and Alibaba Fuxi.
Challenges of Large‑Scale Cluster Management
Two core difficulties are managing massive deployments across data centers and building a cloud‑native operating system that improves the experience of using compute services.
How to manage large‑scale deployments with elastic, high‑utilization scheduling while preserving service quality.
How to transform the underlying infrastructure into a cloud‑native OS that automates disaster recovery, deployment, and upgrades.
Operational Challenges
Four major challenges are:
Meeting diverse user demands quickly while keeping the platform generic.
Improving resource utilization without sacrificing QoS.
Providing automatic fault handling for stateful services across multi‑data‑center or multi‑cloud environments.
Managing the complexity and stability risks of very large or numerous clusters.
Design Trade‑offs
When designing a scheduler, trade‑offs include:
Throughput vs. scheduling quality – quality is prioritized for long‑running services.
Architectural complexity vs. scalability – more features increase complexity.
Reliability vs. single‑cluster size – larger clusters raise failure impact.
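The first trade‑off can be illustrated with a minimal Python sketch. It is not Meituan's scheduler: the `Node` type, the `pick_node` function, and the `sample_fraction` parameter are hypothetical names chosen for illustration. The idea is that scoring only a portion of the feasible nodes makes each decision cheaper (higher throughput), while scoring all of them gives a better placement (higher quality).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_cpu: float  # CPU cores still available on the node


def pick_node(nodes: List[Node], request_cpu: float,
              sample_fraction: float = 1.0) -> Optional[Node]:
    """Best-fit placement over a leading fraction of the feasible nodes.

    sample_fraction=1.0 scores every feasible node (best quality);
    smaller values score fewer nodes (less work per decision).
    """
    feasible = [n for n in nodes if n.free_cpu >= request_cpu]
    if not feasible:
        return None
    # Always score at least one node so a placement is possible.
    limit = max(1, int(len(feasible) * sample_fraction))
    candidates = feasible[:limit]
    # Best fit: choose the node that leaves the least CPU unused.
    return min(candidates, key=lambda n: n.free_cpu - request_cpu)
```

With nodes offering 8, 4, and 2 free cores and a 2‑core request, scoring every node picks the 2‑core node (a tight fit), while scoring only about a third of them settles for the first feasible node. For long‑running services, where a placement lasts a long time, the extra scoring work is usually worth paying.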
Scheduler Architecture Classification
Schedulers can be classified as monolithic, two‑level, shared‑state, distributed, or hybrid. Each has strengths and weaknesses depending on workload characteristics.
Meituan's Scheduler Evolution
Meituan migrated from OpenStack to Kubernetes, achieving >98% containerization by the end of 2019, yet still faced low resource utilization and high operational cost. The new system focuses on stability, cost reduction, and efficiency.
Stability: improve robustness and observability, decouple modules, and enhance multi‑cluster automation.
Cost Reduction: optimize scheduling models, shift from static to dynamic allocation, and increase CPU utilization.
Efficiency: enable self‑service policy adjustments, support PaaS components, and streamline operations.
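The shift from static to dynamic allocation mentioned under cost reduction can be sketched as follows. This is an assumption‑laden illustration, not the model Meituan uses: the function names, the `headroom` parameter, and the rule "reserve observed peak usage plus a safety margin, capped at the request" are all hypothetical.

```python
from typing import List, Tuple


def static_allocatable(capacity: float, requests: List[float]) -> float:
    """Static model: every service permanently holds its full request."""
    return capacity - sum(requests)


def dynamic_allocatable(capacity: float,
                        services: List[Tuple[float, float]],
                        headroom: float = 0.2) -> float:
    """Dynamic model (hypothetical rule): reserve each service's observed
    peak usage plus a headroom fraction, never more than its request.

    services is a list of (requested_cpu, observed_peak_cpu) pairs.
    """
    reserved = sum(min(req, peak * (1 + headroom)) for req, peak in services)
    return capacity - reserved
```

On a 32‑core host running three services that each request 8 cores but peak at 2, 3, and 4 cores, the static model leaves 8 cores schedulable, while this dynamic rule leaves roughly 21 cores. That gap is where the CPU utilization gain comes from; the headroom is what protects service quality when usage spikes.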
Multi‑Cluster Unified Scheduling
By unifying scheduling across clusters, Meituan increased CPU utilization by ~10 percentage points, reduced the number of hotspot hosts, and reduced resource fragmentation.
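One way to picture unified scheduling is a single placement decision made over the hosts of all clusters at once. The sketch below is a simplified assumption, not Meituan's algorithm: the `Host` type, the `place` function, and the `hotspot_threshold` value are hypothetical. It skips any host that the request would push over the threshold and, among the rest, picks the one with the lowest resulting utilization.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    cluster: str
    name: str
    capacity: float  # total CPU cores
    used: float      # CPU cores currently in use


def place(hosts: List[Host], request: float,
          hotspot_threshold: float = 0.8) -> Optional[Host]:
    """Place a request on the best host across all clusters.

    Hosts that would exceed hotspot_threshold are skipped; among the
    remaining hosts, the one with the lowest resulting utilization wins.
    """
    best: Optional[Host] = None
    best_after = 0.0
    for host in hosts:
        after = (host.used + request) / host.capacity
        if after > hotspot_threshold:
            continue  # placing here would create a hotspot host
        if best is None or after < best_after:
            best, best_after = host, after
    if best is not None:
        best.used += request  # record the placement
    return best
```

Because the candidate set spans clusters, a request that would overload a busy host in one cluster can land on an idle host in another. A scheduler that sees only one cluster cannot make that choice.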
Scheduling Engine Service (MKE)
Meituan built a customized Kubernetes engine (MKE) that enhances cluster operations, provides self‑healing and alerting, and integrates with PaaS services. It also offers a unified scheduling and orchestration framework.
Future Outlook – Cloud‑Native Operating System
Future work includes application‑centric delivery management, edge‑computing solutions, and mixed‑workload (online + offline) capabilities to evolve toward a cloud‑native OS.
Conclusion
Meituan’s scheduler balances throughput, complexity, and reliability through multi‑cluster unified scheduling, dynamic resource models, and a strong Kubernetes foundation.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
