Meituan's Cloud‑Native Cluster Scheduling System: Design, Challenges, and Future Directions

Meituan’s cloud‑native cluster scheduling system is built on a customized Kubernetes engine. It unifies multi‑cluster management, raises CPU utilization, reduces costs, and improves stability. Its design balances throughput, complexity, and reliability while addressing large‑scale deployment, fault tolerance, and dynamic resource allocation.

Meituan Technology Team

This article presents Meituan's practice in solving large‑scale cluster management and designing an efficient cluster scheduler, focusing on cloud‑native technologies such as Kubernetes. It outlines the problems, challenges, and strategies Meituan adopted when deploying cloud‑native solutions.

Introduction

Cluster schedulers are critical to data‑center operations. As cluster size and application count grow, the complexity developers face grows with them. This article addresses how to manage massive clusters, design a high‑quality scheduler, ensure stability, reduce cost, and improve efficiency.

Cluster Scheduler Overview

A cluster scheduler (or data‑center resource scheduler) allocates resources and schedules tasks. Well‑known systems include OpenStack, YARN, Mesos, Kubernetes, Google Borg, Microsoft Apollo, Baidu Matrix, and Alibaba Fuxi.
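The two responsibilities named above, allocating resources and scheduling tasks, can be illustrated with a minimal filter-then-score sketch. This is not Meituan's implementation or the Kubernetes scheduler; the `Node` and `Task` types, the field names, and the spreading score are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_free: float  # free CPU cores (hypothetical unit)
    mem_free: float  # free memory in GiB (hypothetical unit)


@dataclass
class Task:
    name: str
    cpu: float
    mem: float


def schedule(task, nodes):
    """Place the task on a node and reserve its resources; return None if nothing fits."""
    # Filter step: keep only nodes with enough free CPU and memory.
    feasible = [n for n in nodes if n.cpu_free >= task.cpu and n.mem_free >= task.mem]
    if not feasible:
        return None
    # Score step (assumed policy): prefer the node with the most CPU left after placement.
    best = max(feasible, key=lambda n: n.cpu_free - task.cpu)
    # Allocation step: record the reservation on the chosen node.
    best.cpu_free -= task.cpu
    best.mem_free -= task.mem
    return best


nodes = [Node("node-a", 4.0, 16.0), Node("node-b", 8.0, 8.0)]
print(schedule(Task("web", 2.0, 4.0), nodes).name)  # prints node-b
```

A production scheduler applies many more filters and scoring terms, but the split into feasibility filtering, scoring, and reservation is a common way to structure the problem.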

Challenges of Large‑Scale Cluster Management

Two core difficulties are handling massive deployments across data centers and building a cloud‑native operating system that improves the compute service experience:

How to manage large‑scale deployments with elastic, high‑utilization scheduling while preserving service quality.

How to transform the underlying infrastructure into a cloud‑native OS that automates disaster recovery, deployment, and upgrades.

Operational Challenges

Four major challenges are:

Meeting diverse user demands quickly while keeping the platform generic.

Improving resource utilization without sacrificing QoS.

Providing automatic fault handling for stateful services across multi‑data‑center or multi‑cloud environments.

Managing the complexity and stability risks of very large or numerous clusters.

Design Trade‑offs

When designing a scheduler, trade‑offs include:

Throughput vs. scheduling quality – quality is prioritized for long‑running services.

Architectural complexity vs. scalability – more features increase complexity.

Reliability vs. single‑cluster size – larger clusters raise failure impact.
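The third trade‑off can be made concrete with a small piece of arithmetic. The sketch below assumes, as a simplification, that a control‑plane failure affects every node in one cluster and that nodes are split evenly across clusters; the fleet size of 10,000 nodes is a made‑up number, not a figure from Meituan.

```python
def blast_radius(total_nodes, clusters):
    """Fraction of the fleet affected when one cluster fails (even split assumed)."""
    nodes_per_cluster = total_nodes / clusters
    return nodes_per_cluster / total_nodes


# Hypothetical fleet of 10,000 nodes split into 1, 4, or 10 clusters.
for k in (1, 4, 10):
    print(f"{k} cluster(s): {blast_radius(10_000, k):.0%} of nodes affected by one failure")
```

Splitting the fleet shrinks the impact of any single failure, but it also multiplies the number of clusters to operate, which is the complexity cost listed among the operational challenges.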

Scheduler Architecture Classification

Schedulers can be classified as monolithic, two‑level, shared‑state, distributed, or hybrid. Each has strengths and weaknesses depending on workload characteristics.

Meituan's Scheduler Evolution

Meituan migrated from OpenStack to Kubernetes, achieving >98% containerization by the end of 2019, yet still faced low resource utilization and high operational cost. The new system focuses on stability, cost reduction, and efficiency.

Stability: improve robustness and observability, decouple modules, and enhance multi‑cluster automation.

Cost Reduction: optimize scheduling models, shift from static to dynamic allocation, and increase CPU utilization.

Efficiency: enable self‑service policy adjustments, support PaaS components, and streamline operations.
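The shift from static to dynamic allocation mentioned under cost reduction can be sketched as follows. The article does not describe Meituan's actual model, so the policy here (reserve the observed peak plus a headroom factor, capped at the original request) and the names `dynamic_reservation` and `headroom` are assumptions for illustration only.

```python
def static_reservation(requested_cores):
    """Static model: always reserve exactly what the service requested."""
    return requested_cores


def dynamic_reservation(requested_cores, usage_samples, headroom=0.2):
    """Assumed dynamic model: reserve observed peak usage plus headroom, capped at the request."""
    if not usage_samples:
        # No usage history yet, so fall back to the static request.
        return requested_cores
    peak = max(usage_samples)
    return min(requested_cores, peak * (1 + headroom))


requested = 8.0
samples = [1.0, 1.5, 2.0]  # hypothetical observed CPU usage, in cores
print("static :", static_reservation(requested))
print("dynamic:", round(dynamic_reservation(requested, samples), 2))
```

Under these assumptions, the cores released by the dynamic model become available to other workloads, which is how a change in the allocation model can raise CPU utilization. A real system would also need safeguards for sudden load spikes.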

Multi‑Cluster Unified Scheduling

By unifying scheduling across clusters, Meituan increased CPU utilization by ~10 percentage points, reduced the number of hotspot hosts, and reduced resource fragmentation.
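One way to picture unified scheduling is a placement function that considers hosts from every cluster at once instead of one cluster at a time. The sketch below is an assumption-laden illustration, not Meituan's algorithm: the `Host` type, the `HOTSPOT_THRESHOLD` value, and the best-fit rule are all hypothetical.

```python
from dataclasses import dataclass

HOTSPOT_THRESHOLD = 0.8  # hypothetical: never push a host above 80% CPU utilization


@dataclass
class Host:
    cluster: str
    name: str
    cpu_total: float
    cpu_used: float


def place(cpu_request, clusters):
    """Choose a host across all clusters; clusters maps cluster name to a list of Host."""
    # Pool hosts from every cluster and drop those that would become hotspots.
    candidates = [
        h
        for hosts in clusters.values()
        for h in hosts
        if (h.cpu_used + cpu_request) / h.cpu_total <= HOTSPOT_THRESHOLD
    ]
    if not candidates:
        return None
    # Best fit (assumed policy): leave the smallest free remainder to limit fragmentation.
    best = min(candidates, key=lambda h: h.cpu_total - h.cpu_used - cpu_request)
    best.cpu_used += cpu_request
    return best


clusters = {
    "c1": [Host("c1", "h1", 16.0, 12.0)],
    "c2": [Host("c2", "h2", 16.0, 4.0), Host("c2", "h3", 16.0, 8.0)],
}
chosen = place(2.0, clusters)
print(chosen.cluster, chosen.name)  # prints c2 h3
```

Here `h1` is skipped because the placement would take it past the threshold, and `h3` wins over `h2` because it leaves less unused CPU behind.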

Scheduling Engine Service (MKE)

Meituan built a customized Kubernetes engine (MKE) that enhances cluster operations, provides self‑healing and alerting, and integrates with PaaS services. It also offers a unified scheduling and orchestration framework.

Future Outlook – Cloud‑Native Operating System

Future work includes application‑centric delivery management, edge‑computing solutions, and mixed‑workload (online + offline) capabilities to evolve toward a cloud‑native OS.

Conclusion

Meituan’s scheduler balances throughput, complexity, and reliability through multi‑cluster unified scheduling, dynamic resource models, and a strong Kubernetes foundation.
