Meituan’s Migration from OpenStack to Kubernetes: Large‑Scale Cloud‑Native Infrastructure, Challenges and Practices
Meituan migrated its cloud infrastructure from OpenStack to Kubernetes and containerized over 98% of its services. Along the way it built custom scheduling, NUMA-aware placement, fine-grained resource isolation, and an internal management platform. The result was stability above 99.99%, lower costs, and a path toward unified VM-container scheduling and broader cloud-native workloads.
Kubernetes has become the core management engine of Meituan Cloud’s infrastructure, delivering efficient resource management, cost reduction, and a solid foundation for cloud‑native architectures such as Serverless and distributed databases.
1. Background and Current Status
Kubernetes is the de facto standard for large-scale container orchestration. Meituan started building its cloud platform on virtualization in 2013, introduced a container platform (Hulk 1.0) in 2016, evolved to the Kubernetes-based Hulk 2.0 in 2018, and completed the containerization of its entire infrastructure by the end of 2019. By 2020 the containerization rate exceeded 98%, spanning dozens of clusters, tens of thousands of nodes, and hundreds of thousands of Pods. The maximum size of a single cluster is capped at 5,000 nodes for disaster-recovery reasons.
2. Transition from OpenStack to Kubernetes – Obstacles and Benefits
During the OpenStack era Meituan faced several problems:
Complex architecture making operations difficult.
Inconsistent environments before container images.
High resource overhead of virtualization (≈10% of host resources).
Long provisioning and reclamation cycles.
Severe resource waste during traffic peaks.
Hulk 1.0, built on top of OpenStack, alleviated many of these issues but introduced new challenges: instability, capability gaps, limited scalability, and performance constraints.
By adopting native Kubernetes APIs in the new Hulk platform, Meituan decoupled application management from the control layer, leveraged Kubernetes’ powerful scheduling and resource management, and reduced operational costs while accelerating resource convergence.
2.1 Containerization Process and Challenges
Key challenges included:
Stability issues due to dual‑layer scheduling.
Limited capabilities and poor extensibility.
Poor scalability of the control plane.
Performance bottlenecks and interference caused by weak isolation.
To address these, Meituan introduced a strategy engine for custom scheduling policies, a reuse-based container restart strategy, NUMA-aware placement, and fine-grained resource isolation for CPU, memory, and disk.
2.2 Advanced Scheduling and Operations
Meituan supports heterogeneous workloads (SSD, high‑memory, high‑IO, etc.) and custom dispersion strategies (e.g., rack, service dependencies). A policy engine allows applications to declare requirements via APPKEY, automatically tags Pods, and enforces the policies in Kubernetes.
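As a rough illustration of the policy-engine idea, the sketch below looks up a policy by APPKEY and writes the matching constraints into a Pod manifest. The `Policy` class, the `POLICIES` table, the `appkey` label, and the `example.com/rack` topology key are all hypothetical names chosen for this example; only the `nodeSelector` and `podAntiAffinity` fields are standard Kubernetes Pod spec fields. It is not Meituan's implementation.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Policy:
    """Requirements an application declares (hypothetical schema)."""
    node_labels: Dict[str, str] = field(default_factory=dict)  # e.g. SSD, high memory
    spread_by: Optional[str] = None  # topology key to disperse replicas across


# Hypothetical policy table keyed by APPKEY.
POLICIES = {
    "com.example.orders": Policy({"disk": "ssd"}, "example.com/rack"),
    "com.example.cache": Policy({"memory": "high"}),
}


def apply_policy(pod: dict, policies: Dict[str, Policy] = POLICIES) -> dict:
    """Return a copy of the Pod manifest with the APPKEY's policy applied."""
    pod = copy.deepcopy(pod)
    appkey = pod["metadata"]["labels"]["appkey"]
    policy = policies.get(appkey)
    if policy is None:
        return pod  # no declared requirements: leave the Pod untouched

    spec = pod.setdefault("spec", {})
    if policy.node_labels:
        # Heterogeneous hardware: restrict the Pod to matching nodes.
        spec.setdefault("nodeSelector", {}).update(policy.node_labels)
    if policy.spread_by:
        # Dispersion: keep replicas of the same app in different domains.
        term = {
            "labelSelector": {"matchLabels": {"appkey": appkey}},
            "topologyKey": policy.spread_by,
        }
        anti = spec.setdefault("affinity", {}).setdefault("podAntiAffinity", {})
        anti.setdefault(
            "requiredDuringSchedulingIgnoredDuringExecution", []
        ).append(term)
    return pod
```

In a real cluster this kind of logic would more likely run in an admission webhook or in the platform layer that creates Pods, so that applications only ever state their APPKEY.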
Resource isolation is achieved through dedicated CPU sets, exclusive disk allocation, and per‑cluster resource pools, enabling precise control over performance‑sensitive services.
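To make the dedicated-CPU-set idea concrete, here is a minimal sketch of NUMA-aware exclusive CPU allocation: all CPUs for a container come from a single NUMA node, so the workload avoids cross-node memory access. The function name and the best-fit rule (prefer the node with the fewest free CPUs that still fits) are assumptions made for illustration; the source does not describe the actual allocation algorithm.

```python
from typing import Dict, List, Optional, Set


def allocate_exclusive_cpus(
    free_by_node: Dict[int, Set[int]], count: int
) -> Optional[List[int]]:
    """Reserve `count` CPUs from one NUMA node, or return None if none fits.

    `free_by_node` maps a NUMA node id to its currently free CPU ids and is
    updated in place when an allocation succeeds.
    """
    # Only nodes that can satisfy the whole request are candidates, so a
    # container never straddles two NUMA nodes.
    candidates = [
        (len(cpus), node)
        for node, cpus in free_by_node.items()
        if len(cpus) >= count
    ]
    if not candidates:
        return None

    # Best fit: the tightest node that still fits, to limit fragmentation.
    _, node = min(candidates)
    chosen = sorted(free_by_node[node])[:count]
    free_by_node[node] -= set(chosen)
    return chosen
```

The returned CPU ids would then be written to the container's cpuset so that no other workload shares those cores.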
2.3 Platform‑Level Containerization (e.g., MySQL)
For database workloads, Meituan applied exclusive CPU allocation, custom swap sizing, NUMA/cache disabling, and dedicated disk IOPS isolation. This yielded a 60-fold improvement in delivery efficiency and performance better than on bare metal.
2.4 Benefits after Migration
98% of company services containerized, improving resource efficiency and stability.
Kubernetes stability >99.99%.
Kubernetes adopted as the standard cluster management platform.
3. Operating Massive Kubernetes Clusters – Challenges and Solutions
3.1 Core Component Optimization
Early clusters ran Kubernetes 1.6 and suffered from poor scheduling performance and "avalanche" failures at 5,000 nodes. Optimizations were made to kube-apiserver (multi-level traffic control, fewer List calls), kube-scheduler (pre-selection and local-optimal strategies, now upstream), etcd (a separate cluster for events, high-performance SSDs), and the container layer (container reuse, pre-mounted disks).
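The scheduler optimization can be pictured as follows: rather than filtering and scoring every node in a 5,000-node cluster, stop once enough feasible nodes have been found and pick the best of that sample. The sketch below is an illustration under that assumption, similar in spirit to the upstream kube-scheduler `percentageOfNodesToScore` setting; the function and its parameters are hypothetical, not the actual scheduler code.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def schedule(
    nodes: Sequence[str],
    fits: Callable[[str], bool],
    score: Callable[[str], float],
    sample_size: int,
    start: int = 0,
) -> Tuple[Optional[str], int]:
    """Return (chosen node, start index for the next scheduling cycle).

    Pre-selection: scan nodes from `start` and stop as soon as `sample_size`
    feasible nodes are found. Local optimum: score only that sample.
    """
    total = len(nodes)
    feasible: List[str] = []
    checked = 0
    while checked < total and len(feasible) < sample_size:
        node = nodes[(start + checked) % total]
        checked += 1
        if fits(node):
            feasible.append(node)

    # Rotate the starting point so successive Pods sample different nodes.
    next_start = (start + checked) % total if total else 0
    if not feasible:
        return None, next_start
    return max(feasible, key=score), next_start
```

The trade-off is explicit: the chosen node may not be the global best, but the cost per Pod is bounded by the sample size instead of the cluster size.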
3.2 Platformization and Operational Efficiency
Meituan built an internal Kubernetes management platform that standardizes and visualizes operations, implements alarm self‑healing, automates inspections, and reduces manual error. Operational data drives fine‑grained scheduling and failure prediction.
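Alarm self-healing can be sketched as a table of automated remediations with a cap on retries, after which the alert goes to a human. Everything in the example below (the `SelfHealer` class, the alert names, the return values) is hypothetical and only illustrates the pattern; the source does not describe how Meituan's platform implements it.

```python
from collections import Counter
from typing import Callable, Dict, Tuple


class SelfHealer:
    """Run an automated remediation for known alerts, escalate otherwise."""

    def __init__(
        self, runbooks: Dict[str, Callable[[str], bool]], max_attempts: int = 2
    ) -> None:
        self.runbooks = runbooks          # alert name -> remediation(node) -> success
        self.max_attempts = max_attempts  # automated tries per (alert, node)
        self.attempts: Counter = Counter()

    def handle(self, alert: str, node: str) -> str:
        action = self.runbooks.get(alert)
        if action is None:
            return "escalate"  # no automated fix is known for this alert

        key: Tuple[str, str] = (alert, node)
        if self.attempts[key] >= self.max_attempts:
            return "escalate"  # automation keeps failing: hand over to a person

        self.attempts[key] += 1
        if action(node):
            self.attempts.pop(key, None)  # success resets the retry budget
            return "healed"
        return "retry"
```

Capping attempts matters at scale: a remediation that loops forever on thousands of nodes can itself become an incident.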
3.3 Risk Control and Reliability Assurance
A five‑layer risk control chain (metrics, alerts, tools, mechanisms & measures, personnel) is in place. Regular health checks, disaster‑recovery drills, and closed‑loop testing ensure high reliability.
4. Summary and Future Outlook
Key takeaways: stay compatible with upstream Kubernetes APIs, extend via plugins rather than core changes, adopt community features judiciously, and focus on user pain points. Future directions include unified scheduling for VMs and containers, VPA‑driven resource efficiency, broader cloud‑native application management, and extending cloud‑native architectures to middleware, storage, big‑data, and search services.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
