Meituan’s Migration from OpenStack to Kubernetes: Large‑Scale Cloud‑Native Infrastructure, Challenges and Practices

Meituan migrated its large‑scale cloud infrastructure from OpenStack to Kubernetes and containerized over 98% of its services. Along the way it built custom scheduling, NUMA‑aware placement, fine‑grained resource isolation, and an internal management platform. Together these raised stability above 99.99%, cut costs, and paved the way for unified VM‑container scheduling and broader cloud‑native workloads.

Meituan Technology Team

Aug 13, 2020

Kubernetes has become the core management engine of Meituan Cloud’s infrastructure, delivering efficient resource management, cost reduction, and a solid foundation for cloud‑native architectures such as Serverless and distributed databases.

1. Background and Current Status

Kubernetes is the de facto standard for large‑scale container orchestration. Meituan started building its cloud platform on virtualization in 2013, introduced a container platform (Hulk 1.0) in 2016, moved to the Kubernetes‑based Hulk 2.0 in 2018, and finished containerizing its entire infrastructure by the end of 2019. By 2020 the containerization rate exceeded 98%, spanning dozens of clusters, tens of thousands of nodes, and hundreds of thousands of Pods. For disaster‑recovery reasons, a single cluster is capped at 5,000 nodes.

2. Transition from OpenStack to Kubernetes – Obstacles and Benefits

During the OpenStack era Meituan faced several problems:

Complex architecture making operations difficult.

Inconsistent environments, since container images were not yet in use.

High resource overhead of virtualization (≈10% of host resources).

Long provisioning and reclamation cycles.

Severe resource waste, because capacity sized for traffic peaks sat idle off‑peak.

Hulk 1.0, built on top of OpenStack, alleviated many of these issues but introduced new ones: stability problems, capability gaps, limited scalability, and performance constraints.

By adopting native Kubernetes APIs in the new Hulk platform, Meituan decoupled application management from the control layer, leveraged Kubernetes’ powerful scheduling and resource management, and reduced operational costs while accelerating resource convergence.

2.1 Containerization Process and Challenges

Key challenges included:

Stability issues due to dual‑layer scheduling.

Limited capabilities and poor extensibility.

Poor scalability of the control plane.

Performance bottlenecks and interference caused by weak isolation.

To address these, Meituan introduced a strategy engine for custom scheduling policies, a reuse‑based container restart strategy, NUMA‑aware placement, and fine‑grained resource isolation for CPU, memory, and disk.

2.2 Advanced Scheduling and Operations

Meituan supports heterogeneous workloads (SSD, high‑memory, high‑IO, etc.) and custom dispersion strategies (e.g., rack, service dependencies). A policy engine allows applications to declare requirements via APPKEY, automatically tags Pods, and enforces the policies in Kubernetes.
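
As a rough illustration of the policy‑engine idea, the sketch below turns a per‑APPKEY policy into labels and scheduling constraints on a Pod manifest. It is not Meituan's implementation: the policy table, the `example.com/...` label keys, and the function name are all hypothetical. Only the Pod fields it writes (`nodeSelector`, `affinity.podAntiAffinity`, `topologyKey`) and the `kubernetes.io/hostname` label come from the standard Kubernetes API.

```python
from __future__ import annotations

import copy
from typing import Any

# Hypothetical policy table keyed by APPKEY (illustrative values only).
POLICIES: dict[str, dict[str, Any]] = {
    "com.example.orders": {"disk": "ssd", "spread_by": "rack"},
}

# Assumed mapping from a dispersion domain to a node label key.
TOPOLOGY_KEYS = {
    "rack": "example.com/rack",        # hypothetical custom node label
    "host": "kubernetes.io/hostname",  # well-known Kubernetes node label
}


def apply_policy(appkey: str, pod: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the Pod manifest tagged and constrained per the policy."""
    policy = POLICIES.get(appkey, {})
    out = copy.deepcopy(pod)

    # Tag the Pod so selectors can find every replica of the application.
    labels = out.setdefault("metadata", {}).setdefault("labels", {})
    labels["appkey"] = appkey

    spec = out.setdefault("spec", {})
    if "disk" in policy:
        # Heterogeneous hardware: only nodes carrying the matching label qualify.
        spec.setdefault("nodeSelector", {})["example.com/disk"] = policy["disk"]
    if "spread_by" in policy:
        # Dispersion: no two Pods of this APPKEY in the same topology domain.
        term = {
            "labelSelector": {"matchLabels": {"appkey": appkey}},
            "topologyKey": TOPOLOGY_KEYS[policy["spread_by"]],
        }
        anti = spec.setdefault("affinity", {}).setdefault("podAntiAffinity", {})
        anti.setdefault(
            "requiredDuringSchedulingIgnoredDuringExecution", []
        ).append(term)
    return out
```

In a real cluster this kind of rewrite would more likely sit in an admission webhook or in the platform's API layer, so that application owners never edit affinity rules by hand.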

Resource isolation is achieved through dedicated CPU sets, exclusive disk allocation, and per‑cluster resource pools, enabling precise control over performance‑sensitive services.
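
To make dedicated CPU sets and NUMA‑aware placement concrete, here is a minimal sketch under two assumptions: a container's exclusive CPUs should all come from one NUMA node, and the tightest‑fitting node is preferred to limit fragmentation. The function names and the free‑CPU map are hypothetical. The output string follows the Linux cpuset list format (for example `0-2,5`).

```python
from __future__ import annotations

from typing import Optional


def pick_cpuset(free_by_node: dict[int, list[int]], wanted: int) -> Optional[tuple[int, list[int]]]:
    """Choose `wanted` exclusive CPUs from a single NUMA node (best fit)."""
    # Keep only nodes that can satisfy the whole request without spanning nodes.
    candidates = [
        (len(cpus), node) for node, cpus in free_by_node.items() if len(cpus) >= wanted
    ]
    if not candidates:
        return None  # no single node fits; the caller decides how to fall back
    _, node = min(candidates)  # fewest free CPUs that still fit
    return node, sorted(free_by_node[node])[:wanted]


def to_cpuset_string(cpus: list[int]) -> str:
    """Render CPU ids as a cpuset list string, collapsing consecutive runs."""
    parts: list[str] = []
    run_start = prev = None
    for cpu in sorted(cpus):
        if prev is not None and cpu == prev + 1:
            prev = cpu  # extend the current run
            continue
        if run_start is not None:
            parts.append(str(run_start) if run_start == prev else f"{run_start}-{prev}")
        run_start = prev = cpu  # start a new run
    if run_start is not None:
        parts.append(str(run_start) if run_start == prev else f"{run_start}-{prev}")
    return ",".join(parts)
```

For instance, with free CPUs `{0: [0, 1, 2, 3], 1: [8, 9]}` a request for two CPUs lands on node 1, leaving node 0 intact for a larger request later.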

2.3 Platform‑Level Containerization (e.g., MySQL)

For database workloads, Meituan applied exclusive CPU allocation, custom swap sizing, NUMA/cache disabling, and dedicated disk IOPS isolation. The result was a 60‑fold improvement in delivery efficiency and better performance than on bare metal.

2.4 Benefits after Migration

98% of company services containerized, improving resource efficiency and stability.

Kubernetes stability >99.99%.

Kubernetes adopted as the standard cluster management platform.
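
If the 99.99% stability figure is read as availability (an assumption, since the article does not define the metric), it translates into a concrete downtime budget. The helper below is only illustrative arithmetic.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability: float, period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Maximum downtime allowed in the period at the given availability."""
    return (1.0 - availability) * period_minutes


# 99.99% availability leaves roughly 52.6 minutes of downtime per year.
print(round(downtime_budget_minutes(0.9999), 1))
```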

3. Operating Massive Kubernetes Clusters – Challenges and Solutions

3.1 Core Component Optimization

Early clusters ran Kubernetes 1.6 and suffered from poor scheduling performance and “avalanche” failures at 5,000 nodes. Optimizations targeted four areas: kube‑apiserver (multi‑level traffic control, fewer List calls), kube‑scheduler (pre‑selection and local‑optimal strategies, both now upstream), etcd (a separate cluster for events, high‑performance SSDs), and the container layer (container reuse, pre‑mounted disks).
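
One plausible reading of the scheduler change is sketched below: stop filtering once enough feasible nodes have been found, then pick the best node within that sample instead of scoring the whole cluster. This is similar in spirit to the upstream kube‑scheduler `percentageOfNodesToScore` setting, but the code is a simplified stand‑in, not the actual scheduler logic. The `Node` type, the `enough` parameter, and the scoring rule are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    free_cpu: float
    free_mem_gib: float


def schedule(nodes: list[Node], cpu: float, mem_gib: float, enough: int = 3) -> Optional[str]:
    """Pick a node for a Pod using early-exit filtering and a local optimum."""
    feasible: list[Node] = []
    for node in nodes:
        # Filtering (pre-selection): keep nodes with room for the request.
        if node.free_cpu >= cpu and node.free_mem_gib >= mem_gib:
            feasible.append(node)
            if len(feasible) >= enough:
                break  # enough candidates; skip the rest of the cluster
    if not feasible:
        return None
    # Local optimum: most headroom left after placement, within the sample only.
    best = max(feasible, key=lambda n: (n.free_cpu - cpu, n.free_mem_gib - mem_gib))
    return best.name
```

The trade‑off is explicit: a smaller `enough` shortens each scheduling cycle on a large cluster, at the cost of sometimes missing the globally best node.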

3.2 Platformization and Operational Efficiency

Meituan built an internal Kubernetes management platform that standardizes and visualizes operations, implements alarm self‑healing, automates inspections, and reduces manual error. Operational data drives fine‑grained scheduling and failure prediction.

3.3 Risk Control and Reliability Assurance

A five‑layer risk control chain (metrics, alerts, tools, mechanisms & measures, personnel) is in place. Regular health checks, disaster‑recovery drills, and closed‑loop testing ensure high reliability.

4. Summary and Future Outlook

Key takeaways: stay compatible with upstream Kubernetes APIs, extend via plugins rather than core changes, adopt community features judiciously, and focus on user pain points. Future directions include unified scheduling for VMs and containers, VPA‑driven resource efficiency, broader cloud‑native application management, and extending cloud‑native architectures to middleware, storage, big‑data, and search services.
