
Meituan-Dianping Kubernetes Cluster Management and Optimization Practices

Meituan‑Dianping’s evolution from virtualization to the HULK‑2.0 Kubernetes platform enables a 100,000‑instance, multi‑region cluster to achieve high elasticity and availability, using scheduler optimizations, local‑optimal placement, enhanced kubelet features, and fine‑grained resource management to maximize throughput during traffic spikes.

Meituan Technology Team

Meituan-Dianping, a leading lifestyle service platform in China, sees pronounced traffic peaks during holidays and promotions, which places high demands on the elasticity and availability of its cluster resources. The goal is to maximize throughput with limited resources.

The article introduces Meituan-Dianping's Kubernetes cluster management and usage practices, covering the evolution of its internal cluster management and scheduling system (HULK), Kubernetes management, optimization, and resource management.

HULK Cluster Management and Scheduling System

Since 2013, Meituan-Dianping progressed from traditional virtualization to Docker-based elastic scaling (HULK 1.0) and later adopted Kubernetes (HULK 2.0) to improve efficiency and reduce costs.

HULK 2.0 decouples business layers from the underlying Kubernetes platform via a unified HULK API, while remaining compatible with native Kubernetes APIs.

Kubernetes Management and Practice

Kubernetes was chosen for its robust architecture and extensibility, allowing Meituan-Dianping to build a platform that supports rapid deployment, dynamic scaling, and better resource allocation.

Current cluster scale exceeds 100,000 online instances across multiple regions, with features such as automated business monitoring, health alerts, periodic inspections, visualized metrics, and capacity planning.

Kube‑Scheduler Performance Optimization

On Kubernetes 1.6, scheduler throughput became the bottleneck once a cluster reached roughly 3,000 nodes: scheduling a single pod took about 5 seconds. Subsequent optimizations improved scheduler performance by more than 400%.

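A new "pre-filter abort" mechanism short-circuits predicate evaluation: once a node fails one predicate, the scheduler skips the remaining predicates for that node. A single failure already rules the node out, so the skipped checks cannot change the result, and scheduling gets faster.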

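The optimization was contributed to the Kubernetes community (PR #56926), which added the alwaysCheckAllPredicates policy option. Since Kubernetes 1.10, stopping at the first failed predicate is the default behavior; setting alwaysCheckAllPredicates to true restores evaluation of every predicate.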
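The idea of stopping at the first failed predicate can be illustrated with a minimal sketch. This is not the scheduler's actual code: the predicate functions, the dictionary fields, and node_fits are hypothetical names chosen for illustration, and only the always_check_all flag is modeled on the alwaysCheckAllPredicates option mentioned above.

```python
from typing import Callable, Dict, List, Tuple

Pod = Dict[str, int]
Node = Dict[str, int]
Predicate = Callable[[Pod, Node], bool]


# Hypothetical predicates; a real scheduler has many more.
def fits_cpu(pod: Pod, node: Node) -> bool:
    return pod["cpu"] <= node["free_cpu"]


def fits_memory(pod: Pod, node: Node) -> bool:
    return pod["mem"] <= node["free_mem"]


def has_free_port(pod: Pod, node: Node) -> bool:
    return node["free_ports"] > 0


def node_fits(
    pod: Pod,
    node: Node,
    predicates: List[Predicate],
    always_check_all: bool = False,
) -> Tuple[bool, int]:
    """Return (fits, number of predicates actually evaluated)."""
    fits = True
    evaluated = 0
    for predicate in predicates:
        evaluated += 1
        if not predicate(pod, node):
            fits = False
            if not always_check_all:
                # The node is already ruled out, so skip the rest.
                break
    return fits, evaluated


PREDICATES: List[Predicate] = [fits_cpu, fits_memory, has_free_port]
```

With short-circuiting, a node that fails the first check costs one predicate evaluation instead of three. Across thousands of nodes per pod, that difference is where the saving would come from.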

Local‑Optimal Scheduling

Instead of exhaustive Best‑Fit traversal, the system selects a subset of candidate nodes (e.g., 100 out of 1,000) and chooses the highest‑scoring node within that subset, achieving comparable results with far less computation.
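The subset-then-score approach described above can be sketched as follows. The article does not say how the candidate subset is chosen or how nodes are scored, so the random sampling, the CPU-only best-fit score, and the names score and pick_node are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Optional

Node = Dict[str, object]


def score(pod_cpu: int, node: Node) -> float:
    """Hypothetical best-fit score: CPU utilization after placing the pod."""
    used_after = node["total_cpu"] - node["free_cpu"] + pod_cpu
    return used_after / node["total_cpu"]


def pick_node(
    pod_cpu: int,
    nodes: List[Node],
    sample_size: int,
    rng: random.Random,
) -> Optional[Node]:
    """Score only a sampled subset of nodes and return the best one in it."""
    # Assumption: the subset is a uniform random sample of the cluster.
    candidates = rng.sample(nodes, min(sample_size, len(nodes)))
    feasible = [n for n in candidates if n["free_cpu"] >= pod_cpu]
    if not feasible:
        return None
    # The best node within the subset, not necessarily the global best.
    return max(feasible, key=lambda n: score(pod_cpu, n))


# Example: 1,000 synthetic nodes, scoring only 100 of them per pod.
nodes = [
    {"name": f"node-{i}", "free_cpu": i % 16, "total_cpu": 16}
    for i in range(1000)
]
chosen = pick_node(4, nodes, sample_size=100, rng=random.Random(0))
```

The trade-off is that a sampled subset can miss the single best node in the cluster, but the work per pod is bounded by the sample size instead of the cluster size.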

Kubelet Enhancements

Risk Control: Added policies to limit Kubelet‑initiated evictions.

Restart Strategies: Implemented “Reuse” and “Rebuild” strategies to preserve container state across host reboots.

IP Retention: Developed a custom CNI plugin that reuses pod IPs after migration or restart.

Resource Allocation: Enabled NUMA binding, CPU‑Set, and fine‑grained limits (ulimit, I/O, PID, swap).

In‑Place Upgrade: Allowed application updates without pod recreation, preserving host identity.
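Of the enhancements listed above, IP retention is the easiest to picture in code. The article does not describe how the custom CNI plugin works internally, so the following is only a sketch of one plausible approach: an allocator that keys reservations by a stable pod identity. The class name, method names, and key format are hypothetical.

```python
from typing import Dict, List


class StickyIPAllocator:
    """Hands out IPs from a pool and remembers them per pod identity."""

    def __init__(self, pool: List[str]) -> None:
        self._free: List[str] = list(pool)
        self._by_pod: Dict[str, str] = {}

    def allocate(self, pod_key: str) -> str:
        # A pod that restarts or migrates presents the same key
        # and gets its previous address back.
        if pod_key in self._by_pod:
            return self._by_pod[pod_key]
        if not self._free:
            raise RuntimeError("IP pool exhausted")
        ip = self._free.pop(0)
        self._by_pod[pod_key] = ip
        return ip

    def release(self, pod_key: str) -> None:
        # Called only when the pod is deleted for good.
        ip = self._by_pod.pop(pod_key, None)
        if ip is not None:
            self._free.append(ip)
```

The key design point in this sketch is that the reservation is dropped on permanent deletion, not on every container teardown, so a restart does not return the address to the pool.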

Resource Management and Optimization

Key techniques include service profiling, affinity/anti‑affinity analysis, scenario‑based priority, elastic scaling, and fine‑grained resource partitioning (NUMA, CPU‑Set).

Strategic optimizations address affinity, anti‑affinity, application priority, dispersion, isolation, and special resources (GPU, SSD, NIC).

Online Cluster Optimization

For online clusters, the team applies NUMA binding, CPU‑Set, application staggering, rescheduling, and interference analysis to improve performance and SLA.
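NUMA binding combined with CPU-Set means that all of a container's pinned cores come from the same NUMA node, so memory accesses stay local. The sketch below shows one way to pick such a core set. The tightest-fit selection rule, the data layout, and the function name allocate_cpuset are assumptions for illustration, not the team's implementation.

```python
from typing import Dict, List, Optional, Tuple


def allocate_cpuset(
    free_cpus_by_numa: Dict[int, List[int]],
    requested: int,
) -> Optional[Tuple[int, List[int]]]:
    """Pick `requested` cores from a single NUMA node.

    Returns (numa_id, core_ids) and removes the cores from the free map,
    or returns None if no single NUMA node has enough free cores.
    """
    fitting = [
        (len(cpus), numa_id)
        for numa_id, cpus in free_cpus_by_numa.items()
        if len(cpus) >= requested
    ]
    if not fitting:
        return None
    # Assumption: prefer the NUMA node with the fewest free cores that
    # still fits, which keeps larger nodes open for larger requests.
    _, numa_id = min(fitting)
    chosen = sorted(free_cpus_by_numa[numa_id])[:requested]
    free_cpus_by_numa[numa_id] = [
        c for c in free_cpus_by_numa[numa_id] if c not in chosen
    ]
    return numa_id, chosen
```

The returned core list is what would then be written into the container's cpuset. This sketch refuses requests that would span NUMA nodes, whereas a production allocator would need a fallback policy for them.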

Conclusion

Future work focuses on hybrid online‑offline deployment, intelligent scheduling with traffic and resource awareness, and high‑performance, strongly isolated, secure container technologies.

Author: Guo Liang, Senior Engineer, Meituan‑Dianping Cluster Scheduling Center.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
