
Meituan-Dianping Kubernetes Cluster Management and Optimization Practices

This article details Meituan-Dianping's evolution from custom Docker‑based scaling to a Kubernetes‑driven, cloud‑native cluster management platform (HULK), describing its architecture, scheduler enhancements, Kubelet modifications, and resource‑optimization strategies for large‑scale operations.

Big Data Technology & Architecture

Background: Meituan‑Dianping, a leading Chinese lifestyle service platform, experiences extreme traffic peaks during holidays and promotions, requiring highly elastic and cost‑effective cluster resources to maintain user experience.

The company introduced a custom cluster management and scheduling system called HULK, evolving from Docker‑based elastic scaling (HULK 1.0) to a Kubernetes‑integrated solution (HULK 2.0) to improve resource utilization and operational efficiency.

Architecture Overview

HULK abstracts business‑level resource requests through a unified HULK API, decoupling the upper‑layer services from the underlying Kubernetes platform while preserving compatibility with native Kubernetes APIs.

Why Kubernetes? The platform offers a robust, extensible foundation for large‑scale, elastic workloads, enabling rapid deployment, dynamic scaling, and sophisticated scheduling policies.

Cluster Operations Status: Over 100,000 online instances across multiple regions, with comprehensive monitoring, health alerts, automated inspections, visual dashboards, and capacity planning powered by rule‑based and machine‑learning predictions.

Kube‑Scheduler Optimizations: Upgraded the scheduler from Kubernetes 1.6 to 1.10+, yielding a more than 4× performance gain; introduced the alwaysCheckAllPredicates option so that predicate evaluation can abort at the first failing check, reducing scheduling latency by up to 40%.
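The effect of aborting predicate evaluation early can be illustrated with a toy filter loop; the node fields and predicates below are illustrative, not the scheduler's actual code:

```python
def feasible_nodes(nodes, predicates, check_all=False):
    """Return the nodes that pass all predicates.

    With check_all=False (early abort), evaluation of a node stops at
    its first failing predicate, saving the cost of the remaining checks.
    """
    feasible = []
    for node in nodes:
        ok = True
        for pred in predicates:
            if not pred(node):
                ok = False
                if not check_all:
                    break  # early abort: skip the remaining predicates
        if ok:
            feasible.append(node)
    return feasible

# Illustrative predicates: enough free CPU and memory for the pod
has_cpu = lambda n: n["free_cpu"] >= 2
has_mem = lambda n: n["free_mem"] >= 4

nodes = [
    {"name": "node-1", "free_cpu": 4, "free_mem": 8},
    {"name": "node-2", "free_cpu": 1, "free_mem": 16},  # fails the CPU check
]
print([n["name"] for n in feasible_nodes(nodes, [has_cpu, has_mem])])
# → ['node-1']
```

Both modes return the same feasible set; early abort only changes how much work is spent on nodes that are going to be rejected anyway.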

Local‑Optimal Scheduling: Adopted a “best‑of‑N” approach (e.g., evaluating 100 candidate nodes instead of all 1000) to obtain near‑optimal placement with significantly lower computation.
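A minimal sketch of the best‑of‑N idea, assuming a plain node list and a made‑up scoring function (neither is Meituan's actual implementation):

```python
import random

def best_of_n(nodes, score, sample_size=100):
    """Score only a random sample of candidate nodes and return the best.

    Trades a near-optimal placement for far less scoring work when the
    cluster is much larger than sample_size.
    """
    if len(nodes) <= sample_size:
        candidates = nodes
    else:
        candidates = random.sample(nodes, sample_size)
    return max(candidates, key=score)

# Illustrative cluster: 1000 nodes, scored by free CPU
nodes = [{"name": f"node-{i}", "free_cpu": i % 17} for i in range(1000)]
best = best_of_n(nodes, score=lambda n: n["free_cpu"], sample_size=100)
print(best["free_cpu"])  # high, though not guaranteed to be the global max
```

Scoring 100 nodes instead of 1000 cuts the scoring cost by 10×, and with many equally good nodes in a large cluster the sampled winner is usually close to the true optimum.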

Kubelet Enhancements: Added a “Reuse” restart policy alongside the native “Rebuild” policy, implemented a custom CNI plugin for IP reuse, and constrained aggressive eviction to improve risk controllability.
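A hedged sketch of what distinguishes the two restart policies; the field names and the IP‑allocation helper are hypothetical stand‑ins, not the modified Kubelet's API:

```python
REBUILD, REUSE = "Rebuild", "Reuse"

def allocate_new_ip():
    """Placeholder for a fresh allocation from the CNI plugin."""
    return "10.0.0.99"

def restart_container(container, policy=REBUILD):
    """Illustrate the two restart behaviors.

    Rebuild: discard the old sandbox, so the container comes back with a
    new IP and empty local state. Reuse: keep the previous IP and local
    data, so the restarted instance looks unchanged to downstream services.
    """
    if policy == REUSE:
        return {"ip": container["ip"],
                "local_data": container["local_data"],
                "restarts": container["restarts"] + 1}
    return {"ip": allocate_new_ip(),
            "local_data": {},
            "restarts": container["restarts"] + 1}

old = {"ip": "10.0.0.7", "local_data": {"cache": "warm"}, "restarts": 0}
print(restart_container(old, policy=REUSE)["ip"])    # → 10.0.0.7
print(restart_container(old, policy=REBUILD)["ip"])  # → 10.0.0.99
```

The IP‑reuse half of this sketch is where a custom CNI plugin comes in: the address must survive the container's sandbox being torn down and recreated.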

Resource Management & Optimization: Employed service profiling, affinity/anti‑affinity rules, scenario‑based priorities, elastic scaling, and fine‑grained resource allocation (NUMA binding, CPU sets); optimized image distribution through cross‑site synchronization, base‑image pre‑distribution, and P2P sharing; and applied online‑cluster techniques such as application‑level throttling and re‑scheduling.
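The NUMA‑binding idea can be sketched as a first‑fit search over per‑node free CPU lists; the topology layout here is illustrative, not Kubernetes' actual CPU‑manager state:

```python
def bind_cpuset(numa_topology, requested):
    """Pick a CPU set for a container, preferring a single NUMA node.

    numa_topology maps a NUMA node id to its list of free CPU ids.
    Returns (numa_node, cpu_ids) when one node can satisfy the request,
    or None when the CPUs would have to be spread across nodes.
    """
    for node_id, free_cpus in sorted(numa_topology.items()):
        if len(free_cpus) >= requested:
            return node_id, free_cpus[:requested]
    return None  # fall back to cross-node placement (not sketched here)

# Illustrative topology: node 0 has 2 free CPUs, node 1 has 4
topology = {0: [0, 1], 1: [4, 5, 6, 7]}
print(bind_cpuset(topology, 4))  # → (1, [4, 5, 6, 7])
```

Keeping a container's CPUs on one NUMA node avoids cross‑node memory access, which is the main latency win fine‑grained allocation is after.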

Conclusion: Ongoing efforts focus on hybrid online‑offline deployment, intelligent scheduling, and high‑performance, secure container technologies to further boost resource efficiency and service SLA.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Cloud Native · Performance Optimization · Kubernetes · Resource Scheduling · Cluster Management · Meituan-Dianping
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
