How Meituan Optimized Kubernetes at Scale: Lessons from HULK2.0
This article details Meituan‑Dianping's evolution from a custom Docker‑based cluster manager to the open‑source Kubernetes‑powered HULK2.0 platform, describing its architecture, operational practices, scheduler and Kubelet optimizations, and resource‑management techniques that enable massive, cost‑effective scaling.
Background
Meituan‑Dianping, a leading Chinese life‑service platform, experiences pronounced traffic peaks during holidays and promotions, demanding highly elastic and available clusters while keeping operational costs under control.
The article introduces the company's Kubernetes cluster management practices, covering the evolution of its internal scheduling system (HULK) and subsequent optimizations.
Meituan‑Dianping Cluster Management and Scheduling System
Since 2013 the company built a virtualization‑based resource delivery model, launching the HULK system in 2015 to drive containerization. By 2016 a self‑developed Docker‑based elastic scaling solution improved delivery speed and reduced IT costs. In 2018 the platform migrated to Kubernetes, creating HULK2.0.
Architecture Overview
HULK2.0 decouples business layers from the underlying Kubernetes platform, exposing a unified HULK API that abstracts resource requests, while remaining compatible with native Kubernetes APIs.
Why Kubernetes?
Kubernetes offers a platform rather than a single solution, providing extensibility, mature ecosystem support, and flexible resource allocation, which aligns with Meituan‑Dianping’s need for rapid scaling and cost efficiency.
Cluster Operation Status
Scale: over 100,000 online instances across multiple regions.
Monitoring & alerts for applications, nodes, pods, and containers.
Automated health checks, daily host inspections, and resource visualizations.
Capacity planning using rule‑based and machine‑learning predictions.
Kubernetes Optimization and Refactoring
Kube‑Scheduler Performance Optimization
With the 1.6 scheduler, each pod took about 5 seconds to schedule in a 3,000‑node cluster. Upgrading to a newer scheduler version removed that delay and improved scheduling performance by more than 400%.
Pre‑filter Abort Mechanism
Introducing an early‑exit strategy during the predicate phase stops evaluating a node once any pre‑filter condition fails, dramatically reducing computation and boosting scheduler throughput.
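This change was contributed to the Kubernetes community and became the default scheduling behavior in version 1.10. The alwaysCheckAllPredicates policy option was added alongside it so that operators can restore the original behavior of evaluating every predicate.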
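The early-exit idea can be illustrated with a minimal sketch. This is not the scheduler's actual code: the predicate functions, the dictionary-based pod and node shapes, and the check_all flag are hypothetical names chosen for illustration, with check_all loosely playing the role of the "evaluate everything" switch.

```python
from typing import Callable, Dict, Sequence, Tuple

Pod = Dict[str, int]
Node = Dict[str, int]
Predicate = Callable[[Pod, Node], bool]


def enough_cpu(pod: Pod, node: Node) -> bool:
    # Hypothetical predicate: the node must have enough free CPU.
    return node["free_cpu"] >= pod["cpu"]


def enough_memory(pod: Pod, node: Node) -> bool:
    # Hypothetical predicate: the node must have enough free memory.
    return node["free_mem"] >= pod["mem"]


def node_fits(
    pod: Pod,
    node: Node,
    predicates: Sequence[Predicate],
    check_all: bool = False,
) -> Tuple[bool, int]:
    """Return (fits, number_of_predicates_evaluated)."""
    fits = True
    evaluated = 0
    for predicate in predicates:
        evaluated += 1
        if not predicate(pod, node):
            fits = False
            if not check_all:
                # Early exit: one failure already rules the node out.
                break
    return fits, evaluated
```

With many predicates and thousands of nodes, skipping the remaining checks after the first failure is where the saved computation comes from. Evaluating everything is still useful when the full list of failure reasons is wanted for diagnostics.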
Local‑Optimal Scheduling
Instead of running an exhaustive BestFit search across all nodes, the platform selects a subset (e.g., 100 nodes) and chooses the highest‑scoring node within that subset, achieving comparable placement quality with far less computation.
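A minimal sketch of the subset idea follows. The source does not say how the subset is chosen, so uniform random sampling is an assumption here, and pick_node, score, and sample_size are hypothetical names.

```python
import random
from typing import Callable, Optional, Sequence, TypeVar

N = TypeVar("N")


def pick_node(
    nodes: Sequence[N],
    score: Callable[[N], float],
    sample_size: int = 100,
    rng: Optional[random.Random] = None,
) -> Optional[N]:
    """Score only a bounded subset of nodes and return the best one in it."""
    if not nodes:
        return None
    rng = rng or random.Random()
    # Assumption: the subset is a uniform random sample of the candidates.
    k = min(sample_size, len(nodes))
    candidates = rng.sample(list(nodes), k)
    # Local optimum: best score within the subset, not across the cluster.
    return max(candidates, key=score)
```

The scoring cost is bounded by sample_size rather than by cluster size, at the price of occasionally missing the globally best node.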
Kubelet Refactoring
Risk Control
The team limited Kubelet’s autonomous eviction and restart behaviors, adding a reusable restart strategy that preserves container data across host reboots.
IP Retention
A custom CNI plugin reuses pod IPs after migration or host restart, improving stability.
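One way to picture IP retention is an allocator that remembers which address a pod identity was given and hands the same one back on the next request. The sketch below is an in-memory illustration only, not the CNI plugin described above: StickyIPAllocator and the pod_key format are hypothetical, and a real plugin would have to persist the mapping outside the host.

```python
import ipaddress
from typing import Dict, List


class StickyIPAllocator:
    """Hands out addresses from a pool and reuses them per pod identity."""

    def __init__(self, cidr: str) -> None:
        # Usable host addresses of the given network, in ascending order.
        self._free: List[str] = [
            str(ip) for ip in ipaddress.ip_network(cidr).hosts()
        ]
        self._by_pod: Dict[str, str] = {}

    def allocate(self, pod_key: str) -> str:
        # A known pod (e.g. after migration or host restart) keeps its IP.
        if pod_key in self._by_pod:
            return self._by_pod[pod_key]
        if not self._free:
            raise RuntimeError("address pool exhausted")
        ip = self._free.pop(0)
        self._by_pod[pod_key] = ip
        return ip

    def release(self, pod_key: str) -> None:
        # Only an explicit release returns the address to the pool.
        ip = self._by_pod.pop(pod_key, None)
        if ip is not None:
            self._free.append(ip)
```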
Scalability Enhancements
Features added include NUMA binding, CPU share adjustments, CPUSet assignments, and extended container limits (ulimit, I/O, PID, swap).
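As an illustration of NUMA-aware CPUSet assignment, the sketch below picks all requested CPUs from a single NUMA node and formats them as a comma-separated CPU list. The best-fit choice of NUMA node, the pick_cpuset name, and the input shape are assumptions for illustration, not the platform's actual policy.

```python
from typing import Dict, List, Optional, Tuple


def pick_cpuset(
    free_cpus_by_numa: Dict[int, List[int]], wanted: int
) -> Optional[Tuple[int, str]]:
    """Return (numa_node, cpu_list) with all CPUs on one NUMA node, or None."""
    # Keep only NUMA nodes that can satisfy the whole request on their own.
    fitting = [
        (len(cpus), numa)
        for numa, cpus in free_cpus_by_numa.items()
        if len(cpus) >= wanted
    ]
    if not fitting:
        return None
    # Assumption: best fit, i.e. the node with the fewest free CPUs that
    # still fits, to leave larger NUMA nodes available for larger requests.
    _, numa = min(fitting)
    chosen = sorted(free_cpus_by_numa[numa])[:wanted]
    return numa, ",".join(str(cpu) for cpu in chosen)
```

Keeping a container's CPUs on one NUMA node avoids memory accesses that cross to another node's memory controller.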
In‑Place Application Upgrade
Implemented a mechanism to modify pod specifications (e.g., image) without recreating the pod, avoiding IP/hostname changes and reducing disruption.
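The decision behind an in-place upgrade can be sketched as a comparison of the old and new pod specifications: if only container images differ, the containers can be restarted inside the existing pod, otherwise the pod is recreated. This is a simplified illustration with plain dictionaries. plan_update and its return values are hypothetical, and the source does not describe which fields its mechanism actually allows to change beyond the image example.

```python
import copy
from typing import Any, Dict

Spec = Dict[str, Any]


def _without_images(spec: Spec) -> Spec:
    # Copy the spec and drop every container image so that two specs
    # can be compared on everything except the image field.
    stripped = copy.deepcopy(spec)
    for container in stripped.get("containers", []):
        container.pop("image", None)
    return stripped


def plan_update(old_spec: Spec, new_spec: Spec) -> str:
    """Classify a spec change as 'noop', 'in_place', or 'recreate'."""
    if old_spec == new_spec:
        return "noop"
    if _without_images(old_spec) == _without_images(new_spec):
        # Only images changed: the pod, its IP, and hostname can be kept.
        return "in_place"
    return "recreate"
```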
Image Distribution Optimizations
Cross‑site synchronization for nearby image pulls.
Pre‑distribution of base images to all servers.
P2P image sharing to alleviate registry load.
Resource Management and Optimization
Key Techniques
Service profiling for CPU, memory, network, and I/O usage.
Affinity and anti‑affinity rules to co‑locate complementary workloads and keep conflicting ones apart.
Scenario‑based priority (e.g., latency‑sensitive services).
Elastic scaling with rule‑based and ML‑driven policies.
Fine‑grained resource allocation (NUMA, CPUSet, etc.).
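A rule-based scaling policy can be as simple as a proportional rule. The sketch below uses the same proportional formula as the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × observed / target)), swapped in here as an illustration. The source does not state which rules HULK2.0 applies, and the function name, percentage units, and bounds are assumptions.

```python
import math


def desired_replicas(
    current: int,
    observed_util_pct: int,
    target_util_pct: int,
    min_replicas: int = 1,
    max_replicas: int = 100,
) -> int:
    """Proportional scaling rule, clamped to [min_replicas, max_replicas]."""
    if target_util_pct <= 0:
        raise ValueError("target utilization must be positive")
    # Scale the replica count by how far utilization is from the target.
    raw = math.ceil(current * observed_util_pct / target_util_pct)
    return max(min_replicas, min(max_replicas, raw))
```

For example, 10 replicas running at 90% utilization against a 60% target scale out to 15, while 4 replicas at 30% against the same target scale in to 2.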
Strategy Optimization
Affinity and anti‑affinity constraints.
Application priority levels for resource contention.
Dispersal across hosts, racks, zones for fault tolerance.
Isolation for exclusive workloads.
Special resource handling for GPU, SSD, NICs.
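Dispersal across hosts, racks, and zones can be sketched as a penalty for each already-placed replica that shares a failure domain with the candidate node. The weights, field names, and function names below are hypothetical, and host and rack names are assumed to be unique across the cluster.

```python
from typing import Dict, Sequence

Node = Dict[str, str]

# Hypothetical weights: sharing a host is worse than sharing a rack,
# which is worse than sharing a zone.
WEIGHTS = {"zone": 1, "rack": 2, "host": 4}


def colocation_penalty(candidate: Node, placed: Sequence[Node]) -> int:
    """Sum weighted counts of placed replicas in the candidate's domains."""
    return sum(
        weight * sum(1 for p in placed if p[level] == candidate[level])
        for level, weight in WEIGHTS.items()
    )


def most_dispersed(candidates: Sequence[Node], placed: Sequence[Node]) -> Node:
    """Pick the candidate that shares the fewest failure domains."""
    return min(candidates, key=lambda node: colocation_penalty(node, placed))
```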
Online Cluster Optimization
NUMA binding to reduce memory‑access latency across NUMA nodes.
CPUSet grouping of complementary applications.
Staggered workload peaks based on service profiles.
Rescheduling to improve placement and reduce fragmentation.
Interference analysis using monitoring metrics.
Conclusion
Meituan‑Dianping continues to explore mixed online‑offline deployments, intelligent scheduling aware of traffic and resource usage, and high‑performance, strongly isolated, secure container technologies.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.