Meituan HULK: Cloud‑Native Container Cluster Management and Scheduling Practices
Meituan’s HULK platform evolved from an OpenStack‑based scheduler to a Kubernetes‑native container cluster manager, integrating service governance, release, CMDB, and monitoring to automate VM‑to‑container migration, improve resource utilization, and deliver elastic, policy‑driven scheduling and scaling with reduced troubleshooting time and higher SLA compliance.
Meituan’s HULK platform is a container‑based cluster management system that evolved from an OpenStack‑based scheduler (HULK 1.0) to a Kubernetes‑native solution (HULK 2.0). The platform integrates service governance, release, testing, CMDB, and monitoring systems, enabling seamless migration from VMs to containers.
Key motivations include inconsistent environment configurations, long provisioning cycles, and low resource utilization during off‑peak periods. HULK addresses these with a unified container runtime, elastic scheduling, and a one‑stop operations platform.
Architecture : The top layer connects to various internal platforms (service governance, release, CMDB, monitoring). The middle layer implements elastic container provisioning, service profiling, and Kubernetes orchestration. The bottom layer runs the HULK Agent on each node.
Scheduling pain points & solutions :
High troubleshooting cost – introduced a TaskId tracing mechanism integrated with the internal log center and visualized in HULK Portal, reducing diagnosis time from ~30 min to minutes.
Custom business requirements – built a unified policy configuration center to generate Kubernetes‑compatible manifests, freeing operators from manual per‑service tweaks.
Scheduler performance – optimized predicate filtering by short‑circuiting: once a node fails one predicate, the remaining predicates for that node are skipped. This reduced scheduling latency by 40% and was adopted as the default behavior in Kubernetes 1.10.
Resource contention – added service profiling to prioritize high‑SLA workloads during contention.
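The predicate short‑circuiting from the scheduler performance item can be illustrated with a small sketch. This is not HULK or Kubernetes code: the Node and Pod types, the three predicates, and the filter_nodes function are hypothetical names chosen for illustration. It only shows why stopping at the first failed predicate lowers the number of checks per scheduling cycle.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Node:
    name: str
    free_cpu: float       # free CPU cores (hypothetical field)
    free_mem_gb: float    # free memory in GB (hypothetical field)
    schedulable: bool = True


@dataclass
class Pod:
    cpu: float
    mem_gb: float


# A predicate returns True when the node can host the pod.
Predicate = Callable[[Pod, Node], bool]


def is_schedulable(pod: Pod, node: Node) -> bool:
    return node.schedulable


def has_cpu(pod: Pod, node: Node) -> bool:
    return node.free_cpu >= pod.cpu


def has_mem(pod: Pod, node: Node) -> bool:
    return node.free_mem_gb >= pod.mem_gb


# Illustrative predicate chain; a real scheduler has many more checks.
PREDICATES: List[Predicate] = [is_schedulable, has_cpu, has_mem]


def filter_nodes(pod: Pod, nodes: List[Node],
                 short_circuit: bool = True) -> Tuple[List[str], int]:
    """Return (feasible node names, number of predicate evaluations)."""
    feasible: List[str] = []
    evaluations = 0
    for node in nodes:
        ok = True
        for pred in PREDICATES:
            evaluations += 1
            if not pred(pod, node):
                ok = False
                if short_circuit:
                    break  # node is already infeasible, skip the rest
        if ok:
            feasible.append(node.name)
    return feasible, evaluations


if __name__ == "__main__":
    nodes = [
        Node("a", free_cpu=8, free_mem_gb=32),
        Node("b", free_cpu=1, free_mem_gb=32),
        Node("c", free_cpu=8, free_mem_gb=32, schedulable=False),
        Node("d", free_cpu=8, free_mem_gb=2),
    ]
    pod = Pod(cpu=2, mem_gb=4)
    print(filter_nodes(pod, nodes, short_circuit=True))
    print(filter_nodes(pod, nodes, short_circuit=False))
```

Both modes select the same feasible nodes; only the number of predicate evaluations differs. The saving grows with the share of nodes that fail an early predicate, which is why cheap and frequently failing checks are usually placed first in the chain.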
Elastic scaling pain points & solutions :
Inconsistent decisions from multiple policies – introduced an aggregation layer with default and weighted strategies to eliminate contradictory scaling actions.
Non‑idempotent scaling – switched to target‑based scaling (e.g., “scale to 20 instances”) to avoid duplicate expansions.
Multiple version deployments – enforced stable, gray‑released images for scaling to prevent unstable builds from being promoted.
Resource guarantee – implemented water‑level detection and predictive resource estimation to ensure new services do not starve existing ones.
End‑to‑end latency – upgraded monitoring to per‑second granularity and explored predictive scaling based on behavior data.
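Two of the ideas in this list, policy aggregation and target‑based scaling, can be sketched together. The source does not say how HULK combines weights, so this sketch assumes a simple rule: the proposal with the highest weight wins, ties go to the larger target, and with no proposals the current count is kept as the default. The names Proposal, aggregate, and reconcile are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Proposal:
    policy: str    # name of the policy that produced this proposal
    target: int    # desired absolute instance count, not a delta
    weight: float  # hypothetical priority weight


def aggregate(proposals: List[Proposal], current: int,
              min_instances: int = 1, max_instances: int = 100) -> int:
    """Collapse possibly contradictory proposals into one target.

    Assumed rule: highest weight wins, ties go to the larger target.
    Default strategy: with no proposals, keep the current count.
    """
    if not proposals:
        return current
    best = max(proposals, key=lambda p: (p.weight, p.target))
    # Clamp to the allowed range for the service.
    return max(min_instances, min(max_instances, best.target))


def reconcile(current: int, target: int) -> int:
    """Return how many instances to add (positive) or remove (negative)."""
    return target - current


if __name__ == "__main__":
    current = 15
    proposals = [
        Proposal("cpu-threshold", target=20, weight=1.0),
        Proposal("scheduled-shrink", target=12, weight=0.5),
    ]
    target = aggregate(proposals, current)

    # Applying the same "scale to target" decision twice is harmless:
    # the second pass computes a delta of zero.
    for attempt in (1, 2):
        delta = reconcile(current, target)
        current += delta
        print(f"attempt {attempt}: delta={delta}, instances={current}")
```

Because the decision is expressed as an absolute target, a retried or duplicated request converges on the same instance count. A delta‑style request such as "add 5 instances" would instead expand the service twice.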
Experience summary :
Technical: native Kubernetes must be integrated with internal systems for practical adoption; new workloads are scheduled under the new policies, while existing workloads are brought in line through rescheduling.
Operational: private‑cloud elastic scaling requires SLA guarantees for success rate and latency.
Business: automated migration from VM to containers and higher resource utilization significantly reduce operational cost.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
