
Design and Implementation of the HULK Container Platform Scheduling System

The HULK Container Platform scheduling system, built for Meituan‑Dianping, combines a hybrid, actor‑based scheduler with filter‑and‑rank logic, configurable trade‑offs, and dynamic over‑commit to balance resource utilization, high availability, and massive concurrent placement decisions for thousands of containerized services.

Meituan Technology Team

May 12, 2017


Background – Meituan‑Dianping, China's largest O2O (online‑to‑offline) platform, experiences extreme traffic fluctuations, especially during holidays and promotions. Traditional VM‑based deployment struggled with slow instance creation, cumbersome configuration changes, difficult resource reclamation, and low utilization during off‑peak periods.

To address these issues, the team adopted Docker containers for elastic scaling, which provide OS‑level isolation, fast startup, and fine‑grained resource control.

HULK Container Platform Overview – Launched in mid‑2015, the HULK project is Meituan‑Dianping’s company‑wide container cluster management and elastic scaling platform. Its goal is to containerize services, automate scaling, improve resource efficiency, and reduce operational costs.

The platform’s name references the Marvel “Hulk” superhero, symbolizing the desired robustness and elasticity of the services after integration.

Scheduling System Role – The scheduler is the core module that uniformly allocates resources from a shared pool. Its responsibilities include handling resource requests from upper‑level scaling modules, applying multi‑objective scheduling algorithms, and interfacing with the underlying IaaS layer.

Key Metrics – The design focuses on three primary indicators:

Resource utilization – aiming for 30%‑70% overall cluster usage.

Business optimization – ensuring high availability and stable inter‑service communication.

Concurrent scheduling capacity – providing rapid decisions for thousands of simultaneous requests.

Design Challenges – The scheduler must balance these three metrics, which often conflict: improving one tends to degrade another, much as the CAP theorem forces a trade‑off among consistency, availability, and partition tolerance.

Industry Solutions – The article reviews approaches from Mesos (offer‑based allocation with pessimistic locking), Omega (shared state with optimistic, MVCC‑based concurrency control), and Borg/Kubernetes (serial plugin scheduling).

HULK Solution – HULK adopts a hybrid strategy:

It sacrifices some resource utilization to gain higher concurrent scheduling throughput in high‑traffic scenarios.

It supports configurable trade‑offs based on workload type (e.g., AP‑oriented internet services vs. CA‑oriented financial services).

The scheduler consists of three components: a request queue, a scheduling computation module, and a shared resource pool. The workflow is:

Upper‑level elastic scaling writes task IDs to the request queue.

The computation module consumes IDs, selects optimal placement, and requests resources from the pool.

Each host in the pool is represented by an Actor that manages its own resources.
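The idea of one actor per host can be sketched in a few lines. This is a minimal illustration, not HULK's implementation: the `HostActor` and `Reserve` names, the message shape, and the use of a Python thread plus a `queue.Queue` mailbox are all assumptions made for the example. The point it shows is that each host serializes its own reservations, so concurrent scheduling workers never need a shared lock.

```python
import queue
import threading
from dataclasses import dataclass, field


@dataclass
class Reserve:
    """Message asking a host to reserve resources (hypothetical shape)."""
    cpu: float
    mem_gb: float
    reply: queue.Queue = field(default_factory=queue.Queue)


class HostActor:
    """Owns one host's free resources and handles one message at a time."""

    def __init__(self, name, cpu, mem_gb):
        self.name = name
        self._cpu = cpu
        self._mem_gb = mem_gb
        self._mailbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        # Only this thread touches the counters, so no lock is needed.
        while True:
            msg = self._mailbox.get()
            ok = msg.cpu <= self._cpu and msg.mem_gb <= self._mem_gb
            if ok:
                self._cpu -= msg.cpu
                self._mem_gb -= msg.mem_gb
            msg.reply.put(ok)

    def ask(self, cpu, mem_gb, timeout=1.0):
        """Send a Reserve message and wait for the host's answer."""
        msg = Reserve(cpu, mem_gb)
        self._mailbox.put(msg)
        return msg.reply.get(timeout=timeout)


host = HostActor("host-a", cpu=4, mem_gb=8)
results = [host.ask(cpu=2, mem_gb=3) for _ in range(3)]
```

Here the third request is refused because the first two have already consumed all four CPU cores. A refusal is cheap: the scheduler simply moves on to another candidate host.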

Scheduling Computation – Uses a filter‑and‑rank approach similar to Kubernetes:

Hosts: shared cluster with hard and soft isolation (e.g., cgroups).

Filter (Predicates): eliminates hosts that violate over‑commit or anti‑affinity policies.

Rank (Priorities): scores remaining hosts based on factors such as mixed online/offline placement, load balancing, and custom weights.
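The filter‑and‑rank steps above can be sketched as a pair of pluggable function lists. The `Host` and `Request` fields, the `fits` predicate, and the `idle_score` priority are hypothetical names chosen for illustration; the source does not describe HULK's actual plugin interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    free_cpu: float
    free_mem_gb: float
    load: float  # 0.0 (idle) to 1.0 (saturated)


@dataclass(frozen=True)
class Request:
    cpu: float
    mem_gb: float


def fits(host, req):
    """Predicate: the host must have room for the request."""
    return host.free_cpu >= req.cpu and host.free_mem_gb >= req.mem_gb


def idle_score(host, req):
    """Priority: prefer lightly loaded hosts."""
    return 1.0 - host.load


def schedule(hosts, req, predicates, priorities):
    """Drop hosts failing any predicate, then sort by weighted score."""
    feasible = [h for h in hosts if all(p(h, req) for p in predicates)]

    def total(h):
        return sum(weight * fn(h, req) for fn, weight in priorities)

    return sorted(feasible, key=total, reverse=True)


hosts = [
    Host("a", free_cpu=8, free_mem_gb=16, load=0.7),
    Host("b", free_cpu=1, free_mem_gb=16, load=0.1),
    Host("c", free_cpu=8, free_mem_gb=16, load=0.2),
]
ranked = schedule(hosts, Request(cpu=2, mem_gb=4), [fits], [(idle_score, 1.0)])
```

Host `b` is the least loaded but is filtered out for lack of CPU, so the ranking is `c` followed by `a`. Keeping predicates and priorities as separate lists lets new policies be added without touching the scheduling loop.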

Over‑commit Mechanism – Compressible resources (CPU, I/O) are over‑committed with dynamic coefficients derived from real‑time host metrics; non‑compressible resources (memory, disk) are over‑committed only in test environments.
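As a rough illustration of a dynamic over‑commit coefficient, the sketch below shrinks the CPU multiplier as measured utilization rises. The linear rule, the `max_ratio` and `ceiling` defaults, the `test_ratio` value, and all function names are assumptions for the example; the source does not give HULK's actual formula.

```python
def overcommit_ratio(utilization, max_ratio=2.0, ceiling=0.7):
    """Return an over-commit multiplier for a compressible resource.

    Hypothetical rule: allow max_ratio on an idle host and shrink
    linearly to 1.0 (no over-commit) once utilization reaches ceiling.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    headroom = max(0.0, 1.0 - utilization / ceiling)
    return 1.0 + (max_ratio - 1.0) * headroom


def schedulable_cpu(physical_cores, utilization):
    """CPU cores the scheduler may hand out on this host."""
    return physical_cores * overcommit_ratio(utilization)


def schedulable_mem_gb(physical_gb, is_test_env, test_ratio=1.5):
    """Memory is non-compressible: over-commit only in test environments."""
    return physical_gb * (test_ratio if is_test_env else 1.0)
```

Under these assumed defaults, an idle 32‑core host is advertised as 64 schedulable cores, while a host already at or above the utilization ceiling gets no over‑commit. Production memory is never inflated.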

Instance Dispersion – Deploys instances of the same service across different hosts and racks to improve fault tolerance. Special constraints apply to Redis clusters (no more than 25% of instances on the same switch).
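A dispersion rule like this fits naturally as a filter predicate. The sketch below is one possible reading: it treats "different hosts" as a hard rule and applies the 25% per‑switch cap against the planned cluster size. The `Placement` type, the function name, and both interpretations are assumptions, not details confirmed by the source.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    host: str
    switch: str


def violates_dispersion(existing, candidate, cluster_size, max_switch_share=0.25):
    """Return True if placing on candidate breaks host or switch spreading.

    existing: placements already made for this service or cluster.
    cluster_size: planned total number of instances.
    """
    if any(p.host == candidate.host for p in existing):
        return True  # assumed hard rule: one instance per host
    per_switch = Counter(p.switch for p in existing)
    after = per_switch[candidate.switch] + 1
    return after > max_switch_share * cluster_size


existing = [Placement("h1", "sw1"), Placement("h2", "sw1")]
same_switch = violates_dispersion(existing, Placement("h3", "sw1"), cluster_size=8)
other_switch = violates_dispersion(existing, Placement("h3", "sw2"), cluster_size=8)
```

With eight planned instances, the cap is two per switch, so a third instance on `sw1` is rejected while one on `sw2` is allowed.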

Online‑Offline Co‑placement – Mixes latency‑sensitive online services with batch/offline jobs to improve overall utilization, while accounting for the risk of OOM (out‑of‑memory) failures in Java services.

Host Load Balancing – Scheduler weights hosts with lower CPU/Load/Memory/I/O metrics more heavily during ranking.
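One simple way to express such a ranking term is a weighted average of how idle each dimension is. The function name, the metric keys, and the weights below are hypothetical, and the sketch assumes every metric has already been normalized to a 0 to 1 utilization value.

```python
def load_score(metrics, weights):
    """Return a score in [0, 1]; higher means the host is less loaded.

    metrics: utilization per dimension, each normalized to 0..1.
    weights: relative importance per dimension (hypothetical values).
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    idle = sum(w * (1.0 - metrics[name]) for name, w in weights.items())
    return idle / total_weight


weights = {"cpu": 2, "load": 1, "mem": 1, "io": 1}
busy = {"cpu": 0.8, "load": 0.6, "mem": 0.5, "io": 0.4}
quiet = {"cpu": 0.2, "load": 0.3, "mem": 0.4, "io": 0.1}
prefer_quiet = load_score(quiet, weights) > load_score(busy, weights)
```

Normalizing by the total weight keeps the score comparable with other ranking terms even when the weights are retuned.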

Resource Pool Allocation – After ranking, the scheduler attempts the top‑N candidate hosts sequentially until one accepts the request. The value of N is dynamically adjusted based on current request volume.
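The top‑N fallback can be sketched as a short loop. The source only says that N follows request volume, so `pick_n`, its thresholds, and the choice to widen the window under load are assumptions; `try_reserve` stands in for whatever call asks a host to accept the request.

```python
def pick_n(pending_requests, n_min=1, n_max=5, step=100):
    """Hypothetical rule: widen the candidate window as the queue grows."""
    return min(n_max, n_min + pending_requests // step)


def place(ranked_hosts, try_reserve, n):
    """Try the top-n hosts in rank order; return the first that accepts."""
    for host in ranked_hosts[:n]:
        if try_reserve(host):
            return host
    return None  # caller may requeue the task or recompute the ranking


ranked = ["a", "b", "c"]
chosen = place(ranked, lambda host: host == "b", n=pick_n(pending_requests=150))
```

In this example the best‑ranked host `a` refuses, and the request lands on `b` without recomputing the ranking. A larger N trades some placement quality for fewer scheduling retries when many requests compete for the same hosts.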

Summary of Scheduling Model – HULK’s shared‑state, actor‑based approach resembles Omega’s optimistic concurrency but replaces the MVCC database with lightweight actors, achieving higher concurrency and lower retry overhead.

Conclusion – The HULK scheduling system provides a scalable, cloud‑native solution for large‑scale containerized workloads at Meituan‑Dianping, with future work planned for deeper intelligent scheduling of big‑data offline tasks and continued collaboration with open‑source projects such as Kubernetes.
