How Meituan Scaled Its Container Platform to 30,000 Pods: Lessons from HULK

This article details Meituan's journey from early container adoption to the HULK 2.0 platform, covering architecture, isolation, stability, performance, and internal adoption challenges, and shares the engineering solutions that enabled more than 30,000 containers to run reliably at massive scale.

21CTO

Background

Meituan's container cluster management platform is called HULK, after the "green giant" whose ability to grow and shrink mirrors the elastic scaling of containers. Meituan began using containers in 2016 while retaining its existing systems, such as the CMDB, service governance, monitoring, and deployment platforms, and integrated the container lifecycle with these assets.

In 2018, HULK 2.0 replaced the OpenStack-based scheduler with the industry-standard Kubernetes (K8s) and added richer elasticity policies along with numerous optimizations.

Today Meituan runs more than 3,000 services on over 30,000 container instances, including core services with high-concurrency, low-latency requirements.

Basic Architecture of Meituan's Container Platform

The platform integrates with service governance, deployment, CMDB, and monitoring systems, providing a VM‑like experience for developers.

Key layers (from bottom to top):

Physical resources: CPU, memory, disk, network.

Host OS: CentOS 7 with a customized 3.10 Linux kernel adding Meituan‑specific features and tuning for high‑concurrency workloads.

Container runtime: Docker 1.13 with Meituan‑added features; HULK Agent manages hosts; Falcon Agent collects metrics.

Container images: support for CentOS 6 and 7, init processes, systemd, and a wide range of languages (Java, Python, Node.js, C/C++). Additional agents provide service governance, logging, encryption, etc.

Isolation Challenges and Solutions

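Inside a container, tools and applications originally saw the host's CPU and memory totals rather than the container's limits. Applications that sized themselves from these figures (for example, thread pools or heap sizes) "self-inflated" beyond their cgroup limits and failed with OOM errors. Meituan modified the kernel to achieve an effect similar to LXCFS, so that CPU and memory queries inside a container respect its cgroup limits.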
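The container-correct view that the kernel change provides can be approximated in user space by reading the cgroup limits directly. The sketch below is illustrative only, not Meituan's kernel patch: it assumes cgroup v1 mounted at `/sys/fs/cgroup` with separate `cpu` and `memory` hierarchies, and the helper names are hypothetical.

```python
import math
from pathlib import Path

# cgroup v1 reports "no memory limit" as a huge page-aligned value just below
# 2**63; treat anything at or above this threshold as unlimited.
UNLIMITED_THRESHOLD = 1 << 62


def effective_cpus(quota_us: int, period_us: int, host_cpus: int) -> int:
    """CPUs a container may use under a CFS quota, rounded up to whole CPUs."""
    if quota_us <= 0 or period_us <= 0:  # a quota of -1 means "no limit"
        return host_cpus
    return max(1, min(host_cpus, math.ceil(quota_us / period_us)))


def effective_memory(limit_bytes: int, host_bytes: int) -> int:
    """Memory a container may use under memory.limit_in_bytes."""
    if limit_bytes >= UNLIMITED_THRESHOLD:
        return host_bytes
    return min(limit_bytes, host_bytes)


def read_int(path: Path) -> int:
    return int(path.read_text().strip())


def container_view(cgroup_root: Path, host_cpus: int, host_bytes: int) -> dict:
    """Combine the cgroup v1 files into the resources the container really has."""
    quota = read_int(cgroup_root / "cpu" / "cpu.cfs_quota_us")
    period = read_int(cgroup_root / "cpu" / "cpu.cfs_period_us")
    limit = read_int(cgroup_root / "memory" / "memory.limit_in_bytes")
    return {
        "cpus": effective_cpus(quota, period, host_cpus),
        "memory_bytes": effective_memory(limit, host_bytes),
    }
```

Inside a container this might be called as `container_view(Path("/sys/fs/cgroup"), os.cpu_count(), host_mem_bytes)`. Rounding a fractional quota up (1.5 CPUs becomes 2) is a design choice; rounding down would be stricter but can leave a small container reporting zero CPUs.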

For Java workloads, the inflated CPU count caused the JVM to start far more GC threads than the container's CPU quota could serve. Solutions included explicit JVM flags, a patched glibc that reports the container's CPU count, and kernel patches that make the correct count visible transparently to every application.
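To see why the CPU count matters, the sketch below reproduces the default heuristic HotSpot uses on most platforms to size its parallel GC thread pool: one thread per CPU up to 8, then 5/8 of a thread per additional CPU. A JVM that believes it has 48 CPUs starts 33 GC threads even if its container is limited to 4. The flag builder is a hypothetical helper; `-XX:ParallelGCThreads` is a standard HotSpot option, while `-XX:ActiveProcessorCount` is only understood by JDK 10+ and JDK 8u191+.

```python
def hotspot_parallel_gc_threads(ncpus: int) -> int:
    """Default ParallelGCThreads heuristic HotSpot uses on most platforms:
    one GC thread per CPU up to 8 CPUs, then 5/8 of a thread per extra CPU."""
    if ncpus <= 8:
        return ncpus
    return 8 + (ncpus - 8) * 5 // 8


def jvm_cpu_flags(container_cpus: int, active_processor_count: bool = False) -> list[str]:
    """Hypothetical helper: size JVM threading from the container's CPU budget."""
    flags = [f"-XX:ParallelGCThreads={hotspot_parallel_gc_threads(container_cpus)}"]
    if active_processor_count:
        # Makes the JVM itself believe it has this many CPUs (JDK 10+, 8u191+).
        flags.append(f"-XX:ActiveProcessorCount={container_cpus}")
    return flags
```

Passing explicit flags fixes only the JVM; the glibc and kernel approaches described above fix the count for every process in the container, which is why they were pursued as well.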

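Running containers with root privileges caused security and performance problems, so the platform no longer grants full root: by default a container receives only the CAP_SYS_PTRACE and CAP_SYS_ADMIN capabilities, and individual services can be granted additional permissions through per-service overrides.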
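One way to express this default is as a Kubernetes-style container `securityContext`, built here in Python. This is a sketch under assumptions: it reads "dropping root" as not running containers in privileged mode and adding the two capabilities on top of the runtime's defaults, and `security_context` with its `extra_caps` override list is hypothetical. Kubernetes spells capability names without the `CAP_` prefix, so the helper normalizes them.

```python
# Capabilities every container receives by default under this sketch.
DEFAULT_ADD_CAPS = ["SYS_PTRACE", "SYS_ADMIN"]


def security_context(extra_caps=None) -> dict:
    """Build a Kubernetes-style securityContext for one container.

    extra_caps: hypothetical per-service override list, e.g. ["NET_ADMIN"].
    """
    add = list(DEFAULT_ADD_CAPS)  # copy so the shared default is never mutated
    for cap in extra_caps or []:
        name = cap.upper().removeprefix("CAP_")  # accept "cap_net_admin" too
        if name not in add:
            add.append(name)
    return {
        "privileged": False,  # never run the container in privileged mode
        "capabilities": {"add": add},
    }
```

A stricter variant would also set `"drop": ["ALL"]`, but that removes capabilities such as `CHOWN` and `SETUID` that many ordinary processes rely on, so it is left out of this default.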

Stability Improvements

Stability issues stemmed largely from Linux kernel and Docker bugs. Meituan contributed patches for buffered-I/O rate limiting in the kernel, Ext4 bugs, and crashes in Docker exec and the Docker daemon, collaborating with Red Hat and the upstream communities.

Performance Optimizations

Performance focuses on service throughput inside containers and container operation latency.

CPU allocation is NUMA-aware: specific logical CPUs are dedicated to network interrupts and host tasks, and a container's CPUs are kept on a single NUMA node whenever possible. Meituan reports throughput gains of more than 30% for compute-intensive workloads.
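The placement policy above can be sketched as a small CPU allocator. This illustrates the idea and is not Meituan's scheduler: the topology input, the reserved set for interrupt and host CPUs, and the name `pick_cpus` are all hypothetical. It prefers a best fit on one node and only spans nodes when no single node has enough free CPUs.

```python
from typing import Dict, List, Optional, Set


def pick_cpus(
    topology: Dict[int, List[int]],
    reserved: Set[int],
    allocated: Set[int],
    want: int,
) -> Optional[List[int]]:
    """Pick `want` logical CPUs for a container, preferring one NUMA node.

    topology  -- NUMA node id -> logical CPU ids on that node
    reserved  -- CPUs kept for NIC interrupts and host daemons
    allocated -- CPUs already bound to other containers
    Returns None when the host cannot supply `want` CPUs at all.
    """
    free_by_node = {
        node: [c for c in cpus if c not in reserved and c not in allocated]
        for node, cpus in topology.items()
    }

    # Preferred: best fit on a single node, i.e. the node with the fewest free
    # CPUs that can still hold the whole request. This keeps larger free blocks
    # intact for bigger containers and avoids cross-node memory access.
    fitting = [n for n, free in free_by_node.items() if len(free) >= want]
    if fitting:
        node = min(fitting, key=lambda n: (len(free_by_node[n]), n))
        return free_by_node[node][:want]

    # Fallback: span nodes, drawing from the nodes with the most free CPUs
    # first so the container touches as few nodes as possible.
    pool: List[int] = []
    for node in sorted(free_by_node, key=lambda n: (-len(free_by_node[n]), n)):
        pool.extend(free_by_node[node])
    return pool[:want] if len(pool) >= want else None
```

The returned CPU list could then be applied as a cpuset, for example through Docker's `--cpuset-cpus` option.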

For the file system, Ext4 was chosen and run in writeback journaling mode (data=writeback) for speed, with tmpfs used for temporary files.

Image handling improvements include:

Multi‑site image synchronization and pre‑distribution of base images.

P2P image distribution to reduce bandwidth pressure.

Parallel layer decompression using pgzip and a RAM‑disk for temporary data, dramatically cutting image pull and expand times.
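The idea of decompressing layers in parallel can be illustrated with the Python standard library. This sketch does not reproduce pgzip; it shows a different, simpler form of parallelism, decompressing several independent gzip layers at once. It relies on CPython's `zlib` releasing the GIL while inflating, so the worker threads genuinely overlap. The function names and the in-memory layer blobs are assumptions for illustration.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor
from typing import List


def decompress_layer(blob: bytes) -> bytes:
    """Decompress one gzip-compressed image layer held in memory."""
    return gzip.decompress(blob)


def decompress_layers(blobs: List[bytes], workers: int = 4) -> List[bytes]:
    """Decompress several layers concurrently.

    pool.map returns results in input order, so the layer order needed for
    stacking the image filesystem is preserved.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decompress_layer, blobs))
```

In a real pull path, the decompressed output could be staged in a RAM-backed scratch directory (a tmpfs mount such as `/dev/shm` is one assumption) before being unpacked into the layer store, which matches the RAM-disk step described above.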

Promotion Strategies

To drive container adoption, Meituan emphasizes three advantages: lightweight and fast startup, identical development‑test‑production environments via image distribution, and elastic scaling based on resource or business metrics.
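The article does not give HULK's scaling formula, so as a stand-in this sketch uses the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current_replicas * current_metric / target_metric), skipped when the ratio is within the default 10% tolerance. The metric can be a resource metric (CPU utilization) or a business metric such as QPS per instance. The function name and replica bounds are hypothetical, and the real HPA's stabilization windows and missing-metric handling are omitted.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,
    min_replicas: int = 1,
    max_replicas: int = 100,
) -> int:
    """HPA-style replica calculation for a per-instance metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: do nothing, which avoids flapping.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the service's configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 instances at 90% CPU against a 60% target scale to 6, and 3 instances serving 1,500 QPS each against a 1,000 QPS target scale to 5.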

The article also highlights product positioning, seamless integration with existing systems, developer-friendly tooling, smooth VM-to-container migration paths, close collaboration with application teams, and steering resource allocation strategically toward containers.

Conclusion

Docker containers combined with Kubernetes orchestration form the mainstream cloud‑native practice, and Meituan's HULK platform embodies this approach. The article shares Meituan's kernel, Docker, and Kubernetes optimizations, as well as broader thoughts on advancing containerization at massive scale.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
