Meituan Elastic Scaling System: Architecture, Challenges, and Business Enablement

This article presents Meituan's elastic scaling platform, detailing its evolution from Hulk 1.0 to Hulk 2.0, the technical and operational challenges faced, the solutions implemented for resource management and multi‑tenant scaling, and real‑world business scenarios such as holiday, peak‑hour, and emergency capacity provisioning.

High Availability Architecture

Apr 15, 2021

Introduction

Stable, efficient, and reliable infrastructure is what lets internet companies absorb peak traffic. Meituan's basic technology platform built an elastic scaling system, first on Docker and later on Kubernetes, to provision resources automatically and cost‑effectively.

1. Elastic Scaling System Evolution

1.0 Architecture

Hulk 1.0 introduced a container platform built on OpenStack to address slow instance provisioning, slow resource reclamation, and redundant capacity.

Key modules include User Portal, Hulk‑ApiServer, Hulk‑Policy, Hulk Data Sources (OCTO, CAT, Falcon), and Scheduler.

2.0 Architecture

Hulk 2.0 replaces OpenStack with Kubernetes, introduces micro‑service components (API‑Server, Engine, Metrics‑Server/Data, Resource‑Server), builds a service‑portrait data platform, and adds observability modules (Alarm, Scanner).

2. Challenges and Countermeasures

2.1 Technical Challenges

Three‑phase goals: MVP (replace OpenStack with Kubernetes), pilot deployments, and unifying user‑side platforms to reduce cost.

Elastic scheduling issues such as missed expansions, over‑expansions, and IDC‑centric limitations.

Resource quota management and multi‑tenant isolation.

Configuration flexibility and integration with traffic‑group, elastic‑group, and release‑group mappings.
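
The article does not describe Hulk's decision algorithm, so the following is only a minimal sketch of how missed expansions, over‑expansions, and quota limits can interact in one scaling decision. It uses the proportional rule familiar from the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × metric / target)). The function name, the tolerance band, the per‑decision step limit, and the quota parameter are all assumptions made for illustration.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     tolerance: float = 0.1, max_step: int = 10,
                     quota: int = 100, floor: int = 1) -> int:
    """Proportional scaling rule with guards (hypothetical parameters)."""
    if current <= 0 or target <= 0:
        raise ValueError("current and target must be positive")
    ratio = metric / target
    # Inside the tolerance band, hold steady to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current
    wanted = math.ceil(current * ratio)
    # Limit the change per decision so one noisy sample cannot over-expand.
    wanted = max(current - max_step, min(current + max_step, wanted))
    # Respect the tenant quota and a minimum instance count.
    return max(floor, min(quota, wanted))
```

A tighter step limit trades slower reaction to real bursts for protection against over‑expansion, while the quota cap keeps one tenant from consuming capacity reserved for others.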

2.2 Promotion Strategy

Data‑driven identification of suitable services, value quantification (burst handling, cost saving, automation), deep business engagement, technical training, and closed‑loop feedback.

2.3 Operational Difficulties

Typical problems after a service is onboarded include configuration mismatches, startup failures, and performance bottlenecks. They are addressed with measures taken before, during, and after deployment.

3. Business Enablement Scenarios

3.1 Holiday Scaling

Scheduled tasks automatically expand resources for holidays, reducing costs by about 20.4%.

3.2 Daily Peak Scaling

Scheduled expansion before the lunch peak and scale‑in afterward cut resident machines by 365, with elastic instances accounting for 15% of total capacity at peak hours.
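
A periodic schedule of this kind can be sketched as a set of time windows layered on a resident baseline. This is an illustration only: the `Window` type, the window times, and the instance counts are hypothetical, chosen so that elastic instances make up 15% of peak capacity as in the figure above.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Window:
    start: time   # expansion takes effect (inclusive)
    end: time     # scale-in takes effect (exclusive)
    extra: int    # elastic instances added on top of the resident baseline


def instances_at(now: time, baseline: int, windows: list[Window]) -> int:
    """Return the instance count the schedule asks for at `now`."""
    # If windows overlap, the largest requested expansion wins.
    extra = max((w.extra for w in windows if w.start <= now < w.end), default=0)
    return baseline + extra


# Hypothetical schedule: expand ahead of the lunch peak, shrink afterwards.
LUNCH = Window(start=time(10, 30), end=time(13, 30), extra=15)

peak = instances_at(time(11, 0), baseline=85, windows=[LUNCH])      # 100
off_peak = instances_at(time(14, 0), baseline=85, windows=[LUNCH])  # 85
```

Starting the window before the peak itself leaves time for instances to start and warm up before traffic arrives.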

3.3 Emergency Resource Assurance

During large events, Meituan purchases public‑cloud hosts as emergency pools, providing >700 high‑spec containers (≈7000 CPU cores) for anti‑fraud services.
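
As a quick check on what "high‑spec" means here, the two figures imply roughly 10 CPU cores per container. The calculation below treats the article's lower bound and approximation as exact values, which is an assumption.

```python
containers = 700   # article states ">700", so this is a lower bound
cpu_cores = 7000   # article states "≈7000"

# Implied size of one emergency container, in CPU cores.
cores_per_container = cpu_cores / containers
print(cores_per_container)  # 10.0
```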

3.4 Service‑Chain Scaling

Automated monthly topology tasks replace manual scaling for over 70 services, dramatically improving efficiency.
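
The article does not say how the topology tasks size each service. One plausible approach, sketched below under stated assumptions, is to propagate expected entry traffic through the call graph and size every service on the chain from its inbound load. The function, the fan‑out ratios, the service names, and the per‑instance capacities are all hypothetical, and the graph is assumed to be acyclic.

```python
import math


def chain_plan(calls: dict[str, dict[str, float]], entry: str,
               entry_qps: float,
               qps_per_instance: dict[str, float]) -> dict[str, int]:
    """Size every service reachable from `entry` in an acyclic call graph.

    calls[a][b] is the number of calls a makes to b per inbound request.
    """
    order: list[str] = []
    seen: set[str] = set()

    def visit(svc: str) -> None:
        if svc in seen:
            return
        seen.add(svc)
        for callee in calls.get(svc, {}):
            visit(callee)
        order.append(svc)  # post-order: callees are appended before callers

    visit(entry)
    qps = {svc: 0.0 for svc in order}
    qps[entry] = entry_qps
    # Walk callers before callees so each service sees its full inbound load.
    for svc in reversed(order):
        for callee, fan_out in calls.get(svc, {}).items():
            qps[callee] += qps[svc] * fan_out
    return {svc: math.ceil(qps[svc] / qps_per_instance[svc]) for svc in order}


# Hypothetical topology: a gateway fanning out to two services that share one.
calls = {
    "gateway": {"order": 1.0, "search": 2.0},
    "order": {"pay": 0.5},
    "search": {"pay": 0.25},
}
capacity = {"gateway": 200.0, "order": 100.0, "search": 250.0, "pay": 100.0}
plan = chain_plan(calls, "gateway", 1000.0, capacity)
```

Sizing the whole chain together avoids the case where an upstream service is expanded but a shared downstream dependency becomes the bottleneck.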

4. Future Plans

Focus on stability (system robustness, instance health), usability (pre‑run simulations, auto‑tuning), expanded business solutions (link‑level scaling, dedicated zone scaling), and exploration of new cloud‑native technologies such as Knative and KEDA.

Author

Tu Yang, Head of Elastic Strategy Team, Meituan Infrastructure Department.
