Meituan Elastic Scaling System: Evolution, Challenges, and Business Enablement

This article introduces Meituan's elastic scaling platform, detailing its evolution from version 1.0 to 2.0, the technical and operational challenges faced, the strategies adopted for promotion and resource management, and several real‑world business scenarios where elastic scaling reduces cost and improves reliability.

High Availability Architecture

May 3, 2021

Elastic scaling provides business value such as handling sudden spikes, reducing costs, and enabling automation. The platform aggregates scattered idle resources into a large resource pool and balances cost and performance through elastic scheduling and inventory control.

This article introduces Meituan's elastic scaling system, the technical challenges encountered during its rollout, the promotion strategy, and operational considerations. In Meituan's diverse business environment, elastic scaling shares commonalities with public‑cloud and private‑cloud solutions while also presenting unique characteristics.

1. Elastic Scaling System Evolution

Elastic Scaling 1.0 Architecture

Meituan first experimented with containers in 2016, launching the OpenStack‑based container cluster platform Hulk 1.0. After two years of exploration, Elastic Scaling 1.0 was released to solve slow instance provisioning, slow rollout, slow resource reclamation, and resource redundancy.

From 2018 onward, the container platform was upgraded to Hulk 2.0, which replaced OpenStack with the de‑facto standard Kubernetes and introduced Elastic Scaling 2.0 on the PaaS layer.

Four problems drove this evolution:

Inconsistent business code versions: caused logic errors and financial loss.

Insufficient resources during peak periods: prevented services from handling traffic.

High platform maintenance cost: Beijing and Shanghai each had a separate management platform.

Low configuration flexibility: each new IDC required manual configuration.

2. Challenges and Countermeasures

In 2018, before Hulk 2.0, the 1.0 platform suffered from slow provisioning, slow rollout, and resource waste. Hulk 2.0 was designed with four major architectural upgrades:

Scheduler replacement: OpenStack was swapped for Kubernetes, and dedicated and emergency resource pools were added.

Monolith to micro‑service: API‑Server, Engine, Metrics‑Server/Data, and Resource‑Server were introduced.

Service portrait data platform: Portrait‑Server and Portrait‑Data provide service profiling.

Observability: Alarm and Scanner handle monitoring and operational governance.

Technical challenges (Phase 1‑3)

Phase 1: MVP of Elastic Scaling 2.0, replacing the OpenStack ecosystem with Kubernetes while keeping existing functionality.

Phase 2: Pilot with selected services and integrate the Beijing and Shanghai CMDB logic.

Phase 3: Consolidate user‑side portals into a single interface to reduce learning and maintenance cost.

Elastic scheduling

A typical workflow includes creating an elastic group, configuring instance specs, defining scaling rules, and setting up monitoring or timed tasks. Real‑world issues observed:

"Should have scaled out but did not": instances in a new IDC were not added to an elastic group.

"Should not have scaled out but did": a retired IDC still had active scaling tasks.

IDC‑centric scaling limited global optimization.

Inconsistent business logic across IDC caused anomalies.

To address these, Meituan aligned traffic groups, elastic groups, and release groups, ensuring one‑to‑one mapping between traffic and elastic groups.
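The alignment described above can be checked mechanically. The following is a minimal sketch, not Meituan's implementation: the `ElasticGroup` class, the `audit_groups` function, and all group and IDC names are hypothetical. It flags traffic groups that do not map to exactly one elastic group, elastic groups that still target a retired IDC, and live IDCs that no elastic group covers.

```python
from dataclasses import dataclass, field


@dataclass
class ElasticGroup:
    # Hypothetical model; field names are assumptions, not the platform's schema.
    name: str
    traffic_group: str                       # traffic group this elastic group serves
    idcs: set = field(default_factory=set)   # IDCs the group may scale in


def audit_groups(traffic_groups, elastic_groups, live_idcs):
    """Return a list of findings; an empty list means the mapping is consistent."""
    findings = []

    # Index elastic groups by the traffic group they claim to serve.
    by_traffic = {}
    for eg in elastic_groups:
        by_traffic.setdefault(eg.traffic_group, []).append(eg)

    # One-to-one check: every traffic group needs exactly one elastic group.
    for tg in sorted(traffic_groups):
        count = len(by_traffic.get(tg, []))
        if count != 1:
            findings.append(f"traffic group {tg!r} maps to {count} elastic groups")

    covered = set()
    for eg in elastic_groups:
        covered |= eg.idcs
        # "Should not have scaled out but did": group still targets a retired IDC.
        for idc in sorted(eg.idcs - live_idcs):
            findings.append(f"elastic group {eg.name!r} still targets retired IDC {idc!r}")

    # "Should have scaled out but did not": a live IDC nobody scales in.
    for idc in sorted(live_idcs - covered):
        findings.append(f"live IDC {idc!r} is not covered by any elastic group")
    return findings
```

Running such an audit whenever an IDC is added or retired would surface both failure modes before a scaling task fires, under the assumption that IDC membership is available as plain sets.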

Inventory control

Elastic scaling must guarantee resource availability while avoiding idle capacity. Meituan adopts multi‑tenant quotas, a 99.9 % scaling success SLA, and a water‑level monitoring mechanism to balance over‑commitment.

Multi‑tenant management: each business line receives a default quota; adjustments are handled via tickets.

Normal‑state resource guarantee: hourly prediction of pool usage; changes that would exceed capacity are rejected.

Emergency‑state resource guarantee: combines resident pool with on‑demand public‑cloud resources; emergency resources are released after the event.
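The three guarantees above can be read as an admission check applied to each scale‑out request. The sketch below is illustrative only: the `Pool` class, the `admit` function, the 0.8 high‑water threshold, and the returned strings are all assumptions, and capacity is counted in CPU cores purely for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    capacity: int            # total cores in the resident pool
    predicted_usage: int     # forecast cores in use for the coming hour
    high_water: float = 0.8  # assumed alert threshold, not from the article


def admit(pool, tenant_used, tenant_quota, request, emergency=False):
    """Decide whether a scale-out request for `request` cores may proceed."""
    # Multi-tenant management: a business line cannot exceed its quota.
    if tenant_used + request > tenant_quota:
        return "reject: tenant quota exceeded"

    projected = pool.predicted_usage + request
    # Normal state: comfortably below the water level.
    if projected <= pool.capacity * pool.high_water:
        return "accept"
    # Still fits, but the water-level monitor should notice.
    if projected <= pool.capacity:
        return "accept: above high-water mark, raise alert"
    # Emergency state: fall back to on-demand capacity outside the resident pool.
    if emergency:
        return "accept: borrow emergency capacity"
    return "reject: pool capacity exceeded"
```

For example, with `Pool(capacity=1000, predicted_usage=700)`, a 50‑core request is accepted outright, a 200‑core request is accepted with an alert, and a 400‑core request is rejected unless it is flagged as an emergency.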

3. Business Enablement Scenarios

3.1 Holiday Scaling

During holidays traffic can be 3‑5 times the weekday level. By configuring timed scaling tasks, Meituan expands resources before holidays and contracts them afterward, saving an average of 20.4 % cost.

3.2 Daily Peak Scaling

For the lunch‑time surge in delivery services, timed tasks add 125 instances at 09:55 and remove them at 14:00, reducing permanent machines by 365 units while keeping peak capacity.
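A timed task of this kind reduces to a window check on the wall clock. The sketch below uses the times and instance count quoted above; the function names are hypothetical, and it models only the 125 timed instances, not how the reduction in permanent machines was derived.

```python
from datetime import time

SCALE_OUT_AT = time(9, 55)   # from the article
SCALE_IN_AT = time(14, 0)    # from the article
EXTRA_INSTANCES = 125        # from the article


def timed_extra(now):
    """Extra instances the timed task should hold at wall-clock time `now`."""
    return EXTRA_INSTANCES if SCALE_OUT_AT <= now < SCALE_IN_AT else 0


def window_minutes():
    """Length of the daily scale-out window in minutes."""
    start = SCALE_OUT_AT.hour * 60 + SCALE_OUT_AT.minute
    end = SCALE_IN_AT.hour * 60 + SCALE_IN_AT.minute
    return end - start
```

The window is 245 minutes long, so the 125 timed instances consume 125 × 245 = 30,625 instance‑minutes per day, compared with 125 × 1,440 = 180,000 if the same instances ran around the clock.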

3.3 Emergency Resource Assurance

Anti‑scraping services experience massive traffic during promotions. By procuring public‑cloud hosts as emergency resources, Meituan supplied over 700 high‑spec containers (≈7 000 CPU cores) across five large events.

3.4 Service‑Chain Scaling

For SaaS customers needing temporary gray‑release machines, a chain‑topology task automatically expands and contracts resources each month, reducing manual effort from dozens of owners to a single engineer.
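The article does not say how a chain‑topology task orders its work. One plausible approach, sketched here as an assumption, is to scale out downstream services before their callers and to scale in using the reverse order. The service names and both function names are hypothetical; `graphlib` is part of the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter


def scale_out_order(depends_on):
    """Order services so that every callee is expanded before its callers.

    `depends_on` maps each service to the set of services it calls.
    """
    return list(TopologicalSorter(depends_on).static_order())


def scale_in_order(depends_on):
    """Contract in the opposite order: callers first, callees last."""
    return list(reversed(scale_out_order(depends_on)))


# Hypothetical three-service chain: gateway -> order -> inventory.
chain = {"gateway": {"order"}, "order": {"inventory"}}
```

With this chain, `scale_out_order(chain)` yields `inventory`, `order`, `gateway`, so capacity exists downstream before upstream traffic can reach it. A cyclic topology would raise `graphlib.CycleError`, which a real task would need to handle.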

4. Future Plans

Meituan will continue to deepen Elastic Scaling 2.0 in four directions:

Stability: improve system robustness, instance health detection, and resource QoS.

Usability: enhance pre‑run prediction, automate task recommendations, and provide post‑deployment benefit reports.

Business solutions: extend link‑level scaling and support dedicated zones such as Oceanus for finance.

New‑technology exploration: adopt concepts from Knative and KEDA for cloud‑native workloads.

Author

Tu Yang, Head of Elastic Strategy Team, Meituan Infrastructure Department.
