Meituan Elastic Scaling System: Architecture, Challenges, and Business Enablement
This article presents Meituan's elastic scaling platform, detailing its evolution from Hulk 1.0 to Hulk 2.0, the technical and operational challenges faced, the solutions implemented for resource management and multi‑tenant scaling, and real‑world business scenarios such as holiday, peak‑hour, and emergency capacity provisioning.
Introduction
Stable, efficient, and reliable infrastructure is the foundation for handling peak traffic in internet companies. Meituan's unified basic technology platform has built an elastic scaling system based on Docker and later Kubernetes to provide cost‑effective, automated resource provisioning.
1. Elastic Scaling System Evolution
Hulk 1.0 Architecture
Hulk 1.0 introduced a container platform built on OpenStack to address three problems: slow instance provisioning, slow resource reclamation, and redundant capacity.
Key modules include User Portal, Hulk‑ApiServer, Hulk‑Policy, Hulk Data Sources (OCTO, CAT, Falcon), and Scheduler.
Hulk 2.0 Architecture
Hulk 2.0 replaces OpenStack with Kubernetes, introduces micro‑service components (API‑Server, Engine, Metrics‑Server/Data, Resource‑Server), builds a service‑portrait data platform, and adds observability modules (Alarm, Scanner).
2. Challenges and Countermeasures
2.1 Technical Challenges
Three‑phase goals: MVP (replace OpenStack with Kubernetes), pilot deployments, and unifying user‑side platforms to reduce cost.
Elastic scheduling issues such as missed expansions, over‑expansions, and IDC‑centric limitations.
Resource quota management and multi‑tenant isolation.
Configuration flexibility and integration with traffic‑group, elastic‑group, and release‑group mappings.
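To make the missed-expansion and over-expansion problems concrete, here is a minimal sketch of a metric-driven scaling decision. It uses the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × metric / target), with a tolerance band); the source does not say which rule Hulk applies, so treat this as an illustration only. The function name, parameters, and all numbers are assumptions.

```python
import math


def desired_replicas(current, metric, target, min_replicas, max_replicas,
                     tolerance=0.1):
    """Proportional scaling rule with a tolerance band and hard bounds.

    Hypothetical helper, not Hulk's actual policy. `metric` and `target`
    share one unit, for example average CPU utilisation in percent.
    """
    if current <= 0 or target <= 0:
        raise ValueError("current and target must be positive")

    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current size to avoid flapping.
        proposed = current
    else:
        proposed = math.ceil(current * metric / target)

    # The upper bound limits over-expansion; the lower bound keeps a floor
    # of instances when the metric briefly drops.
    return max(min_replicas, min(max_replicas, proposed))


if __name__ == "__main__":
    print(desired_replicas(10, 90, 60, min_replicas=2, max_replicas=50))
    print(desired_replicas(10, 600, 60, min_replicas=2, max_replicas=50))
```

In this sketch the first call scales 10 instances up to 15, while the second would ask for 100 and is capped at 50. A per-tenant `max_replicas` is also one simple way to express a resource quota.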
2.2 Promotion Strategy
Data‑driven identification of suitable services, value quantification (burst handling, cost saving, automation), deep business engagement, technical training, and closed‑loop feedback.
2.3 Operational Difficulties
Typical post‑deployment problems include configuration mismatches, startup failures, and performance bottlenecks, addressed through pre‑, during‑, and post‑deployment measures.
3. Business Enablement Scenarios
3.1 Holiday Scaling
Scheduled tasks automatically expand resources for holidays, reducing costs by about 20.4%.
3.2 Daily Peak Scaling
Scheduled expansions before lunch peaks and shrinkage afterward cut resident machines by 365 units, with peak‑hour elastic instances accounting for 15% of total capacity.
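A time-window policy of this kind can be sketched as follows. The class, the function names, the window times, and the instance counts are all hypothetical; the counts are chosen only so that the elastic share works out to the 15% mentioned above.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class ScheduledWindow:
    """A daily window during which extra elastic instances are kept running."""
    start: time           # inclusive
    end: time             # exclusive
    extra_instances: int


def instances_at(now, resident, windows):
    """Return the total instance count wanted at time-of-day `now`."""
    # If windows overlap, take the largest request rather than the sum.
    extra = max(
        (w.extra_instances for w in windows if w.start <= now < w.end),
        default=0,
    )
    return resident + extra


def elastic_share(resident, extra):
    """Fraction of peak capacity served by elastic instances."""
    return extra / (resident + extra)


if __name__ == "__main__":
    lunch = ScheduledWindow(time(10, 30), time(13, 30), extra_instances=15)
    print(instances_at(time(12, 0), resident=85, windows=[lunch]))  # in window
    print(instances_at(time(15, 0), resident=85, windows=[lunch]))  # outside
    print(elastic_share(85, 15))
```

Starting the window before the peak leaves time for new instances to boot and warm up before traffic arrives.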
3.3 Emergency Resource Assurance
During large events, Meituan purchases public‑cloud hosts as emergency pools, providing >700 high‑spec containers (≈7000 CPU cores) for anti‑fraud services.
3.4 Service‑Chain Scaling
Automated monthly topology tasks replace manual scaling for over 70 services, dramatically improving efficiency.
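The source does not explain how the topology tasks size each service. One plausible approach, sketched here under that assumption, is to propagate expected entry traffic down the call graph and then divide by per-instance capacity. Service names, fan-out ratios, and capacities are invented for illustration; `graphlib` requires Python 3.9 or later.

```python
import math
from graphlib import TopologicalSorter


def propagate_load(entry_qps, calls):
    """Propagate request rates along a service call graph.

    entry_qps: {service: external requests per second}
    calls:     {caller: {callee: downstream calls per caller request}}
    Raises graphlib.CycleError if the call graph contains a cycle.
    """
    # TopologicalSorter expects node -> predecessors, so invert the call map.
    preds = {}
    for caller, callees in calls.items():
        preds.setdefault(caller, set())
        for callee in callees:
            preds.setdefault(callee, set()).add(caller)

    load = dict.fromkeys(preds, 0.0)
    for svc, qps in entry_qps.items():
        load[svc] = load.get(svc, 0.0) + qps

    # Callers are visited before callees, so each load is final when used.
    for svc in TopologicalSorter(preds).static_order():
        for callee, fanout in calls.get(svc, {}).items():
            load[callee] += load[svc] * fanout
    return load


def instances_needed(load, qps_per_instance):
    """Round each service's load up to a whole number of instances."""
    return {s: math.ceil(load[s] / qps_per_instance[s]) for s in load}


if __name__ == "__main__":
    calls = {
        "gateway": {"order": 1.0, "search": 2.0},
        "order": {"payment": 0.5},
    }
    load = propagate_load({"gateway": 1000.0}, calls)
    capacity = {"gateway": 200, "order": 100, "search": 250, "payment": 100}
    print(instances_needed(load, capacity))
```

Sizing the whole chain in one pass avoids the case where an upstream service is expanded but a downstream dependency is left as the bottleneck.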
4. Future Plans
Focus on stability (system robustness, instance health), usability (pre‑run simulations, auto‑tuning), expanded business solutions (link‑level scaling, dedicated zone scaling), and exploration of new cloud‑native technologies such as Knative and KEDA.
Author
Tu Yang, Head of Elastic Strategy Team, Meituan Infrastructure Department.
High Availability Architecture
Official account for High Availability Architecture.
