How Meituan Scaled Delivery Ops with Automated Monitoring and Full‑Link Testing
This article explains how Meituan's food delivery platform built an automated operations system—covering complex workflows, traffic spikes, rapid growth, pain‑point analysis, core goals, system architecture, and automation techniques such as anomaly detection, service‑protection triggers, and full‑link testing—to improve reliability and reduce manual effort.
Background
Meituan's food‑delivery business is unusual in the Internet industry: the workflow is complex—from user ordering and merchant acceptance through delivery‑person assignment to final delivery—and traffic concentrates heavily during lunch and dinner peaks. Since its launch in November 2013, the service has grown rapidly, reaching a peak of 16 million orders per day in less than four years. Manual troubleshooting alone cannot keep up at this scale, which prompted the need for an automated operations system.
Delivery Business Characteristics
Complex Business Process
The platform must complete the entire order cycle—user order → system dispatch to merchant → merchant prepares food → delivery → user receives the meal—within half an hour, while handling numerous data‑analysis, settlement, and contract interactions, resulting in high consistency and concurrency requirements.
Sharp Daily Traffic Spikes
Traffic surges dramatically at specific times each day, and promotional activities can push peak load to two or three times the normal lunch peak.
Rapid Business Growth
From launch in 2013 to October 2017, daily orders grew to over 16 million, with some services handling more than 12 billion data accesses per day and QPS approaching 400,000. Even a small incident during peak hours can cause substantial losses.
Problems to Solve
Four major pain points hinder developers:
Excessive alert noise in IM channels, requiring standardized, automated alert metrics and thresholds.
Multiple isolated monitoring systems requiring manual context switching.
Numerous, rapidly evolving degradation and rate‑limit switches, with capacity planning that requires full‑link stress testing.
Manual, experience‑based incident diagnosis that could be standardized and automated.
Core Goals
Automate operations to free developers from routine monitoring: detect, diagnose, and resolve issues with increasing accuracy, and eventually turn high‑confidence scenarios into fully automated actions.
Key System Architecture
Overall Architecture
The system consists of a Business Dashboard and a Core Link as entry points. When a metric anomaly is detected, the Core Link analyzes service health scores to pinpoint the root cause and suggests appropriate service‑protection plans. Continuous full‑link stress testing validates diagnosis and protection effectiveness, supporting capacity planning.
Business Dashboard
Provides real‑time business‑metric views and historical trends, supports automatic anomaly tagging, quick navigation to other monitoring systems, and permission‑controlled access. It also offers mobile access for on‑the‑go monitoring.
Core Link
Assigns health scores to service nodes based on weighted indicators (e.g., failure rate, TP99, error logs). When a problematic node is identified, detailed metrics are collected to diagnose issues such as disk or CPU problems. Feedback from developers refines scoring and diagnosis models, eventually enabling automatic protection triggers.
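The weighted scoring described above can be sketched as a simple weighted average; the indicator names, weights, and normalization are illustrative assumptions, not Meituan's actual model:

```python
# Hypothetical sketch of a node health score: each indicator is
# pre-normalized to [0, 1], where 1.0 means fully healthy (e.g.
# 1 - failure_rate, or observed TP99 measured against its SLO).
def health_score(indicators, weights):
    """Combine per-indicator scores into one weighted node score."""
    total_weight = sum(weights.values())
    return sum(indicators[name] * w for name, w in weights.items()) / total_weight

# Example weights (assumed): failure rate matters most, then TP99.
weights = {"failure_rate": 0.4, "tp99": 0.35, "error_logs": 0.25}
node = {"failure_rate": 0.98, "tp99": 0.90, "error_logs": 0.95}
score = health_score(node, weights)  # weighted average in [0, 1]
```

A node whose score drops below a tuned threshold would then be flagged for deeper metric collection, with developer feedback used to adjust the weights over time.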
Service Protection & Fault Drills
Implements various protection switches: degradation switches, rate‑limit switches, and Hystrix circuit breakers. Pre‑defined protection plans are automatically invoked when diagnosed anomalies match known scenarios, and fault‑drill exercises validate plan effectiveness.
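To make the circuit-breaker switch concrete, here is a minimal sketch in the spirit of Hystrix: open after consecutive failures, then half-open after a cooldown to probe recovery. Thresholds and timings are placeholder values, not the platform's real configuration:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures,
    half-opens after a cooldown so one probe request can test recovery."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: allow a probe once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

Degradation and rate-limit switches follow the same pattern: a small piece of state consulted on every request, flipped either manually or by the automated diagnosis described later.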
Full‑Link Stress Testing Integration
Regular full‑link tests simulate traffic and fault scenarios, verify protection plan activation, and generate automated reports, reducing manual coordination.
Automation Journey
Anomaly Auto‑Detection
Historical data is analyzed to compute baseline algorithms and confidence intervals for each metric, automatically identifying outliers and adjusting alert thresholds.
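A common baseline-plus-confidence-interval check can be sketched as follows; the article does not specify the exact algorithm, so this assumes a simple mean ± k·stdev band over recent history:

```python
import statistics

def is_anomalous(history, current, k=3.0):
    """Flag the current metric value as anomalous if it falls
    outside mean +/- k * stdev of the historical baseline."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    lower, upper = mean - k * stdev, mean + k * stdev
    return not (lower <= current <= upper)
```

In practice the baseline would be computed per time-of-day (to respect the lunch/dinner peaks) and the multiplier `k` tuned per metric from alert feedback.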
Automatic Service‑Protection Triggering
Diagnosed anomalies are linked to predefined protection plans, allowing automatic activation of degradation, rate‑limit, or circuit‑breaker mechanisms.
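The matching step can be sketched as a rule table from diagnosed anomaly type to a predefined plan, with a confidence gate deciding between automatic execution and a suggestion to the on-call engineer; all names and thresholds here are hypothetical:

```python
# Hypothetical rule table; anomaly types, actions, and parameters
# are illustrative stand-ins for the platform's real plans.
PROTECTION_PLANS = {
    "downstream_timeout": ("circuit_breaker", {"service": "settlement"}),
    "traffic_spike": ("rate_limit", {"qps": 300_000}),
    "noncritical_overload": ("degrade", {"feature": "order_history"}),
}

def trigger_protection(anomaly_type, confidence, threshold=0.9):
    """Auto-trigger only high-confidence matches; lower-confidence
    matches are suggested to a human, unknown anomalies escalate."""
    plan = PROTECTION_PLANS.get(anomaly_type)
    if plan is None:
        return ("escalate", None)
    action, _params = plan
    if confidence >= threshold:
        return ("auto", action)
    return ("suggest", action)
```

The confidence gate is what lets the system start conservative (suggest everything) and ratchet toward full automation as diagnosis accuracy improves.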
Testing Plan Automation
Stress‑test preparation (data masking, validation) is automated, fault scenarios are injected during replay, and protection plans are triggered accordingly, with end‑to‑end monitoring and report generation.
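The end-to-end drill can be sketched as a small orchestration loop; the callables and scenario names are hypothetical stand-ins for the platform's real replay, fault-injection, and protection-check hooks:

```python
def run_full_link_drill(scenarios, replay, inject_fault, plan_activated):
    """For each fault scenario: replay masked traffic, inject the
    fault, and record whether the expected protection plan fired."""
    report = {}
    for scenario, expected_plan in scenarios.items():
        replay(scenario)        # replay masked, validated traffic
        inject_fault(scenario)  # e.g. add latency, kill a dependency
        report[scenario] = plan_activated(scenario) == expected_plan
    return report
```

The returned report maps each scenario to pass/fail, which is the raw material for the automated reports the article mentions.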
Conclusion
Accurate root‑cause identification and diagnosis enable progressive automation of operational actions such as switch activation and capacity scaling, ultimately improving reliability and resource efficiency for Meituan’s delivery platform.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.