Automated Operations System for Meituan Delivery: Architecture, Monitoring, and Full‑Link Stress Testing
To handle more than 16 million daily orders and sharp traffic spikes, Meituan built an automated operations platform that combines a real‑time business dashboard, a core link with node health scoring, automated protection switches, and full‑link stress testing. Together these enable automatic anomaly detection, root‑cause diagnosis, capacity planning, and self‑remediation without manual intervention.
Meituan's food‑delivery business features a highly complex workflow (order → merchant → delivery) and experiences intense traffic spikes during lunch and dinner peaks. Since its launch in November 2013, daily orders have surged to over 16 million, generating massive data volumes (up to 120 billion accesses per day, QPS ~400 k). Manual troubleshooting is no longer sufficient, prompting the construction of an automated operations platform.
Business characteristics include:
Complex end‑to‑end process requiring sub‑half‑hour delivery.
Significant daily traffic surges, sometimes 2–3× the normal peak during promotional events.
Rapid growth: daily orders grew from zero at launch to 16 million within four years.
Key pain points for developers:
Overwhelming volume of alerts and the need to standardize thresholds.
Multiple isolated monitoring systems requiring manual cross‑checking.
Proliferation of degradation and rate‑limit switches that need validation and capacity planning.
Manual, experience‑based incident diagnosis that could be standardized and automated.
The core objective is to automate these operations, freeing developers from routine monitoring and enabling faster, more accurate incident resolution.
The system architecture consists of two parts, a Business Dashboard and a Core Link:
The Business Dashboard provides real‑time business‑level metrics, historical trends, and alert tagging. It supports mobile access, permission control, and predictive alerts using models such as Holt‑Winters. When an anomaly is detected, the system can automatically mark the event and guide developers to the relevant monitoring system.
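As a rough illustration of Holt‑Winters‑style predictive alerting, the sketch below runs additive triple exponential smoothing over a metric series and flags points that deviate too far from the one‑step forecast. The function name, the smoothing parameters, and the relative tolerance are all assumptions made for this example; the source does not describe how the dashboard configures its model.

```python
from statistics import mean


def detect_anomalies(series, season_len, alpha=0.5, beta=0.1, gamma=0.3,
                     rel_tolerance=0.3):
    """Flag indexes whose value deviates from an additive Holt-Winters forecast.

    All parameter defaults are illustrative assumptions, not values from the source.
    """
    if len(series) < 2 * season_len:
        raise ValueError("need at least two full seasons of history")

    first = series[:season_len]
    second = series[season_len:2 * season_len]
    level = mean(first)                                  # initial level
    trend = (mean(second) - mean(first)) / season_len    # initial per-step trend
    seasonal = [x - level for x in first]                # initial seasonal offsets

    flagged = []
    for t in range(season_len, len(series)):
        idx = t % season_len
        s = seasonal[idx]
        predicted = level + trend + s                    # one-step-ahead forecast
        observed = series[t]
        if abs(observed - predicted) > rel_tolerance * max(abs(predicted), 1.0):
            flagged.append(t)
            # Feed the forecast back in so the outlier does not distort the model.
            observed = predicted
        new_level = alpha * (observed - s) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        seasonal[idx] = gamma * (observed - new_level) + (1 - gamma) * s
        level = new_level
    return flagged


# Hypothetical order counts with a 4-point "day"; index 18 drops unexpectedly.
orders = [100, 200, 300, 200] * 6
orders[18] = 50
anomalies = detect_anomalies(orders, season_len=4)
```

Substituting the forecast for a flagged observation is one design choice among several; it keeps a single outage from shifting the baseline, at the cost of reacting slowly to a genuine level change.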
The Core Link focuses on service‑node health scoring, root‑cause diagnosis, and SOP‑driven remediation. Nodes are scored based on weighted indicators (e.g., failure rate, TP99, error logs). Once a problematic link is identified, detailed metrics are fetched to pinpoint the exact cause (e.g., disk or CPU issues).
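The weighted scoring idea can be sketched as follows. The weights, the per‑indicator limits, and the node names are hypothetical; the source only states that nodes are scored from weighted indicators such as failure rate, TP99, and error logs.

```python
# Hypothetical weights and "worst acceptable" limits for each indicator.
WEIGHTS = {"failure_rate": 0.5, "tp99_ms": 0.3, "error_logs_per_min": 0.2}
LIMITS = {"failure_rate": 0.05, "tp99_ms": 1000.0, "error_logs_per_min": 100.0}


def health_score(metrics):
    """Return a 0-100 score; 100 means every indicator is at zero."""
    penalty = 0.0
    for name, weight in WEIGHTS.items():
        # Normalize each indicator against its limit and cap it at 1.0.
        ratio = min(metrics[name] / LIMITS[name], 1.0)
        penalty += weight * ratio
    return round(100.0 * (1.0 - penalty), 1)


def worst_node(link):
    """Return the name of the lowest-scoring node on a link."""
    return min(link, key=lambda node: health_score(link[node]))


# Hypothetical nodes on the order link.
link = {
    "order-api": {"failure_rate": 0.0, "tp99_ms": 0.0, "error_logs_per_min": 0.0},
    "merchant-svc": {"failure_rate": 0.025, "tp99_ms": 500.0, "error_logs_per_min": 50.0},
    "delivery-svc": {"failure_rate": 0.05, "tp99_ms": 1000.0, "error_logs_per_min": 100.0},
}
```

Once the lowest‑scoring node is known, a diagnosis step would fetch that node's detailed metrics (disk, CPU, and so on) to narrow down the cause.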
The Service Protection & Fault‑Drill module integrates three types of protection switches:
Degradation switches (thousands of them in code).
Rate‑limit switches (per‑machine, cluster‑level, or custom scenarios).
Hystrix automatic circuit‑breakers.
Pre‑defined protection plans are triggered automatically when diagnostic models identify an anomaly. Fault‑drill exercises simulate failures (e.g., Tair outage) to validate the effectiveness of these plans.
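One way to picture a pre‑defined protection plan is as a mapping from a diagnosis label to the switches it should flip. The sketch below assumes this shape; the `SwitchBoard` class, the plan table, and every switch name in it are hypothetical and are not taken from Meituan's system.

```python
from dataclasses import dataclass, field


@dataclass
class SwitchBoard:
    """In-memory stand-in for a switch configuration service (hypothetical)."""
    state: dict = field(default_factory=dict)

    def set(self, name, value):
        self.state[name] = value


# Hypothetical plans: diagnosis label -> list of (switch name, value) actions.
PLANS = {
    "cache_outage": [
        ("degrade.skip_cache_read", True),      # degradation switch
        ("ratelimit.order_query_qps", 500),     # rate-limit switch
    ],
    "db_slow_query": [
        ("degrade.hide_order_history", True),
    ],
}


def apply_plan(diagnosis, switches, plans=PLANS):
    """Apply the plan for a diagnosis and return the switch names that were set."""
    actions = plans.get(diagnosis, [])
    for name, value in actions:
        switches.set(name, value)
    return [name for name, _ in actions]


board = SwitchBoard()
applied = apply_plan("cache_outage", board)
```

Under this model, a fault drill amounts to injecting a diagnosis label (for example, a simulated cache outage) and checking that the expected switches end up in the expected state.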
Full‑Link Stress Testing is conducted regularly. Test traffic is isolated from production data, and scenarios are replayed to verify protection mechanisms and capacity planning. Automation reduces manual coordination, constructs test data, runs fault simulations, collects logs, and generates comprehensive reports.
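The source says only that test traffic is isolated from production data. A common way to achieve this, assumed here purely for illustration, is to tag stress‑test requests and route their writes to shadow storage. The `Request` and `OrderStore` classes and the table names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    order_id: str
    is_stress_test: bool = False   # tag assumed to be propagated along the link


class OrderStore:
    """Toy store that keeps production and stress-test rows in separate tables."""

    def __init__(self):
        self.tables = {"orders": {}, "orders_shadow": {}}

    def table_for(self, request):
        # Stress-test writes never touch the production table.
        return "orders_shadow" if request.is_stress_test else "orders"

    def save(self, request, payload):
        self.tables[self.table_for(request)][request.order_id] = payload


store = OrderStore()
store.save(Request("o-1"), {"amount": 30})
store.save(Request("t-1", is_stress_test=True), {"amount": 30})
```

The important property is that the tag travels with the request through every service on the link, so that each storage layer can make the same routing decision.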
The automation roadmap includes:
Automatic anomaly detection using baseline algorithms and confidence intervals.
Automatic triggering of protection actions based on diagnosed anomalies.
Automation of stress‑test planning, data preparation, scenario execution, and result reporting.
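The first roadmap item, baseline plus confidence interval, can be sketched with a simple mean and standard deviation band. This is a minimal illustration under assumed choices (a baseline built from the same time slot on previous days and a band of three standard deviations); the source does not specify the actual algorithm.

```python
from statistics import mean, stdev


def confidence_band(history, z=3.0):
    """Return (low, high) bounds around the baseline mean.

    `history` is assumed to hold the metric at the same time slot on earlier days.
    """
    mu = mean(history)
    sigma = stdev(history)          # sample standard deviation
    return mu - z * sigma, mu + z * sigma


def is_anomalous(value, history, z=3.0):
    """True when the value falls outside the confidence band."""
    low, high = confidence_band(history, z)
    return value < low or value > high


# Hypothetical orders-per-minute at 12:00 over the previous five days.
baseline = [100, 102, 98, 101, 99]
```

With this baseline, a reading of 100 stays inside the band while a reading of 60 falls outside it and would be handed to the diagnosis step.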
In conclusion, the platform continuously refines root‑cause detection, capacity planning, and automated remediation, aiming to detect any abnormal dimension, predict its impact on business metrics, and automatically execute appropriate protection measures.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
