
How Meituan Scaled Delivery Ops with Automated Monitoring and Full‑Link Testing

This article explains how Meituan's food delivery platform built an automated operations system—covering complex workflows, traffic spikes, rapid growth, pain‑point analysis, core goals, system architecture, and automation techniques such as anomaly detection, service‑protection triggers, and full‑link testing—to improve reliability and reduce manual effort.


Background

Meituan's food‑delivery business is unique in the Internet industry: the workflow is complex—from user order, merchant acceptance, delivery‑person assignment to final delivery—and traffic concentrates heavily during lunch and dinner peaks. Since its launch in November 2013, the service has grown rapidly, reaching a peak of 16 million orders per day in less than four years. Manual troubleshooting alone cannot keep up with this scale, prompting the need for an automated operations system.

Delivery Business Characteristics

Complex Business Process

The platform must complete the entire order cycle—user order → system dispatch to merchant → merchant prepares food → delivery → user receives the meal—within half an hour, while handling numerous data‑analysis, settlement, and contract interactions, resulting in high consistency and concurrency requirements.

Sharp Daily Traffic Spikes

Traffic surges dramatically during the lunch and dinner windows each day, and promotional activities can push peak load to two or three times the normal lunch peak.

Rapid Business Growth

Between the 2013 launch and October 2017, daily orders grew to over 16 million, with some services handling more than 12 billion data accesses per day at a peak QPS approaching 400,000. At that scale, even a small incident during peak hours can cause substantial loss.

Problems to Solve

Four major pain points hinder developers:

Excessive alert noise in IM channels; alert metrics and thresholds need to be standardized and automated.

Multiple isolated monitoring systems requiring manual context switching.

Numerous degradation and rate‑limit switches that evolve too rapidly to track manually, plus capacity planning that requires full‑link stress testing.

Manual, experience‑based incident diagnosis that could be standardized and automated.

Core Goals

Automate operations to free developers from routine monitoring, enabling a workflow of detecting, diagnosing, and resolving issues with increasing accuracy, eventually turning high‑confidence scenarios into fully automated actions.

Key System Architecture

Overall Architecture

The system consists of a Business Dashboard and a Core Link as entry points. When a metric anomaly is detected, the Core Link analyzes service health scores to pinpoint the root cause and suggests appropriate service‑protection plans. Continuous full‑link stress testing validates diagnosis and protection effectiveness, supporting capacity planning.

Business Dashboard

Provides real‑time business‑metric views and historical trends, supports automatic anomaly tagging, quick navigation to other monitoring systems, and permission‑controlled access. It also offers mobile access for on‑the‑go monitoring.

Core Link

Assigns health scores to service nodes based on weighted indicators (e.g., failure rate, TP99, error logs). When a problematic node is identified, detailed metrics are collected to diagnose issues such as disk or CPU problems. Feedback from developers refines scoring and diagnosis models, eventually enabling automatic protection triggers.
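The weighted-indicator scoring described above can be sketched as follows. This is a minimal illustration, not Meituan's implementation: the indicator names, "worst tolerated" thresholds, and weights are all assumptions chosen for the example.

```python
# Hypothetical weighted health score for a service node: each indicator is
# normalized so that 0.0 is healthy and 1.0 is at its worst tolerated level,
# then the weighted penalties are combined into a 0-100 score.

def health_score(metrics, weights=None):
    # Assumed thresholds: a value at or beyond this counts as fully unhealthy.
    worst = {"failure_rate": 0.05, "tp99_ms": 1000.0, "error_logs_per_min": 200.0}
    # Assumed weights; in practice these would be tuned from developer feedback.
    weights = weights or {"failure_rate": 0.5, "tp99_ms": 0.3, "error_logs_per_min": 0.2}
    penalty = 0.0
    for name, weight in weights.items():
        normalized = min(metrics.get(name, 0.0) / worst[name], 1.0)
        penalty += weight * normalized
    return round(100 * (1 - penalty), 1)

healthy = health_score({"failure_rate": 0.001, "tp99_ms": 80, "error_logs_per_min": 2})
degraded = health_score({"failure_rate": 0.04, "tp99_ms": 900, "error_logs_per_min": 150})
```

Scores like these make nodes directly comparable, so the Core Link can rank candidates when pinpointing a root cause.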

Service Protection & Fault Drills

Implements various protection switches: degradation switches, rate‑limit switches, and Hystrix circuit breakers. Pre‑defined protection plans are automatically invoked when diagnosed anomalies match known scenarios, and fault‑drill exercises validate plan effectiveness.
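As a sketch of the circuit-breaker style of protection switch mentioned above (Hystrix is a Java library; this is a simplified Python analogue, and the thresholds and fallback behavior are illustrative assumptions):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after enough consecutive failures the circuit
    opens and calls are short-circuited to a fallback until a reset timeout."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                return fallback()          # open: short-circuit to the fallback
            self.opened_at = None          # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # trip the breaker
            return fallback()
        self.failures = 0                      # success resets the counter
        return result
```

Degradation and rate-limit switches follow the same pattern: a small piece of state, toggled manually or by a diagnosis, that reroutes calls away from a struggling dependency.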

Full‑Link Stress Testing Integration

Regular full‑link tests simulate traffic and fault scenarios, verify protection plan activation, and generate automated reports, reducing manual coordination.

Automation Journey

Anomaly Auto‑Detection

Historical data is analyzed to compute baseline algorithms and confidence intervals for each metric, automatically identifying outliers and adjusting alert thresholds.
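A minimal sketch of this baseline-plus-confidence-interval idea, assuming a simple mean/standard-deviation band over historical values for the same time slot (the 3-sigma width and the sample values are illustrative, not Meituan's algorithm):

```python
import statistics

def is_anomalous(history, current, k=3.0):
    """Flag `current` if it falls outside baseline +/- k standard deviations,
    where the baseline is built from the same time slot on previous days."""
    baseline = statistics.mean(history)
    spread = statistics.stdev(history)
    lower, upper = baseline - k * spread, baseline + k * spread
    return not (lower <= current <= upper)

# Hypothetical lunch-peak order counts for the same slot on five prior days.
lunch_peaks = [15800, 16100, 15950, 16200, 16050]
```

Because the band is recomputed from recent history, the effective alert threshold adjusts itself as the business grows, instead of being hand-tuned.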

Automatic Service‑Protection Triggering

Diagnosed anomalies are linked to predefined protection plans, allowing automatic activation of degradation, rate‑limit, or circuit‑breaker mechanisms.
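The linkage between a diagnosis and its plan can be as simple as a lookup table gated by confidence, echoing the article's goal of automating only high-confidence scenarios. The scenario names, actions, and threshold below are hypothetical:

```python
# Hypothetical mapping from a diagnosed scenario to a predefined protection plan.
PROTECTION_PLANS = {
    "downstream_timeout": {"action": "circuit_break", "target": "settlement-service"},
    "traffic_spike": {"action": "rate_limit", "qps_cap": 300_000},
    "noncritical_overload": {"action": "degrade", "feature": "order-history"},
}

def trigger_protection(diagnosis, confidence, threshold=0.9):
    """Auto-execute only when the diagnosis is a known scenario with high
    confidence; otherwise defer to a human on-call engineer."""
    plan = PROTECTION_PLANS.get(diagnosis)
    if plan is None or confidence < threshold:
        return {"executed": False, "plan": plan}  # page an engineer instead
    return {"executed": True, "plan": plan}
```

Keeping low-confidence cases on the manual path is what lets the automation expand gradually as the diagnosis models improve.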

Testing Plan Automation

Stress‑test preparation (data masking, validation) is automated, fault scenarios are injected during replay, and protection plans are triggered accordingly, with end‑to‑end monitoring and report generation.
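The data-masking step in that preparation pipeline might look like the sketch below: sensitive fields are replaced with deterministic pseudonyms so that replayed records still join consistently across services. The field names are hypothetical, and real masking would follow the platform's own privacy rules:

```python
import hashlib

# Hypothetical set of fields that must not appear in stress-test traffic.
SENSITIVE_FIELDS = {"user_phone", "user_name", "address"}

def mask_order(order):
    """Replace sensitive values with short, deterministic pseudonyms; the same
    input always maps to the same pseudonym, preserving joins between services."""
    masked = {}
    for key, value in order.items():
        if key in SENSITIVE_FIELDS:
            digest = hashlib.sha256(str(value).encode()).hexdigest()[:12]
            masked[key] = f"masked_{digest}"
        else:
            masked[key] = value
    return masked
```

Determinism matters here: if each service masked the same phone number differently, end-to-end validation of a replayed order would break.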

Conclusion

Accurate root‑cause identification and diagnosis enable progressive automation of operational actions such as switch activation and capacity scaling, ultimately improving reliability and resource efficiency for Meituan’s delivery platform.

Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
