High Availability Architecture for Meituan Waimai Mobile Client
Meituan Waimai’s mobile client employs a high‑availability architecture built on loosely‑coupled teams, comprehensive monitoring, encrypted logging, multi‑layer disaster recovery, gray‑release strategies, and an incident‑response workflow, enabling rapid detection and resolution of failures while supporting 20 million daily orders.
Meituan Waimai has grown into the world’s largest food‑delivery platform, handling over 20 million orders per day. Rapid business growth demands a highly stable client system, prompting the construction of a comprehensive high‑availability (HA) framework for the mobile app.
HA Design Philosophy – Independent small teams own clearly defined responsibilities, adopting a loosely‑coupled architecture that isolates module changes and improves both flexibility and robustness.
The overall client architecture centers on the transaction chain (store recall → product display → order). Independent operational units are kept simple to ensure reliability while allowing continuous feature iteration.
The lifecycle of an issue is divided into three stages: discovery, localization (pinpointing the root cause), and resolution. Continuous investment in these three stages forms the core of the HA system.
Monitoring & Alerting – Metrics are collected in three categories: business stability, basic capability stability, and performance stability. Baselines and business‑specific models drive minute‑level alerts, delivered via email, IM, or SMS. The alerting rules follow SRE principles: they are kept simple and reliable, and stale rules are cleaned up regularly.
“The rules that catch real incidents most often should be as simple, predictable, and reliable as possible.”
Since mid‑2017, over 20 critical incidents (e.g., crawler spikes, traffic surges, carrier 403 errors) have been detected and resolved using this monitoring framework.
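As a rough illustration of baseline-driven, minute-level alerting, the sketch below compares the current minute's value of a metric against a baseline built from recent minutes. The function name, the 3-sigma band, and the 20% minimum drop are all assumptions made for this example; the article does not describe the actual models Meituan uses.

```python
from statistics import mean, stdev


def should_alert(baseline, current, sigma=3.0, min_drop=0.2):
    """Return True when `current` falls well below the recent baseline.

    baseline: recent per-minute values of one metric (hypothetical input)
    current:  the value for the minute being checked
    sigma, min_drop: assumed thresholds, not values from the article
    """
    if len(baseline) < 2:
        return False  # not enough history to form a baseline
    mu = mean(baseline)
    sd = stdev(baseline)
    # Require both a statistical deviation and a meaningful relative drop,
    # so that small noise on a very stable metric does not fire an alert.
    below_band = current < mu - sigma * sd
    big_drop = mu > 0 and (mu - current) / mu >= min_drop
    return below_band and big_drop


orders_per_minute = [980, 1010, 995, 1005, 990, 1000]
print(should_alert(orders_per_minute, 998))  # False: within normal range
print(should_alert(orders_per_minute, 600))  # True: sharp drop in orders
```

Combining two conditions keeps the rule simple while reducing false positives, which is in the spirit of the SRE guidance quoted above.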
Logging System – Three log types are maintained: full‑volume logs (overall health), individual logs (user‑specific issues), and exception logs (crashes, share failures). The Logan library encrypts logs and stores them locally on the device; they are uploaded only when needed for analysis, which protects user privacy.
Disaster Recovery – Three mechanisms are employed:
Degradation: non‑core services can be gracefully downgraded via configuration pushes, protecting core transaction paths.
Backup: multiple network channels (Shark, HTTP, HTTPS, HTTP‑DNS) provide redundancy; automatic city‑level channel switching occurs within minutes of degradation.
Rate Limiting: real‑time traffic control prevents overload, with multi‑level policies (captcha, queue, drop).
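The three mechanisms above can be sketched in a few lines each. This is a simplified illustration only: the class and function names, the channel ordering, and the load thresholds are hypothetical, and the article does not specify how Meituan implements any of them.

```python
class DegradeSwitch:
    """Holds pushed configuration listing non-core features to switch off."""

    def __init__(self):
        self._disabled = set()

    def apply_push(self, disabled_features):
        # A config push replaces the current set of degraded features.
        self._disabled = set(disabled_features)

    def enabled(self, feature):
        return feature not in self._disabled


def send_with_fallback(request, channels):
    """Try each (name, send) channel in order; return the first success."""
    failed = []
    for name, send in channels:
        try:
            return name, send(request)
        except ConnectionError:
            failed.append(name)  # remember the failure, try the next channel
    raise ConnectionError(f"all channels failed: {failed}")


def rate_limit_action(load_ratio):
    """Map load (requests / capacity) to an escalating policy.

    The thresholds are assumed values for illustration.
    """
    if load_ratio < 1.0:
        return "allow"
    if load_ratio < 1.5:
        return "captcha"
    if load_ratio < 2.0:
        return "queue"
    return "drop"


def long_connection(_request):
    raise ConnectionError("long connection unavailable")


def https(request):
    return f"ok:{request}"


switch = DegradeSwitch()
switch.apply_push(["recommendations"])
print(switch.enabled("recommendations"), switch.enabled("checkout"))
print(send_with_fallback("order", [("shark", long_connection), ("https", https)]))
print([rate_limit_action(x) for x in (0.5, 1.2, 1.8, 3.0)])
```

Keeping each mechanism behind a small, independent interface means the core order path only needs to ask "is this feature enabled?", "which channel succeeded?", or "what should happen to this request?", without knowing how the decision was made.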
Release Strategy – Version and feature gray releases are used to limit risk. iOS follows Apple’s phased rollout; Android uses Meituan’s EVA package manager. Feature flags are scoped by city, user ID, etc., with separate test and production environments and defined rollback procedures.
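A feature gray release scoped by city and user ID might be evaluated along the following lines. This is a minimal sketch under stated assumptions: the rule fields (`cities`, `allow_users`, `percent`, `rolled_back`) and the hash-based bucketing are invented for illustration and are not taken from Meituan's system.

```python
import hashlib


def bucket(user_id, feature, buckets=100):
    """Deterministically map a user to a bucket in [0, buckets) per feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets


def flag_enabled(rule, user_id, city):
    """Decide whether a gray-released feature is on for this user.

    rule is a hypothetical dict with keys: name, cities, allow_users,
    percent, rolled_back.
    """
    if rule.get("rolled_back"):
        return False  # rollback overrides every other setting
    if user_id in rule.get("allow_users", ()):
        return True  # explicitly listed users always get the feature
    cities = rule.get("cities")
    if cities is not None and city not in cities:
        return False  # feature is scoped to specific cities
    return bucket(user_id, rule["name"]) < rule.get("percent", 0)


rule = {"name": "new_cart", "cities": {"Beijing"}, "percent": 100,
        "allow_users": {"tester-1"}, "rolled_back": False}
print(flag_enabled(rule, "user-42", "Beijing"))    # True: city in scope, 100%
print(flag_enabled(rule, "user-42", "Shanghai"))   # False: city out of scope
print(flag_enabled(rule, "tester-1", "Shanghai"))  # True: allow-listed user
```

Hashing the feature name together with the user ID keeps each user's assignment stable across sessions while avoiding the same users always landing in the early cohort for every feature.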
Online Operations – A complete incident‑response workflow (discover → locate → resolve → prevent) is practiced, with regular drills to reduce mean time to recovery.
Future Outlook – The team aims to build an intelligent O&M system that automatically detects anomalies, triggers remediation playbooks, and further shortens recovery time, extending SRE practices to the mobile front‑end.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
