
Meituan Instant Delivery: Evolution of Distributed System Architecture and Operational Practices

The article details Meituan's five‑year journey in instant logistics, describing how its distributed high‑concurrency architecture, AI‑driven optimization, micro‑service evolution, and reliability engineering practices have been continuously refined to achieve low latency, high availability, and cost‑effective scaling.

IT Architects Alliance

Background

Meituan's instant delivery business has been running for five years, and its exploration of instant logistics for more than three. Over that time the business grew from zero to a sizable scale, and the team accumulated experience in building distributed high-concurrency systems. Two takeaways stand out: the instant logistics business tolerates almost no failures or high latency, and it relies heavily on AI to improve cost, efficiency, and user experience.

Instant logistics has very little tolerance for faults or latency, so it demands a distributed, scalable, and fault-tolerant architecture whose ultimate goal is to eliminate system downtime.

By integrating AI across pricing, ETA estimation, dispatch, capacity planning, subsidies, accounting, voice interaction, LBS mining, operations, and monitoring, the system can grow in scale while preserving user experience and reducing cost.

The article introduces the technical obstacles and challenges encountered during the layered evolution of Meituan's instant logistics distributed system architecture:

Massive order and rider scale leading to ultra‑large‑scale matching computations.

Holiday or severe weather spikes causing traffic peaks many times higher than normal.

Logistics fulfillment is the critical link between online and offline, with near‑zero fault tolerance and strict availability requirements.

High demands for real‑time and accurate data, making the system extremely sensitive to latency and anomalies.

Meituan Instant Delivery Architecture

Meituan's instant delivery platform focuses on three aspects: (1) providing SLA guarantees such as ETA calculation and delivery fee pricing; (2) matching the most suitable rider under multi‑objective (cost, efficiency, experience) optimization; (3) offering riders decision‑support tools like intelligent voice, route recommendation, and store‑arrival reminders.
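As a rough illustration of matching under multiple objectives, the sketch below scores each candidate rider with a weighted sum of cost, delivery time, and lateness risk, then picks the lowest score. The `Candidate` fields, the weights, and the 60-minute normalization are all hypothetical choices made for this example; the article does not describe Meituan's actual scoring model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    rider_id: str
    cost: float         # estimated delivery cost, normalized to [0, 1]
    eta_minutes: float  # estimated time to complete the delivery
    late_risk: float    # estimated probability of missing the promised time


def score(c, w_cost=0.3, w_eta=0.4, w_risk=0.3, max_eta=60.0):
    """Weighted sum of the three objectives; lower is better.

    The weights are illustrative assumptions, not values from the article.
    """
    eta_norm = min(c.eta_minutes / max_eta, 1.0)
    return w_cost * c.cost + w_eta * eta_norm + w_risk * c.late_risk


def pick_rider(candidates):
    """Return the candidate with the lowest combined score."""
    return min(candidates, key=score)


candidates = [
    Candidate("r1", cost=0.2, eta_minutes=30, late_risk=0.5),
    Candidate("r2", cost=0.4, eta_minutes=18, late_risk=0.1),
]
best = pick_rider(candidates)  # r2: costlier, but faster and less risky
```

A single weighted sum is the simplest way to trade objectives against each other; a production dispatcher would more likely solve a global assignment across many orders and riders at once.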

The underlying technology stack relies on Meituan's public components and services to achieve partition scaling, disaster recovery, and monitoring. Front‑end traffic is load‑balanced by HLB; services communicate within partitions via OCTO for registration, discovery, load balancing, fault tolerance, and gray releases, while message queues (Kafka, RabbitMQ) are also used. Distributed storage is accessed through Zebra, monitoring through the open‑source CAT system, caching via Squirrel + Cellar, and task scheduling by Crane.

Key challenges addressed include cluster scalability (especially for stateful clusters), resource hotspots, and uneven CPU usage.

To improve scalability, the team transformed stateful nodes into stateless ones and leveraged parallel computation to distribute load across smaller nodes, enabling rapid expansion.
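The idea of stateless workers plus parallel computation can be sketched as follows. This is a minimal example under stated assumptions: orders are dictionaries with hypothetical `order_id` and `region_id` fields, the matching function is a placeholder, and a thread pool stands in for a fleet of small nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(orders, n_shards):
    """Split orders into shards by region so each shard can run anywhere."""
    shards = defaultdict(list)
    for order in orders:
        shards[order["region_id"] % n_shards].append(order)
    return list(shards.values())


def match_shard(shard):
    """Stateless worker: the result depends only on the input shard.

    The rider assignment here is a placeholder, not a real matching algorithm.
    """
    return {o["order_id"]: f"rider-{o['region_id']}" for o in shard}


def match_all(orders, n_shards=4):
    """Fan shards out to workers and merge the partial results."""
    result = {}
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        for partial in pool.map(match_shard, partition(orders, n_shards)):
            result.update(partial)
    return result


orders = [{"order_id": i, "region_id": i % 7} for i in range(20)]
assignments = match_all(orders)
```

Because `match_shard` keeps no state between calls, adding capacity only means raising the shard count and the number of workers.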

Consistency issues between database writes and cache updates are solved with Databus, a high‑availability, low‑latency, high‑concurrency data‑change transmission system that captures binlog changes and propagates them to Elasticsearch, other databases, or KV stores.
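The sketch below shows one way a downstream cache could consume such change events without going out of sync. It does not use Databus's real API, which the article does not describe; `ChangeEvent`, `CacheSink`, and the `position` field (a stand-in for a binlog offset) are hypothetical names for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    position: int               # increasing log offset (assumed)
    key: str                    # cache key derived from the changed row
    op: str                     # "upsert" or "delete"
    row: Optional[dict] = None  # new row contents for an upsert


class CacheSink:
    """Applies change events to an in-memory cache idempotently."""

    def __init__(self):
        self.store = {}
        self.applied = {}  # key -> position of the last applied event

    def apply(self, ev):
        # Skip duplicates and out-of-order replays so retries are safe.
        if ev.position <= self.applied.get(ev.key, -1):
            return False
        if ev.op == "delete":
            self.store.pop(ev.key, None)
        else:
            self.store[ev.key] = ev.row
        self.applied[ev.key] = ev.position
        return True


sink = CacheSink()
sink.apply(ChangeEvent(1, "order:1", "upsert", {"status": "new"}))
sink.apply(ChangeEvent(2, "order:1", "upsert", {"status": "delivered"}))
replayed = sink.apply(ChangeEvent(1, "order:1", "upsert", {"status": "new"}))
```

Driving the cache from the change log, rather than from a second write in application code, means the database remains the single source of truth and a replayed event cannot overwrite newer data.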

High availability is ensured through three pillars, one for each phase of an incident. Before an incident: full-link stress testing, capacity estimation, periodic health checks, and random fault drills (service, machine, component). During an incident: real-time anomaly alerts (performance, business metrics, availability) and rapid fault localization (single-machine, cluster, IDC, component, service). After an incident: rollback, throttling, circuit-breaking, degradation, and fallback mechanisms.
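Circuit-breaking with a fallback is a standard pattern and can be sketched in a few lines. The thresholds below and the injectable `clock` parameter are assumptions made for illustration; the article does not say how Meituan's components implement it.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency and serves a fallback instead."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: do not touch the dependency
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # a success closes the circuit again
        return result
```

A caller would wrap a remote call such as an ETA lookup in `breaker.call(fetch_eta, cached_eta)`, where both function names are placeholders; while the circuit is open, the degraded answer is returned immediately instead of waiting on a timeout.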

Single‑IDC Rapid Deployment & Disaster Recovery

After a single‑IDC failure, entry services detect the fault and automatically switch traffic; rapid IDC expansion synchronizes data and pre‑deploys services, opening traffic only when ready. All data‑sync and traffic‑distribution services must support automatic fault detection and removal, enabling IDC‑level scaling.
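The two behaviors described here, removing a failed IDC automatically and opening a new IDC to traffic only once it is ready, can be sketched as a small router. The class, the probe threshold, and the modulo-based routing are hypothetical simplifications for this example, not the entry service's real logic.

```python
class IdcRouter:
    """Routes requests only to IDCs that are both ready and healthy."""

    def __init__(self, fail_threshold=3):
        self.fail_threshold = fail_threshold
        self.idcs = {}  # name -> {"ready": bool, "fails": int}

    def add_idc(self, name, ready=False):
        # A new IDC starts without traffic while data sync and deploys run.
        self.idcs[name] = {"ready": ready, "fails": 0}

    def mark_ready(self, name):
        self.idcs[name]["ready"] = True

    def report_probe(self, name, ok):
        # Count consecutive failed health probes; one success resets it.
        state = self.idcs[name]
        state["fails"] = 0 if ok else state["fails"] + 1

    def serving(self):
        return sorted(
            name for name, s in self.idcs.items()
            if s["ready"] and s["fails"] < self.fail_threshold
        )

    def route(self, request_id):
        live = self.serving()
        if not live:
            raise RuntimeError("no healthy IDC available")
        return live[request_id % len(live)]


router = IdcRouter()
router.add_idc("idc-a", ready=True)
router.add_idc("idc-b", ready=True)
router.add_idc("idc-c")  # being provisioned, not yet serving
for _ in range(3):
    router.report_probe("idc-a", ok=False)  # idc-a drops out of rotation
router.mark_ready("idc-c")
```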

Multi‑Center Attempts

Meituan groups multiple IDC partitions into virtual centers; services are deployed uniformly across a center. When a center reaches capacity, new IDC units are added to expand.

Unit‑Based Attempts

Compared with the multi-center approach, unit-based design offers finer-grained partitioning for disaster recovery and scaling. Traffic is routed to a unit (SET) based on regional or city characteristics. Data synchronization across locations may lag, and SET-level disaster recovery allows rapid failover from a failed SET to other SETs.
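City-based routing with failover between SETs might look like the sketch below. The city names, SET names, and failover order are invented for illustration; the article does not list the real mapping.

```python
# Hypothetical mapping of cities to their home SET.
CITY_TO_SET = {
    "beijing": "set-north",
    "shanghai": "set-east",
    "guangzhou": "set-south",
}

# Hypothetical failover order: where traffic goes when a SET is down.
FAILOVER = {
    "set-north": "set-east",
    "set-east": "set-south",
    "set-south": "set-north",
}


def route_to_set(city, healthy_sets, default_set="set-east"):
    """Return the home SET for a city, or the next healthy SET in the chain."""
    candidate = CITY_TO_SET.get(city, default_set)
    for _ in range(len(FAILOVER)):
        if candidate in healthy_sets:
            return candidate
        candidate = FAILOVER[candidate]
    raise RuntimeError("no healthy SET available")


home = route_to_set("beijing", {"set-north", "set-east", "set-south"})
failed_over = route_to_set("beijing", {"set-east", "set-south"})
```

Because cross-location data synchronization can lag, a real failover also has to decide how to treat writes that had not yet replicated when the home SET went down; this sketch covers only the routing decision.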

Core Intelligent Logistics Technologies and Platform Accumulation

The machine‑learning platform provides an end‑to‑end solution for model training and algorithm deployment, addressing the challenges of diverse algorithmic scenarios, duplicated effort, and inconsistent data quality between online and offline environments.

JARVIS is an AIOps platform focused on stability, designed to filter massive duplicate alerts, surface actionable information, and accelerate fault analysis and resolution in large‑scale distributed clusters.
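Filtering duplicate alerts can be illustrated with a simple fingerprint-and-window approach. This is only a sketch of the general technique, not how JARVIS works internally; the `Alert` fields and the 300-second window are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    ts: float      # seconds since some epoch
    service: str
    metric: str
    message: str


def dedupe(alerts, window=300.0):
    """Collapse repeats of the same (service, metric) within `window` seconds."""
    groups = []      # one entry per distinct incident, in time order
    open_group = {}  # fingerprint -> most recent group for that fingerprint
    for alert in sorted(alerts, key=lambda a: a.ts):
        fingerprint = (alert.service, alert.metric)
        group = open_group.get(fingerprint)
        if group is not None and alert.ts - group["last_ts"] <= window:
            group["count"] += 1
            group["last_ts"] = alert.ts
        else:
            group = {"first": alert, "last_ts": alert.ts, "count": 1}
            open_group[fingerprint] = group
            groups.append(group)
    return groups


alerts = [
    Alert(0, "dispatch", "latency", "p99 high"),
    Alert(60, "dispatch", "latency", "p99 high"),
    Alert(120, "dispatch", "error_rate", "5xx spike"),
    Alert(1000, "dispatch", "latency", "p99 high"),
]
groups = dedupe(alerts)
```

Here four raw alerts collapse into three groups: the two latency alerts one minute apart merge, while the latency alert at 1000 seconds starts a new group because it falls outside the window.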

Future Challenges

Future challenges include the ballooning of micro‑services as business complexity grows, network amplification effects caused by minor latency in mesh‑structured service clusters, rapid fault localization in complex topologies, and the shift from cluster‑level to unit‑level operations after unitization, which raises deployment difficulties.
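The amplification effect can be made concrete with a little arithmetic. Assuming, for illustration only, that each downstream call is independently slow with probability p, a request that touches n services is slowed with probability 1 - (1 - p)^n. The function name and the sample numbers below are hypothetical.

```python
def slow_request_probability(p_slow, fan_out):
    """Chance that at least one of `fan_out` independent calls is slow."""
    return 1.0 - (1.0 - p_slow) ** fan_out


# A 1% slow rate per service is barely visible with one hop,
# but dominates once a request touches many services.
one_hop = slow_request_probability(0.01, 1)
hundred_hops = slow_request_probability(0.01, 100)  # roughly 0.63
```

Under these assumptions, a per-service slow rate of 1% turns into roughly 63% of requests being affected at a fan-out of 100, which is why small latencies matter more as the number of micro-services grows.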

Author Biography

Song Bin, senior technical expert at Meituan, leads the backend of the instant logistics team. He has been involved in distributed system architecture and high‑concurrency stability since 2013, focusing recently on AIOps for large‑scale distributed systems.

Tags: distributed systems, microservices, operations, scalability, high availability, Meituan, instant delivery