Evolution of Meituan Instant Logistics Distributed Architecture and Operational Practices
The article describes Meituan's five‑year journey in instant logistics, detailing the challenges of massive order‑rider matching, high traffic spikes, ultra‑low latency requirements, and how a layered, micro‑service‑based distributed architecture combined with AI techniques was progressively adopted to achieve scalability, reliability, and cost efficiency.
Background
Meituan's food delivery service has been operating for five years, and its instant logistics platform has been in development for more than three, giving the team experience in building distributed, high‑concurrency systems. Two takeaways stand out. First, the business tolerates almost no failures or high latency, which demands distributed, scalable, and fault‑tolerant systems. Second, the logistics system integrates AI heavily across pricing, ETA, dispatch, capacity planning, subsidies, accounting, voice interaction, LBS mining, operations, and monitoring to grow scale, preserve user experience, and reduce cost.
The article introduces the technical obstacles and challenges encountered during the layered evolution of Meituan's instant logistics distributed architecture:
Massive order and rider scale leading to ultra‑large‑scale matching computations.
Holiday or adverse weather causing traffic spikes many times higher than normal.
Logistics fulfillment is the critical link between online and offline, with near‑zero tolerance for downtime or lost orders.
Stringent real‑time data accuracy and latency requirements.
Meituan Instant Logistics Architecture
The platform focuses on three aspects: providing SLA to users (ETA calculation and delivery fee pricing), matching the most suitable rider under multi‑objective (cost, efficiency, experience) optimization, and offering riders decision‑support tools such as intelligent voice, route recommendation, and store arrival reminders.
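The article does not say how the multi‑objective matching is computed. As a minimal sketch, assuming each objective can be expressed as a penalty where lower is better, a weighted sum is one simple way to rank candidate riders. The `Candidate` fields, the weights, and `pick_rider` are all hypothetical names introduced for illustration, not Meituan's dispatch algorithm.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    rider_id: str
    cost: float         # hypothetical delivery cost, lower is better
    eta_minutes: float  # hypothetical estimated delivery time, lower is better
    experience: float   # hypothetical experience penalty, lower is better


def pick_rider(candidates, w_cost=1.0, w_eta=1.0, w_exp=1.0):
    """Return the candidate with the lowest weighted sum of the three penalties."""
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda c: w_cost * c.cost + w_eta * c.eta_minutes + w_exp * c.experience,
    )
```

Changing the weights shifts the trade‑off: setting `w_eta=0`, for example, ranks riders on cost and experience alone.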
Behind these services lies Meituan's robust technical foundation, comprising platforms, algorithms, systems, and services that rely on a distributed architecture to ensure high availability and high concurrency.
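Unlike a centralized architecture, a distributed architecture is constrained by the CAP theorem (Consistency, Availability, Partition tolerance): when a network partition occurs, a system cannot guarantee both consistency and availability at once, so the design must choose which to favor. Services are deployed across multiple peer nodes that communicate over the network, forming clusters that aim to provide highly available and consistent services.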
Initially, Meituan used vertical service architectures per business domain; later, to improve availability, it adopted layered services, and eventually evolved to micro‑services, following the principle that good architecture emerges through evolution rather than premature design.
Distributed System Practices
The typical Meituan distributed system structure relies on public components and services to provide partition scaling, disaster recovery, and monitoring. Front‑end traffic is balanced by HLB; within a partition, services communicate via OCTO for service registration, discovery, load balancing, fault tolerance, and gray releases, while message queues like Kafka or RabbitMQ can also be used. Storage accesses distributed databases via Zebra. Monitoring is handled by the open‑source CAT system. Distributed caching uses a Squirrel+Cellar combo, and task scheduling is performed by Crane.
A key challenge is cluster scalability: stateful clusters scale poorly, so expansion is slow and individual nodes become hotspots with uneven resource or CPU usage.
To address these, the backend team transformed stateful nodes into stateless ones, leveraging parallel computation to distribute load across smaller nodes for rapid scaling.
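The idea of stateless nodes sharing load through parallel computation can be sketched as follows. This is an illustrative example only: `score_chunk` stands in for whatever per‑order computation a node performs, and the thread pool stands in for a fleet of small nodes. Because the function keeps no state between calls, any worker can take any chunk, which is what makes adding workers cheap.

```python
from concurrent.futures import ThreadPoolExecutor


def score_chunk(chunk):
    """Pure function: the result depends only on the input chunk."""
    # Hypothetical computation: double each order's distance.
    return [(order_id, distance * 2) for order_id, distance in chunk]


def split(items, n):
    """Split items into at most n contiguous chunks of near-equal size."""
    size = max(1, -(-len(items) // n))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_parallel(items, workers=4):
    """Fan chunks out to stateless workers and merge results in input order."""
    chunks = split(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(score_chunk, chunks)
        return [row for chunk in results for row in chunk]
```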
Consistency issues in write‑through scenarios (DB and cache) are mitigated using Databus, a high‑availability, low‑latency, high‑concurrency system that streams database changes (Binlog) to downstream systems like Elasticsearch, other DBs, or KV stores, ensuring eventual data consistency.
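To illustrate how a downstream store can converge on the database state from a change stream, here is a minimal sketch of a consumer that applies change events idempotently. The event shape (`key`, `version`, `op`, `value`) is an assumption made for this example and is not Databus's actual format. Keeping a version per key lets the consumer ignore duplicate or out‑of‑order events, which is one common way to reach eventual consistency.

```python
def apply_change(cache, event):
    """Apply one change event to a key-value cache; return False if it was stale."""
    key, version = event["key"], event["version"]
    current = cache.get(key)
    if current is not None and current["version"] >= version:
        return False  # duplicate or older than what the cache already holds
    if event["op"] == "delete":
        # Keep a tombstone so a late, older update cannot resurrect the row.
        cache[key] = {"version": version, "value": None, "deleted": True}
    else:
        cache[key] = {"version": version, "value": event["value"], "deleted": False}
    return True
```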
High availability rests on three pillars, organized by phase. Before an incident: full‑link stress testing, capacity estimation, periodic health checks, and random fault injection (at the service, machine, and component level). During an incident: anomaly alerts (on performance, business metrics, and availability) and rapid fault isolation (at the single‑machine, cluster, IDC, component, and service level). After an incident: system rollback, throttling, circuit breaking, degradation, and fallback mechanisms.
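Circuit breaking and fallback, two of the mechanisms named above, can be sketched in a few lines. This is a generic illustration of the pattern, not Meituan's implementation; the class name, thresholds, and injectable `clock` parameter are assumptions chosen to keep the example testable.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a cool-down period and use a fallback."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: skip the dependency entirely
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

While the circuit is open, callers get the degraded fallback result immediately instead of waiting on a dependency that is known to be failing.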
Single‑IDC Rapid Deployment & Disaster Recovery
After a single IDC failure, entry services detect the fault and automatically switch traffic; rapid IDC expansion involves pre‑synchronizing data, pre‑deploying services, and opening traffic once ready. All data‑sync and traffic‑distribution services must support automatic fault detection and removal, enabling IDC‑level scaling.
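The switch‑over and expansion behavior described here can be sketched as an entry router that only sends traffic to IDCs that are healthy and fully prepared. The `Idc` fields and the modulo‑based selection are assumptions for illustration; the article does not describe how the entry services actually pick a target.

```python
from dataclasses import dataclass


@dataclass
class Idc:
    name: str
    healthy: bool = True
    data_synced: bool = True        # data has been pre-synchronized
    services_deployed: bool = True  # services have been pre-deployed

    def accepts_traffic(self):
        """An IDC is opened to traffic only when healthy and fully prepared."""
        return self.healthy and self.data_synced and self.services_deployed


def route(request_id, idcs):
    """Spread requests over the IDCs that currently accept traffic."""
    live = [idc for idc in idcs if idc.accepts_traffic()]
    if not live:
        raise RuntimeError("no IDC available")
    return live[request_id % len(live)].name
```

Marking an IDC unhealthy removes it from rotation on the next request, and a newly added IDC receives nothing until both preparation flags are set.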
Multi‑Center Attempts
When a partition reaches capacity limits, Meituan groups multiple IDC resources into a virtual center, deploying services uniformly across the center. If capacity is insufficient, new IDC units are added to expand.
Unit‑Based Attempts
Compared to multi‑center, unit‑based designs offer superior partition disaster recovery and scaling. Traffic routing is based on regions or cities; data synchronization may experience latency across locations. SET disaster recovery ensures that if a local or remote SET fails, traffic can be quickly shifted to another SET.
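A minimal sketch of city‑based routing with SET failover might look like the following. The city names, SET names, and the one‑backup‑per‑SET mapping are hypothetical; the article only states that routing is by region or city and that traffic can shift to another SET on failure.

```python
# Hypothetical mappings, for illustration only.
SET_BY_CITY = {"beijing": "set-north", "shanghai": "set-east"}
BACKUP_SET = {"set-north": "set-east", "set-east": "set-north"}


def route_to_set(city, failed_sets=frozenset()):
    """Return the SET serving a city, shifting to its backup if the primary failed."""
    primary = SET_BY_CITY[city]
    if primary not in failed_sets:
        return primary
    backup = BACKUP_SET[primary]
    if backup in failed_sets:
        raise RuntimeError(f"no SET available for {city}")
    return backup
```

Because synchronization between SETs may lag, data served by the backup right after a shift can be slightly behind the failed primary.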
Core Technical Capabilities and Platform Accumulation for Intelligent Logistics
The machine‑learning platform provides an end‑to‑end solution for model training and algorithm deployment, addressing challenges of diverse algorithm scenarios, redundant development, and inconsistent online/offline data quality.
JARVIS is an AIOps platform focused on stability, handling massive duplicate alerts, and improving fault analysis efficiency by automating detection, correlation, and response.
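One simple way to tame duplicate alerts is to collapse repeats of the same alert that arrive close together. The sketch below groups alerts by service and kind within a time window; the tuple format, the 60‑second default, and the function name are assumptions for illustration and do not describe how JARVIS works internally.

```python
def deduplicate(alerts, window_seconds=60):
    """Collapse repeated alerts into groups with a count.

    alerts: iterable of (timestamp, service, kind), sorted by timestamp.
    """
    groups = []
    latest = {}  # (service, kind) -> index of the most recent group
    for ts, service, kind in alerts:
        key = (service, kind)
        idx = latest.get(key)
        if idx is not None and ts - groups[idx]["last_seen"] <= window_seconds:
            groups[idx]["count"] += 1
            groups[idx]["last_seen"] = ts
        else:
            latest[key] = len(groups)
            groups.append({
                "service": service, "kind": kind,
                "first_seen": ts, "last_seen": ts, "count": 1,
            })
    return groups
```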
Future Challenges
Future challenges include the growing complexity of micro‑services, network amplification effects from minor latency, rapid fault localization in complex service topologies, and the shift from cluster‑level to unit‑level operations after unitization, all of which demand advanced AIOps solutions.
Author Bio
Song Bin, senior technical expert at Meituan, has been involved in distributed system architecture and high‑concurrency stability for years, currently leading the backend of the instant logistics team. He focuses on AIOps to enhance system stability in high‑concurrency, distributed environments.
