How Meituan Scaled Instant Logistics with a Distributed Cloud‑Native Architecture
This article explains how Meituan's instant logistics platform evolved over five years, detailing the distributed, high‑concurrency system design, AI‑driven optimizations, multi‑center and unit‑based deployment strategies, and the operational challenges and solutions for achieving high availability and low latency.
Background
Meituan's food delivery business has been operating for five years, and the team has spent more than three of those years building out its instant logistics service. Over that period it accumulated substantial experience with distributed, high-concurrency systems. The main takeaways are twofold: logistics fulfillment tolerates almost no failure or latency, which demands a distributed, scalable, disaster-tolerant architecture; and integrating AI across pricing, ETA, dispatch, capacity planning, subsidies, accounting, voice interaction, LBS mining, operations, and monitoring delivered gains in scale and user experience while reducing cost.
Four characteristics of the business drive these requirements:
- Massive order and rider volumes create ultra-large-scale matching computations.
- Holidays and severe weather cause traffic to surge to many times the normal level.
- Logistics fulfillment links online systems to offline operations, demanding near-zero downtime and extremely high availability.
- The business depends on real-time, accurate data, so the system is highly sensitive to latency and anomalies.
Meituan Instant Logistics Architecture
The platform focuses on three aspects: providing users with SLA guarantees such as ETA and pricing; matching the most suitable rider under multi‑objective (cost, efficiency, experience) optimization; and offering riders decision‑support tools like intelligent voice, route recommendation, and store‑arrival reminders.
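As a rough illustration of that multi-objective trade-off, the sketch below scores candidate riders with a weighted sum of cost, ETA, and experience terms. The Rider fields, weights, and scoring function are all hypothetical; the production dispatcher solves a far larger combinatorial assignment problem, but the direction of each term is the same.

```java
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: weighted-sum scoring for multi-objective rider
// matching. Fields and weights are invented for the example.
public class RiderMatcher {

    record Rider(String id, double deliveryCostYuan, double etaMinutes, double recentRating) {}

    // Lower cost and ETA are better; higher rating (experience) is better,
    // so the rating term enters with a negative sign.
    static double score(Rider r, double wCost, double wEta, double wExperience) {
        return wCost * r.deliveryCostYuan()
             + wEta * r.etaMinutes()
             - wExperience * r.recentRating();
    }

    static Rider pickBest(List<Rider> candidates) {
        // Example weights; in practice these would be tuned or learned per city and time slot.
        return candidates.stream()
                .min(Comparator.comparingDouble(r -> score(r, 1.0, 0.5, 2.0)))
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<Rider> riders = List.of(
                new Rider("r1", 6.0, 18, 4.8),
                new Rider("r2", 5.0, 25, 4.9),
                new Rider("r3", 7.5, 12, 4.6));
        System.out.println("Chosen rider: " + pickBest(riders).id());
    }
}
```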
Underpinning these capabilities is a distributed system architecture designed for high availability and high concurrency.
Distributed System Practice
The typical distributed system structure relies on Meituan's shared components and services to achieve partition scaling, disaster recovery, and monitoring:
- Front-end traffic is load-balanced by HLB.
- Within a partition, services communicate via OCTO, which provides registration, discovery, load balancing, fault tolerance, and gray releases; message queues such as Kafka or RabbitMQ can also be used.
- Storage accesses the distributed database through Zebra.
- Monitoring uses the open-source CAT system.
- Distributed caching combines Squirrel and Cellar, and task scheduling is handled by Crane.
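The sketch below shows the general pattern that registry-based discovery with client-side load balancing follows. It is a generic illustration, not OCTO's actual API; the class and method names are invented for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Generic sketch of registry-backed discovery with round-robin client-side
// load balancing. All names are hypothetical; this is not OCTO's API.
public class ServiceRegistry {

    private final Map<String, List<String>> endpoints = new ConcurrentHashMap<>();
    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    public void register(String service, String address) {
        endpoints.computeIfAbsent(service, s -> new CopyOnWriteArrayList<>()).add(address);
    }

    // Fault tolerance in a real framework also evicts unhealthy instances;
    // here we only rotate across whatever is registered.
    public String pick(String service) {
        List<String> addrs = endpoints.getOrDefault(service, List.of());
        if (addrs.isEmpty()) throw new IllegalStateException("no instance for " + service);
        int i = counters.computeIfAbsent(service, s -> new AtomicInteger()).getAndIncrement();
        return addrs.get(Math.floorMod(i, addrs.size()));
    }

    public static void main(String[] args) {
        ServiceRegistry reg = new ServiceRegistry();
        reg.register("dispatch-service", "10.0.0.1:8080");
        reg.register("dispatch-service", "10.0.0.2:8080");
        for (int i = 0; i < 4; i++) System.out.println(reg.pick("dispatch-service"));
    }
}
```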
A key challenge is cluster scalability: stateful clusters expand slowly, which creates resource hotspots and uneven CPU usage. To address this, the backend team converted stateful nodes into stateless ones and exploited parallel computation, letting smaller nodes share the load and enabling rapid scaling.
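A minimal sketch of that idea, assuming a placeholder per-shard scoring function: once workers hold no state, a large order batch can be cut into shards, computed on any available node, and merged afterward.

```java
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Sketch of stateless parallel scaling: the shard size, pool size, and
// scoring function are placeholders for the real matching computation.
public class StatelessMatching {

    static int scoreShard(List<Integer> orderIds) {
        // Placeholder for the real per-shard matching work.
        return orderIds.stream().mapToInt(id -> id % 97).sum();
    }

    public static void main(String[] args) throws Exception {
        List<Integer> orders = IntStream.range(0, 10_000).boxed().toList();
        int shardSize = 1_000;

        ExecutorService pool = Executors.newFixedThreadPool(8); // stateless workers
        List<Future<Integer>> futures = IntStream
                .iterate(0, i -> i < orders.size(), i -> i + shardSize)
                .mapToObj(i -> orders.subList(i, Math.min(i + shardSize, orders.size())))
                .map(shard -> pool.submit(() -> scoreShard(shard)))
                .collect(Collectors.toList());

        int total = 0;
        for (Future<Integer> f : futures) total += f.get(); // merge shard results
        pool.shutdown();
        System.out.println("merged result: " + total);
    }
}
```

Because every worker can process any shard, adding nodes immediately adds capacity, which is exactly what a stateful cluster cannot do quickly.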
Consistency is solved with Databus, a high‑availability, low‑latency, high‑concurrency change‑data‑capture system that streams binlog changes to Elasticsearch, other databases, or KV stores, ensuring eventual data consistency across systems.
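A minimal sketch of the consumer side of such a CDC pipeline, with an assumed event shape (Databus's real interfaces are not shown here): applying changes idempotently and in offset order is what lets replayed or redelivered events converge to the same state.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical consumer for a Databus-style CDC stream. Event shape and
// store are assumptions; the real pipeline parses MySQL binlogs.
public class CdcConsumer {

    record ChangeEvent(String table, String primaryKey, long binlogOffset, String rowJson) {}
    record Stored(long offset, String rowJson) {}

    private final Map<String, Stored> kvStore = new ConcurrentHashMap<>();

    public void apply(ChangeEvent e) {
        String key = e.table() + ":" + e.primaryKey();
        // Keep the newer version only, so a redelivered or replayed event
        // is a no-op instead of overwriting fresher state.
        kvStore.merge(key, new Stored(e.binlogOffset(), e.rowJson()),
                (old, inc) -> inc.offset() > old.offset() ? inc : old);
    }

    public static void main(String[] args) {
        CdcConsumer c = new CdcConsumer();
        c.apply(new ChangeEvent("orders", "42", 101, "{\"status\":\"DISPATCHED\"}"));
        c.apply(new ChangeEvent("orders", "42", 100, "{\"status\":\"CREATED\"}")); // stale replay
        System.out.println(c.kvStore); // keeps the offset-101 version
    }
}
```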
High availability is ensured through three phases: pre‑incident (full‑link stress testing, capacity estimation, periodic health checks, fault injection), incident (alerting, rapid fault localization, change collection), and post‑incident (rollback, throttling, circuit breaking, degradation, and fallback mechanisms).
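As one concrete example of the circuit-breaking and fallback tools listed above, here is a minimal hand-rolled breaker. The thresholds and fallback are illustrative, and a production system would use a hardened library rather than this sketch.

```java
import java.util.function.Supplier;

// Minimal circuit-breaker sketch: trip after N consecutive failures,
// serve the fallback while open, then retry after a cool-down.
public class CircuitBreaker {
    private enum State { CLOSED, OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;
    private State state = State.CLOSED;

    public CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt < openMillis) {
                return fallback.get();            // degrade while the breaker is open
            }
            state = State.CLOSED;                 // half-open simplified to a plain retry
            consecutiveFailures = 0;
        }
        try {
            T result = action.get();
            consecutiveFailures = 0;
            return result;
        } catch (RuntimeException ex) {
            if (++consecutiveFailures >= failureThreshold) {
                state = State.OPEN;               // trip: stop hammering a failing dependency
                openedAt = System.currentTimeMillis();
            }
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        CircuitBreaker cb = new CircuitBreaker(3, 5_000);
        String v = cb.call(() -> { throw new RuntimeException("timeout"); },
                           () -> "cached ETA");
        System.out.println(v); // falls back to "cached ETA"
    }
}
```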
Single‑IDC Rapid Deployment & Disaster Recovery
After a single IDC fails, entry services detect the fault and automatically switch traffic away from it. Rapid scaling relies on pre-synchronized data and pre-deployed services, with traffic opened only once the new instances are ready. All data-sync and traffic-distribution services must support automatic fault detection and removal, and scaling is performed per IDC.
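A simplified sketch of that entry-layer behavior, with invented IDC names: route to the preferred IDC while it is healthy, and fall through to the next one when the fault detector marks it down.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of entry-layer IDC failover. Health flags would be flipped by an
// automatic fault detector; names and ordering are assumptions.
public class IdcRouter {

    private final Map<String, Boolean> idcHealthy = new LinkedHashMap<>();

    public IdcRouter() {
        idcHealthy.put("idc-beijing", true);   // preferred
        idcHealthy.put("idc-shanghai", true);  // fallback
    }

    public void markDown(String idc) { idcHealthy.put(idc, false); }

    public String route() {
        return idcHealthy.entrySet().stream()
                .filter(Map.Entry::getValue)
                .map(Map.Entry::getKey)
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("all IDCs down"));
    }

    public static void main(String[] args) {
        IdcRouter router = new IdcRouter();
        System.out.println(router.route()); // idc-beijing
        router.markDown("idc-beijing");
        System.out.println(router.route()); // traffic switches to idc-shanghai
    }
}
```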
Multi‑Center Attempts
When a partition cannot be expanded due to resource exhaustion, Meituan groups multiple IDC nodes into a virtual center, treating the center as a partition unit. Services are deployed uniformly across the center, and new IDC nodes are added to increase capacity when needed.
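The sketch below captures the abstraction with illustrative names and capacity numbers: a virtual center aggregates IDC nodes, and scaling the partition simply means adding another node to the center.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the multi-center idea: several IDC nodes grouped into one
// virtual center that the rest of the system treats as a single partition.
public class VirtualCenter {
    record IdcNode(String name, int capacityQps) {}

    private final String name;
    private final List<IdcNode> nodes = new ArrayList<>();

    public VirtualCenter(String name) { this.name = name; }

    // Scaling the partition = adding another IDC node to the center.
    public void addNode(IdcNode node) { nodes.add(node); }

    public int totalCapacityQps() {
        return nodes.stream().mapToInt(IdcNode::capacityQps).sum();
    }

    public static void main(String[] args) {
        VirtualCenter center = new VirtualCenter("east-china");
        center.addNode(new IdcNode("idc-a", 50_000));
        center.addNode(new IdcNode("idc-b", 50_000));
        System.out.println(center.totalCapacityQps()); // 100000
        center.addNode(new IdcNode("idc-c", 50_000));  // capacity grows with the new node
        System.out.println(center.totalCapacityQps()); // 150000
    }
}
```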
Unit‑Based Attempts
Unitization improves partition disaster recovery and scaling compared to the multi-center approach. Traffic is routed by regional or city characteristics, data synchronization may incur latency across locations, and SET disaster recovery ensures that if a local or remote SET fails, its traffic can be quickly shifted to another SET.
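A minimal sketch of SET routing under those assumptions, with hypothetical city-to-SET mappings: each city has a home SET, and traffic shifts to a designated backup when the home SET is marked failed.

```java
import java.util.Map;
import java.util.Set;

// Sketch of unit-based (SET) routing with failover. City mappings, SET
// names, and the backup topology are all invented for illustration.
public class SetRouter {

    private static final Map<String, String> CITY_TO_SET = Map.of(
            "beijing", "set-north",
            "shanghai", "set-east",
            "guangzhou", "set-south");

    private static final Map<String, String> BACKUP_SET = Map.of(
            "set-north", "set-east",
            "set-east", "set-south",
            "set-south", "set-north");

    private final Set<String> failedSets;

    public SetRouter(Set<String> failedSets) { this.failedSets = failedSets; }

    public String route(String city) {
        String home = CITY_TO_SET.getOrDefault(city, "set-default");
        // SET disaster recovery: shift to the backup SET when home is down.
        return failedSets.contains(home) ? BACKUP_SET.getOrDefault(home, "set-default") : home;
    }

    public static void main(String[] args) {
        SetRouter router = new SetRouter(Set.of("set-north"));
        System.out.println(router.route("beijing"));  // set-east (failover)
        System.out.println(router.route("shanghai")); // set-east (home SET)
    }
}
```

Routing by city keeps an order's data and computation inside one SET, which is what makes both the latency profile and the failover story tractable.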
Core Intelligent Logistics Technologies and Platform Consolidation
The machine‑learning platform provides an end‑to‑end solution for model training and algorithm deployment, addressing the challenges of diverse algorithm scenarios, duplicated effort, and inconsistent data quality between online and offline environments.
JARVIS is an AIOps platform aimed at stability, handling massive duplicate alerts, extracting useful information, and improving fault analysis efficiency. It replaces manual, experience‑based troubleshooting with automated, reliable incident handling.
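One building block of that alert handling, sketched with an assumed fingerprint scheme and suppression window: duplicate alerts collapse into a single notification per incident window, so operators see each incident once.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of alert deduplication by fingerprint. The window length and
// fingerprint scheme are assumptions, not JARVIS's actual design.
public class AlertDeduplicator {

    private final long windowMillis;
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    public AlertDeduplicator(long windowMillis) { this.windowMillis = windowMillis; }

    private static String fingerprint(String service, String metric) {
        return service + "|" + metric; // a real system would also normalize hosts, thresholds, etc.
    }

    /** Returns true only for the first alert of its kind in the window.
     *  Note: refreshing lastSeen on every alert makes the window sliding,
     *  so a continuous alert storm stays suppressed until it quiets down. */
    public boolean shouldNotify(String service, String metric, long nowMillis) {
        String fp = fingerprint(service, metric);
        Long prev = lastSeen.put(fp, nowMillis);
        return prev == null || nowMillis - prev > windowMillis;
    }

    public static void main(String[] args) {
        AlertDeduplicator dedup = new AlertDeduplicator(60_000);
        System.out.println(dedup.shouldNotify("dispatch", "p99-latency", 0));       // true
        System.out.println(dedup.shouldNotify("dispatch", "p99-latency", 30_000));  // false, suppressed
        System.out.println(dedup.shouldNotify("dispatch", "p99-latency", 120_000)); // true, new window
    }
}
```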
Future Challenges
Future challenges include the growing complexity of microservices, network amplification effects caused by slight latency in mesh‑structured clusters, rapid fault localization in complex topologies, and the shift from cluster‑level to unit‑level operations, which will demand new deployment capabilities.