Meituan Instant Logistics: Distributed System Architecture Evolution and Challenges

The article details Meituan's five‑year journey in building a highly available, low‑latency distributed instant‑logistics platform, describing its architectural evolution from vertical services to micro‑services, the integration of AI for cost‑efficiency‑experience optimization, and the operational challenges and solutions for scaling, consistency, and fault tolerance.

Top Architect

Background

Meituan's instant delivery has developed over five years, with more than three years of real‑time logistics exploration, accumulating experience in building distributed high‑concurrency systems. The main take‑aways are the extremely low tolerance for faults and latency, and the extensive use of AI techniques to improve cost, efficiency, and user experience.

Meituan Instant Logistics Architecture

The platform focuses on three core aspects: providing SLA guarantees such as ETA calculation and pricing, matching the most suitable rider under multi‑objective optimization (cost, efficiency, experience), and offering rider‑side decision‑support (voice interaction, route recommendation, store‑arrival reminders).
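As a rough illustration of what multi-objective rider matching could look like, the sketch below scores each candidate rider with a weighted sum over cost, delivery time, and predicted experience. The `Candidate` fields, the `pick_rider` function, and the weights are all hypothetical; the article does not describe Meituan's actual matching algorithm, and a production system would normalize the terms to comparable scales first.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    rider_id: str
    cost: float         # estimated delivery cost, lower is better
    eta_minutes: float  # estimated delivery time, lower is better
    experience: float   # predicted satisfaction in [0, 1], higher is better


def pick_rider(candidates, w_cost=1.0, w_eta=1.0, w_exp=1.0):
    """Return the candidate with the lowest weighted penalty (illustrative only)."""
    if not candidates:
        raise ValueError("no candidate riders")

    def penalty(c):
        # Cost and ETA add to the penalty; good experience reduces it.
        return w_cost * c.cost + w_eta * c.eta_minutes - w_exp * c.experience

    return min(candidates, key=penalty)
```

Changing the weights shifts the trade-off: setting `w_eta=0`, for example, ignores delivery time and favors the cheaper rider.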

Behind these services lies a robust technical ecosystem that supports a distributed architecture ensuring high availability and high concurrency.

Distributed System Basics

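A distributed architecture, as opposed to a centralized one, is constrained by the CAP theorem: of Consistency, Availability, and Partition tolerance, a system can fully guarantee at most two at once, so during a network partition it must trade consistency against availability. Services are deployed on multiple peer nodes that communicate over the network, forming clusters that aim to provide highly available and consistent services.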

Meituan initially adopted vertical services per business domain, later evolved to layered services for availability, and finally transitioned to a micro‑service architecture, emphasizing gradual evolution rather than premature design.

Distributed System Practices

The typical Meituan distributed system structure relies on public components and services for partition scaling, disaster recovery, and monitoring. Front‑end traffic is balanced by HLB. Service communication within a partition uses OCTO for registration, discovery, load balancing, fault tolerance, and gray releases; message queues such as Kafka or RabbitMQ can also be used. The storage layer accesses distributed databases through Zebra, monitoring is handled by CAT (Meituan's open‑source distributed monitoring system), caching uses Squirrel and Cellar, and task scheduling is performed by Crane.
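To make registration, discovery, and load balancing concrete, here is a minimal in-memory sketch. It is not OCTO's API: the `ServiceRegistry` class, its method names, and the addresses are assumptions made purely for illustration.

```python
from collections import defaultdict


class ServiceRegistry:
    """Toy registry: register, deregister, and round-robin discovery."""

    def __init__(self):
        self._instances = defaultdict(list)  # service name -> list of addresses
        self._cursors = {}                   # service name -> round-robin position

    def register(self, service, address):
        if address not in self._instances[service]:
            self._instances[service].append(address)

    def deregister(self, service, address):
        # Called when a node is unhealthy, so callers stop being routed to it.
        if address in self._instances[service]:
            self._instances[service].remove(address)

    def pick(self, service):
        """Return the next registered address for the service, round-robin."""
        nodes = self._instances.get(service)
        if not nodes:
            raise LookupError(f"no live instance for {service!r}")
        index = self._cursors.get(service, 0) % len(nodes)
        self._cursors[service] = index + 1
        return nodes[index]
```

A caller would `register` each instance at startup and call `pick` per request; removing a failed node with `deregister` is the simplest form of the fault tolerance mentioned above.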

Key challenges include cluster scalability (especially for stateful services), hotspot nodes, and uneven resource utilization.

Solutions: For scalability, transform stateful nodes into stateless ones and distribute computation across many small nodes so the cluster can scale rapidly. For consistency between the database and the cache, use Databus, a high‑availability, low‑latency, high‑concurrency change‑data‑capture system that propagates binlog changes to downstream stores. For availability, prepare with full‑link stress testing, periodic health checks, and random fault drills; detect problems with real‑time alerts and rapid fault localization; and contain incidents with rollback, throttling, circuit breaking, and degradation mechanisms.
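The binlog-driven consistency idea can be sketched as a consumer that applies row-level change events to a cache in log order. This is not Databus code: the `ChangeEvent` shape, the `CacheSyncer` class, and the use of a plain dict as the cache are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeEvent:
    """One row-level change, in a shape a binlog parser might emit (assumed)."""
    table: str
    op: str               # "insert", "update", or "delete"
    key: str
    row: Optional[dict]   # new row image; None for deletes
    position: int         # monotonically increasing log position


class CacheSyncer:
    """Applies change events to a cache in log order, skipping replays."""

    def __init__(self, cache: dict):
        self.cache = cache
        self.last_applied = -1

    def apply(self, event: ChangeEvent) -> bool:
        # Idempotence: ignore events at or before the last applied position,
        # so redelivery after a consumer restart does not regress the cache.
        if event.position <= self.last_applied:
            return False
        cache_key = f"{event.table}:{event.key}"
        if event.op == "delete":
            self.cache.pop(cache_key, None)
        else:
            self.cache[cache_key] = event.row
        self.last_applied = event.position
        return True
```

Because the cache is updated only from the change log rather than by application code, the database remains the single source of truth and the cache converges to it in log order.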

Single‑IDC Rapid Deployment & Disaster Recovery

After a single IDC failure, entry services detect the fault and automatically switch traffic; rapid scaling synchronizes data and pre‑deploys services before opening traffic. All data‑sync and traffic‑distribution services must support automatic fault detection and removal, and scaling is performed per IDC.
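The failover flow described here can be sketched as an entry router that ejects an IDC after repeated failed health checks and admits a new IDC only once its data is synchronized and its services are deployed. The `EntryRouter` class, the failure threshold, and the IDC names are hypothetical.

```python
class EntryRouter:
    """Routes requests across IDCs, ejecting any IDC that fails repeatedly."""

    def __init__(self, idcs, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = {idc: 0 for idc in idcs}
        self.serving = set(idcs)  # IDCs currently receiving traffic
        self.warming = set()      # IDCs being prepared, no traffic yet

    def report(self, idc, ok):
        """Record one health-check result for an IDC."""
        if ok:
            self.failures[idc] = 0
            return
        self.failures[idc] += 1
        if self.failures[idc] >= self.failure_threshold:
            self.serving.discard(idc)  # automatic removal from rotation

    def add_idc(self, idc):
        # A new IDC starts in the warming set: data sync and service
        # deployment happen before it is allowed to take traffic.
        self.failures[idc] = 0
        self.warming.add(idc)

    def open_traffic(self, idc, data_synced, services_ready):
        """Move a warming IDC into rotation only when both checks pass."""
        if idc in self.warming and data_synced and services_ready:
            self.warming.remove(idc)
            self.serving.add(idc)
            return True
        return False

    def route(self, request_id):
        if not self.serving:
            raise RuntimeError("no IDC available")
        targets = sorted(self.serving)
        return targets[request_id % len(targets)]
```

In this sketch an ejected IDC is not re-admitted automatically when its health checks recover; a real system would need an explicit recovery path for that.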

Multi‑Center Attempts

Meituan groups multiple IDC partitions into virtual centers; services are deployed uniformly across centers. When a center reaches capacity, new IDC resources are added to expand capacity.

Unit‑Based Attempts

Unit‑based deployment offers finer‑grained disaster recovery and scaling compared to multi‑center. Traffic routing is based on regions or cities, while data synchronization may experience cross‑region latency. SET disaster recovery ensures rapid failover to other SETs when needed.
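A minimal sketch of city-based routing with SET failover follows. The `SetRouter` class, the city-to-SET mapping, and the one-backup-per-SET policy are assumptions chosen for illustration, not details from Meituan's implementation.

```python
class SetRouter:
    """Maps each city to a home SET and fails over to a backup SET."""

    def __init__(self, city_to_set, backups):
        self.city_to_set = dict(city_to_set)  # city -> home SET
        self.backups = dict(backups)          # SET -> backup SET
        self.down = set()                     # SETs currently marked unavailable

    def mark_down(self, set_name):
        self.down.add(set_name)

    def mark_up(self, set_name):
        self.down.discard(set_name)

    def route(self, city):
        """Return the home SET for a city, or its backup if the home is down."""
        home = self.city_to_set[city]
        if home not in self.down:
            return home
        backup = self.backups.get(home)
        if backup is None or backup in self.down:
            raise RuntimeError(f"no healthy SET for {city!r}")
        return backup
```

Failover is only safe if the backup SET already holds the failed SET's data, which is where the cross-region synchronization latency mentioned above becomes the limiting factor.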

Core Intelligent Logistics Technologies and Platform Consolidation

The machine‑learning platform provides an end‑to‑end solution for model training and algorithm deployment, addressing repeated development and data quality inconsistencies across online and offline environments.

JARVIS is an AIOps platform focused on stability, aggregating noisy alerts, reducing duplicate notifications, and enabling faster fault analysis and resolution for large‑scale distributed clusters.
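One simple way to reduce duplicate notifications is to suppress repeats of the same alert within a time window. The sketch below shows that idea only; the `Alert` fields, the `AlertAggregator` class, and the fixed-window policy are assumptions, and the article does not say how JARVIS aggregates alerts internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    kind: str         # e.g. "latency" or "error_rate"
    timestamp: float  # seconds


class AlertAggregator:
    """Suppresses repeats of the same (service, kind) alert within a window."""

    def __init__(self, window_seconds=300.0):
        self.window = window_seconds
        self._last_notified = {}  # (service, kind) -> time of last notification
        self._suppressed = {}     # (service, kind) -> repeats since that time

    def ingest(self, alert):
        """Return True if the alert should notify, False if it is suppressed."""
        key = (alert.service, alert.kind)
        last = self._last_notified.get(key)
        if last is not None and alert.timestamp - last < self.window:
            self._suppressed[key] = self._suppressed.get(key, 0) + 1
            return False
        self._last_notified[key] = alert.timestamp
        self._suppressed[key] = 0
        return True

    def suppressed_count(self, service, kind):
        return self._suppressed.get((service, kind), 0)
```

The suppressed count can be attached to the next notification so that on-call engineers still see how noisy a source was without being paged for every repeat.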

Future Challenges

Future challenges include the growing complexity of micro‑services, the way minor network latency is amplified as it propagates through the system, rapid fault localization in intricate service topologies, and the shift from cluster‑level to unit‑level operations, which demands more sophisticated deployment capabilities.

Author Bio: Song Bin, senior technical expert at Meituan, leads the backend of the instant logistics team, focusing on distributed system architecture, high‑concurrency stability, and AIOps research.