Evolution of Meituan Instant Logistics Distributed Architecture and Operational Practices

The article describes Meituan's five‑year journey in instant logistics, detailing the challenges of massive order‑rider matching, high traffic spikes, ultra‑low latency requirements, and how a layered, micro‑service‑based distributed architecture combined with AI techniques was progressively adopted to achieve scalability, reliability, and cost efficiency.


Background

Meituan's food delivery business has been running for five years, and its instant logistics platform has been evolving for over three years, accumulating experience in building distributed, high-concurrency systems. The main takeaways are twofold: the business tolerates almost no failures or high latency, demanding distributed, scalable, fault-tolerant systems; and the logistics stack integrates AI deeply across pricing, ETA, dispatch, capacity planning, subsidies, accounting, voice interaction, LBS mining, operations, and monitoring to grow scale, preserve experience, and reduce cost.

The article introduces the technical obstacles and challenges encountered during the layered evolution of Meituan's instant logistics distributed architecture:

Massive order and rider scale leading to ultra‑large‑scale matching computations.

Holiday or adverse weather causing traffic spikes many times higher than normal.

Logistics fulfillment is the critical link between online and offline, with near‑zero tolerance for downtime or lost orders.

Stringent real‑time data accuracy and latency requirements.

Meituan Instant Logistics Architecture

The platform focuses on three aspects: providing SLA to users (ETA calculation and delivery fee pricing), matching the most suitable rider under multi‑objective (cost, efficiency, experience) optimization, and offering riders decision‑support tools such as intelligent voice, route recommendation, and store arrival reminders.
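The multi-objective matching described above can be sketched as a weighted scoring function over candidate riders. The feature names and weights below are illustrative assumptions for the cost/efficiency/experience trade-off, not Meituan's actual dispatch model:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    rider_id: str
    extra_travel_km: float   # cost proxy: added distance if this rider takes the order
    eta_minutes: float       # efficiency proxy: predicted time to deliver
    on_time_rate: float      # experience proxy: historical punctuality in [0, 1]

def match_score(c: Candidate, w_cost=0.4, w_eff=0.4, w_exp=0.2) -> float:
    # Lower is better: penalize cost and latency, reward punctual riders.
    return (w_cost * c.extra_travel_km
            + w_eff * c.eta_minutes
            - w_exp * 10 * c.on_time_rate)

def pick_rider(candidates):
    return min(candidates, key=match_score)
```

In practice such scoring runs over enormous candidate sets, which is exactly the "ultra-large-scale matching computation" challenge listed earlier.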

Behind these services lies Meituan's robust technical foundation, comprising platforms, algorithms, systems, and services that rely on a distributed architecture to ensure high availability and high concurrency.

Distributed architecture, in contrast to centralized architecture, is governed by the CAP theorem: a system cannot simultaneously guarantee Consistency, Availability, and Partition tolerance, so when a network partition occurs a design must trade consistency against availability. Services are deployed across multiple peer nodes that communicate over the network, forming clusters that deliver the required availability and consistency guarantees.

Initially, Meituan used vertical service architectures per business domain; later, to improve availability, it adopted layered services, and eventually evolved to micro‑services, following the principle that good architecture emerges through evolution rather than premature design.

Distributed System Practices

The typical Meituan distributed system structure relies on public components and services to provide partition scaling, disaster recovery, and monitoring. Front‑end traffic is balanced by HLB; within a partition, services communicate via OCTO for service registration, discovery, load balancing, fault tolerance, and gray releases, while message queues like Kafka or RabbitMQ can also be used. Storage accesses distributed databases via Zebra. Monitoring is handled by the open‑source CAT system. Distributed caching uses a Squirrel+Cellar combo, and task scheduling is performed by Crane.

Key challenges include cluster scalability: stateful clusters scale poorly, leading to slow expansion and node hotspots such as uneven resource and CPU usage.

To address these, the backend team transformed stateful nodes into stateless ones, leveraging parallel computation to distribute load across smaller nodes for rapid scaling.
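The stateless-node approach can be sketched as follows: scoring is a pure function with no node-local state, so the order stream can be sharded and fanned out to any number of workers, and newly added nodes can pick up shards immediately. This is a minimal single-process illustration, not Meituan's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def nearest_rider(order_pos, rider_positions):
    # Pure function with no node-local state: any worker can score any order.
    return min(rider_positions, key=lambda r: abs(order_pos - r))

def match_serial(orders, riders):
    return [nearest_rider(o, riders) for o in orders]

def match_parallel(orders, riders, workers=4):
    # Split the order stream into contiguous shards and fan out to workers;
    # because scoring is stateless, shards can go to any node in the cluster.
    size = max(1, (len(orders) + workers - 1) // workers)
    shards = [orders[i:i + size] for i in range(0, len(orders), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda s: match_serial(s, riders), shards)
    return [m for part in parts for m in part]
```

Because the parallel version is just a resharding of the same pure computation, it returns exactly what the serial version does, which is what makes rapid scale-out safe.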

Consistency issues in write‑through scenarios (DB and cache) are mitigated using Databus, a high‑availability, low‑latency, high‑concurrency system that streams database changes (Binlog) to downstream systems like Elasticsearch, other DBs, or KV stores, ensuring eventual data consistency.
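A consumer of such a change stream can be sketched as below: row-change events decoded from the binlog are applied to the cache so it eventually converges with the database. The event shape here is an illustrative assumption, not Databus's actual message format:

```python
class BinlogCacheSync:
    """Applies row-change events (decoded from a database binlog stream)
    to a cache so it eventually converges with the source database."""

    def __init__(self, cache: dict):
        self.cache = cache

    def apply(self, event: dict):
        op, table, row = event["op"], event["table"], event["row"]
        key = f"{table}:{row['id']}"
        if op in ("insert", "update"):
            self.cache[key] = row          # upsert the latest row image
        elif op == "delete":
            self.cache.pop(key, None)      # idempotent delete
```

Because every write reaches the cache through the same ordered change stream rather than through application code, the cache cannot drift from the database for longer than the replication lag.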

High availability is ensured through three pillars: pre‑incident full‑link stress testing and capacity estimation, periodic health checks and random fault injection (service, machine, component), and in‑incident anomaly alerts (performance, business metrics, availability) with rapid fault isolation (single‑machine, cluster, IDC, component, service). Post‑incident actions include system rollback, throttling, circuit breaking, degradation, and fallback mechanisms.
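The circuit-breaking and fallback mechanisms mentioned above can be sketched minimally: after a threshold of consecutive failures the circuit opens and calls return the fallback immediately, and after a timeout one trial call is allowed through (half-open). This is a generic sketch, not Meituan's component:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: `threshold` consecutive failures open the
    circuit; while open, calls degrade to the fallback without touching the
    failing dependency; after `reset_timeout` seconds one trial is allowed."""

    def __init__(self, threshold=3, reset_timeout=30.0):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()                    # degrade fast while open
            self.opened_at, self.failures = None, 0  # half-open: allow a trial
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0
        return result
```

The fast-fail behavior is the point: while the circuit is open, callers stop piling load onto a struggling dependency, which is what makes rapid fault isolation possible.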

Single‑IDC Rapid Deployment & Disaster Recovery

After a single IDC failure, entry services detect the fault and automatically switch traffic; rapid IDC expansion involves pre‑synchronizing data, pre‑deploying services, and opening traffic once ready. All data‑sync and traffic‑distribution services must support automatic fault detection and removal, enabling IDC‑level scaling.

Multi‑Center Attempts

When a partition reaches capacity limits, Meituan groups multiple IDC resources into a virtual center, deploying services uniformly across the center. If capacity is insufficient, new IDC units are added to expand.

Unit‑Based Attempts

Compared to multi‑center, unit‑based designs offer superior partition disaster recovery and scaling. Traffic routing is based on regions or cities; data synchronization may experience latency across locations. SET disaster recovery ensures that if a local or remote SET fails, traffic can be quickly shifted to another SET.
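City-based SET routing with disaster-recovery failover can be sketched as a small routing table plus a health map. The SET names and city mapping below are hypothetical:

```python
class SetRouter:
    """Routes traffic to a SET (deployment unit) by city; if the primary SET
    is unhealthy, traffic shifts to another healthy SET (SET disaster
    recovery). Names and tables here are illustrative assumptions."""

    def __init__(self, city_to_set: dict, set_health: dict):
        self.city_to_set = city_to_set
        self.set_health = set_health   # kept current by health checks

    def route(self, city: str) -> str:
        primary = self.city_to_set[city]
        if self.set_health.get(primary):
            return primary
        for name in sorted(self.set_health):   # deterministic failover order
            if self.set_health[name]:
                return name
        raise RuntimeError("no healthy SET available")
```

Keeping each city pinned to one SET in the healthy case is what limits cross-location data synchronization to the failover path, where the latency noted above is tolerable.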

Core Technical Capabilities and Platform Accumulation for Intelligent Logistics

The machine‑learning platform provides an end‑to‑end solution for model training and algorithm deployment, addressing challenges of diverse algorithm scenarios, redundant development, and inconsistent online/offline data quality.

JARVIS is an AIOps platform focused on stability, handling massive duplicate alerts, and improving fault analysis efficiency by automating detection, correlation, and response.
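The duplicate-alert problem can be illustrated with a simple fingerprint-and-window collapse: alerts sharing a (service, metric) fingerprint within a time window merge into one record with a repeat count. The alert schema is an illustrative assumption, not JARVIS's API:

```python
def dedup_alerts(alerts, window_s=60):
    """Collapse alerts that share a (service, metric) fingerprint within
    `window_s` seconds into a single alert carrying a repeat count."""
    latest = {}   # fingerprint -> last emitted alert record
    out = []
    for a in sorted(alerts, key=lambda a: a["ts"]):
        fp = (a["service"], a["metric"])
        prev = latest.get(fp)
        if prev is not None and a["ts"] - prev["ts"] <= window_s:
            prev["count"] += 1        # duplicate inside the window
            prev["ts"] = a["ts"]      # slide the window forward
        else:
            rec = dict(a, count=1)
            latest[fp] = rec
            out.append(rec)
    return out
```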

Future Challenges

Future challenges include the growing complexity of micro‑services, network amplification effects from minor latency, rapid fault localization in complex service topologies, and the shift from cluster‑level to unit‑level operations after unitization, all of which demand advanced AIOps solutions.

Author Bio

Song Bin, senior technical expert at Meituan, has been involved in distributed system architecture and high‑concurrency stability for years, currently leading the backend of the instant logistics team. He focuses on AIOps to enhance system stability in high‑concurrency, distributed environments.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: distributed systems, backend architecture, scalability, high concurrency, AI integration, instant logistics
Written by Architecture Digest, a publication focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, and other popular fields.