How Meituan Scaled Instant Logistics with Distributed Systems and AI
This article details Meituan's five-year journey building a high-availability, low-latency instant logistics platform, describing the distributed architecture evolution, AI-driven optimizations, fault-tolerance techniques, and future challenges in scaling microservices for massive order and rider volumes.
Background
Meituan Waimai has been operating for five years, and its instant logistics exploration spans more than three. The business grew from zero to a sizable scale, accumulating experience in building distributed, high-concurrency systems along the way. The main takeaways are twofold:
The instant logistics service has extremely low tolerance for faults and high latency; as business complexity rises, the system must be distributed, scalable, and disaster-recoverable, with the ultimate goal of eliminating downtime risk.
Centered on cost, efficiency, and experience, the system deeply integrates AI across pricing, ETA, dispatch, capacity planning, subsidies, accounting, voice interaction, LBS mining, operations, and monitoring, growing scale while preserving experience and cutting costs.
This article mainly introduces the technical obstacles and challenges encountered during the layered evolution of Meituan's instant logistics distributed system architecture:
Massive order and rider scale leading to ultra‑large‑scale matching computations.
Holiday or severe weather spikes causing traffic peaks dozens of times higher than normal.
Logistics fulfillment is the critical link between online and offline; tolerance for faults is extremely low: no downtime, no lost orders, and very high availability are required.
Real‑time, accurate data demands high sensitivity to latency and anomalies.
Meituan Instant Logistics Architecture
The platform focuses on three aspects: (1) providing users with SLA guarantees such as ETA calculation and delivery fee pricing; (2) under multi‑objective (cost, efficiency, experience) optimization, matching the most suitable rider; (3) offering riders decision‑support tools including intelligent voice, route recommendation, and store‑arrival reminders.
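To make point (2) concrete, here is a minimal sketch of multi-objective rider matching. The features (distance to pickup, current load, historical on-time rate) and the weights are illustrative assumptions standing in for cost, efficiency, and experience; they are not Meituan's actual dispatch model.

```python
from dataclasses import dataclass

@dataclass
class Rider:
    rider_id: str
    distance_km: float   # distance to pickup (efficiency proxy)
    load: int            # orders already being carried (cost proxy)
    on_time_rate: float  # historical on-time ratio (experience proxy)

def match_rider(riders, w_dist=0.5, w_load=0.2, w_exp=0.3):
    """Pick the rider with the best weighted multi-objective score."""
    def score(r):
        # Lower distance and load are better; a higher on-time rate is better.
        return -w_dist * r.distance_km - w_load * r.load + w_exp * r.on_time_rate
    return max(riders, key=score)
```

In production this becomes a batch assignment problem over many orders and riders at once, but the scoring idea is the same.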
Behind these services lies Meituan's powerful technical system, forming a delivery architecture built on distributed systems that must ensure high availability and high concurrency.
Distributed architecture, as opposed to centralized, is constrained by the CAP theorem: a system can guarantee at most two of Consistency, Availability, and Partition tolerance at once. In a distributed setup, a service is deployed on multiple peer nodes that communicate over the network, forming a cluster that trades these properties off to provide a highly available service.
Initially, Meituan used vertical services per business domain; later, for availability, it adopted a layered service architecture; as complexity grew, it evolved to microservices, adhering to the principle of not jumping into microservice design too early: good architecture is evolved, not designed up front.
Distributed System Practices
The typical Meituan distributed system structure relies on public components and services to achieve partition scaling, disaster recovery, and monitoring. Front‑end traffic is distributed and load‑balanced by HLB. Within a partition, services communicate via OCTO, providing service registration, auto‑discovery, load balancing, fault tolerance, gray release, etc. Message queues such as Kafka and RabbitMQ can also be used. The storage layer accesses a distributed database through Zebra. Monitoring and logging are handled by the open‑source CAT system. Distributed caching uses a Squirrel + Cellar combo, and task scheduling is performed by Crane.
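The registry-based service discovery that OCTO provides can be sketched in a few lines. This is a generic in-memory stand-in, not OCTO's actual API; real registries add health checks, weights, fault tolerance, and gray release on top of the registration, discovery, and load-balancing shown here.

```python
import itertools

class ServiceRegistry:
    """Minimal in-memory stand-in for a service registry such as OCTO."""

    def __init__(self):
        self._instances = {}   # service name -> list of "host:port" endpoints
        self._cursors = {}     # service name -> round-robin iterator

    def register(self, service, endpoint):
        # A node announces itself; consumers discover it automatically.
        self._instances.setdefault(service, []).append(endpoint)
        self._cursors[service] = itertools.cycle(self._instances[service])

    def discover(self, service):
        # Client-side load balancing: round-robin over registered instances.
        return next(self._cursors[service])
```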
Practical challenges include cluster scalability: stateful clusters scale poorly and can cause node hotspots and imbalanced resource and CPU usage.
Solutions implemented:
Convert stateful nodes to stateless ones and leverage parallel computation, allowing small business nodes to share load and achieve rapid scaling.
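The idea above can be sketched as follows: when the per-shard computation is pure (stateless), any node can process any shard, so work can be split across a pool of interchangeable workers with no hotspots. The order schema and sharding scheme here are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def shard(items, n):
    """Split items into n roughly equal shards by striding."""
    return [items[i::n] for i in range(n)]

def process_shard(orders):
    # A pure, stateless computation: no node-local state means any
    # worker can take any shard, enabling rapid scale-out.
    return sum(o["amount"] for o in orders)

def parallel_total(orders, workers=4):
    shards = shard(orders, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_shard, shards))
```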
Address consistency by using Databus, a high‑availability, low‑latency, high‑concurrency system that reliably transmits database changes. Databus monitors upstream Binlog changes and pipes them to Elasticsearch, other databases, or KV systems, ensuring eventual data synchronization.
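The downstream side of such a change-data-capture pipeline can be sketched as replaying an ordered stream of Binlog-style events into a replica. The event schema below is an illustrative assumption, not Databus's actual format; Databus itself handles ordering, retries, and delivery guarantees.

```python
def apply_change(event, downstream):
    """Apply one change event to a downstream key-value store."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        downstream[key] = event["value"]
    elif op == "delete":
        downstream.pop(key, None)

def replay(binlog_events, downstream):
    # Replaying the ordered change stream converges the replica to the
    # source state: eventual consistency.
    for event in binlog_events:
        apply_change(event, downstream)
    return downstream
```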
Ensure high availability across the incident lifecycle. Before an incident: full-link stress testing and capacity estimation, plus periodic health checks and random fault drills (service, machine, component). During an incident: real-time anomaly alerts (performance, business metrics, availability) and fast fault localization (single machine, cluster, IDC, component, service). After an incident: rollback, throttling, circuit breaking, degradation, and fallback mechanisms.
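Of the post-incident mechanisms above, circuit breaking is the easiest to miniaturize. The sketch below is a deliberately tiny breaker, an assumption-level illustration rather than Meituan's implementation: it opens after a run of consecutive failures, fast-fails to a degraded fallback while open, and retries after a cooldown. Production breakers add half-open probing and metrics.

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; reject calls while
    open; allow a retry after `reset_after` seconds have elapsed."""

    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()      # fast-fail: degrade instead of waiting
            self.opened_at = None      # cooldown elapsed: try the call again
            self.failures = 0
        try:
            result = fn()
            self.failures = 0          # any success resets the failure run
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback()
```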
Rapid Deployment & Disaster Recovery for a Single IDC
After a single IDC failure, the entrance service detects the fault and automatically switches traffic. Rapid IDC expansion involves pre‑synchronizing data, pre‑deploying services, and opening traffic only when the service is ready. All data‑sync and traffic‑distribution services must support automatic fault detection and removal, enabling scaling per IDC.
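The traffic-switching step can be sketched as reweighting: when the entrance service detects a failed IDC, that IDC's share of traffic is redistributed proportionally to the remaining healthy ones. The weight table below is an illustrative assumption.

```python
def route_traffic(idcs, health):
    """Return the share of traffic each healthy IDC should receive.

    `idcs` maps IDC name -> normal traffic weight; `health` marks which
    IDCs passed the entrance service's fault detection.
    """
    healthy = {name: w for name, w in idcs.items() if health.get(name)}
    if not healthy:
        raise RuntimeError("no healthy IDC available")
    total = sum(healthy.values())
    # Failed IDCs' weight is absorbed proportionally by the survivors.
    return {name: w / total for name, w in healthy.items()}
```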
Multi‑Center Attempts
Meituan groups multiple IDC partitions into a virtual center; services are deployed uniformly across the center. When a center's capacity is insufficient, a new IDC is added to expand capacity.
Unitization Attempts
Compared to multi‑center, unitization offers a superior solution for partition disaster recovery and scaling. Traffic routing is based on business characteristics, using regions or cities. Data synchronization across locations may experience latency. SET disaster recovery ensures that if a local or remote SET fails, traffic can be quickly switched to another SET.
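City-based SET routing with failover can be sketched as follows. The stable-hash placement and linear failover order are illustrative assumptions; the point is that a city's traffic has one home SET, and when that SET fails, traffic falls through to another healthy one.

```python
import hashlib

def pick_set(city_id, sets):
    """Route a request to a unit (SET) by city, with failover.

    `sets` is an ordered list of (set_name, is_healthy) pairs.
    """
    if not sets:
        raise RuntimeError("no SET configured")
    # Stable hash: the same city always lands in the same home SET,
    # which keeps its data and traffic colocated.
    home = int(hashlib.md5(str(city_id).encode()).hexdigest(), 16) % len(sets)
    for offset in range(len(sets)):
        name, healthy = sets[(home + offset) % len(sets)]
        if healthy:
            return name
    raise RuntimeError("all SETs unavailable")
```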
Core Technical Capabilities and Platform Foundations of Smart Logistics
The machine‑learning platform is a one‑stop solution for model training and algorithm deployment, addressing the contradictions of numerous algorithm scenarios, repetitive development, and inconsistent online/offline data quality. Without a clear, coherent process, iteration efficiency drops and data‑quality issues hinder feature and model deployment.
JARVIS is an AIOps platform aimed at stability. It tackles problems such as alarm flooding, duplicate alerts, and difficulty extracting useful information. Historically, small‑scale distributed cluster faults relied on manual analysis, leading to low efficiency and unstable outcomes. JARVIS automates alarm aggregation, root‑cause analysis, and rapid response, improving fault handling speed and reliability.
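The alarm-aggregation step can be sketched as grouping a flood of alarms by fingerprint within a time window. The fields used for the fingerprint here (service, metric, window) are illustrative assumptions, not JARVIS's actual schema.

```python
from collections import defaultdict

def aggregate_alarms(alarms, window_s=60):
    """Collapse an alarm flood into one entry per fingerprint per window."""
    groups = defaultdict(list)
    for a in alarms:
        fingerprint = (a["service"], a["metric"], int(a["ts"] // window_s))
        groups[fingerprint].append(a)
    # Emit one representative alarm per group, annotated with flood size,
    # so on-call engineers see one line instead of hundreds of duplicates.
    return [{**items[0], "count": len(items)} for items in groups.values()]
```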
Future Challenges
Reflecting on the journey, the team foresees significant challenges ahead: microservices are no longer "micro" as business complexity grows, causing service bloat; the mesh-like call topology amplifies even minor latency; complex service topologies make rapid fault localization difficult, a key focus for AIOps; and unit-level operations shift maintenance from clusters to units, posing substantial deployment challenges for Meituan.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
ITFLY8 Architecture Home