
How Meituan Scales Instant Delivery with a Distributed Architecture

Meituan's instant logistics platform evolved over five years, adopting distributed, fault‑tolerant systems, AI‑driven optimization, and multi‑IDC strategies to handle massive order volumes, extreme traffic spikes, and stringent real‑time reliability requirements while continuously improving scalability and cost efficiency.

ITFLY8 Architecture Home

Mar 15, 2021


Background

Meituan Waimai has been operating for five years, and its work on instant logistics spans more than three of them. In that time the business grew from zero to a sizable scale, and the team accumulated experience in building high‑concurrency distributed systems. The main takeaways are twofold:

Instant logistics tolerates almost no failures or high latency; as business complexity rises, the system must be distributed, scalable, and fault‑tolerant. A staged architectural upgrade eliminated downtime risk.

With a focus on cost, efficiency, and experience, the instant logistics system makes heavy use of AI in pricing, ETA calculation, dispatch, capacity planning, subsidies, accounting, voice interaction, LBS mining, operations, monitoring, and more. The goal is to grow scale and improve experience while reducing cost.

This article outlines the technical challenges encountered during the evolution of Meituan's distributed architecture for instant logistics:

Massive order and rider scale creates ultra‑large‑scale matching computation.

Holidays or severe weather cause order surges many times the normal peak.

Logistics fulfillment links online to offline, demanding near‑zero failure tolerance, no downtime, and no order loss.

Real‑time, accurate data is required; the system is highly sensitive to latency and anomalies.

Meituan Instant Logistics Architecture

The platform revolves around three core aspects: (1) providing SLA guarantees such as ETA and pricing to users; (2) matching the most suitable rider under multi‑objective (cost, efficiency, experience) optimization; (3) offering riders decision‑support tools like intelligent voice, route recommendation, and store‑arrival reminders.
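As a rough illustration of the second point, matching under several objectives can be reduced to a weighted score per candidate rider. The sketch below is only an assumption about how such a score might look: the `Candidate` fields, the weights, and the `MAX_ETA` cap are all hypothetical and are not taken from Meituan's dispatch system.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    rider_id: str
    cost: float         # hypothetical delivery cost, normalized to [0, 1]
    eta_minutes: float  # estimated time of arrival for this rider
    experience: float   # hypothetical service-quality score in [0, 1]


# Hypothetical weights for the three objectives named in the text.
WEIGHTS = {"cost": 0.3, "efficiency": 0.5, "experience": 0.2}
MAX_ETA = 60.0  # assumed cap used to normalize ETA into [0, 1]


def score(c: Candidate) -> float:
    """Higher is better: low cost, short ETA, good experience."""
    efficiency = 1.0 - min(c.eta_minutes, MAX_ETA) / MAX_ETA
    return (
        WEIGHTS["cost"] * (1.0 - c.cost)
        + WEIGHTS["efficiency"] * efficiency
        + WEIGHTS["experience"] * c.experience
    )


def pick_rider(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Return the best-scoring candidate, or None if there are none."""
    if not candidates:
        return None
    return max(candidates, key=score)
```

A production dispatcher would typically optimize assignments for many orders and riders jointly rather than one order at a time; the single-order score here only shows how the three objectives can be traded off.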

The underlying distributed system relies on Meituan's common components and services, providing partition scaling, disaster recovery, and monitoring. Front‑end traffic is balanced by HLB. Within a partition, services communicate via OCTO, which offers registration, discovery, load balancing, fault tolerance, and gray‑release capabilities. Message queues such as Kafka and RabbitMQ are also used. Storage accesses a distributed database through Zebra. System and business logs are collected and monitored by the open‑source CAT system. Distributed caching combines Squirrel and Cellar, and task scheduling is handled by Crane.

Two practical challenges stand out. The first is cluster scalability: stateful clusters expand slowly and cannot absorb traffic spikes quickly. The second is hotspot resources, such as CPU usage that is unevenly distributed across nodes.

Solutions: transform stateful nodes into stateless ones and leverage parallel computation so smaller nodes share the load, enabling rapid scaling.
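The idea of stateless nodes sharing the load can be sketched as follows. The matching function depends only on the batch it receives, so any worker can process any batch and adding workers needs no state migration. Everything here is an assumption made for illustration: positions are one-dimensional numbers, `match_batch` and `match_all` are hypothetical names, and threads stand in for separate service nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Tuple

Orders = Dict[int, float]   # order id -> position (hypothetical 1-D coordinate)
Riders = Dict[str, float]   # rider id -> position


def match_batch(batch: Tuple[Orders, Riders]) -> Dict[int, str]:
    """Pure function: the result depends only on the batch passed in.

    Assumes at least one rider is available.
    """
    orders, riders = batch
    return {
        order_id: min(riders, key=lambda r: abs(riders[r] - pos))
        for order_id, pos in orders.items()
    }


def match_all(orders: Orders, riders: Riders, workers: int) -> Dict[int, str]:
    """Split orders into independent batches and match them in parallel."""
    items = list(orders.items())
    # Round-robin split; empty chunks are dropped.
    chunks = [dict(items[i::workers]) for i in range(workers)]
    batches = [(chunk, riders) for chunk in chunks if chunk]
    result: Dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(match_batch, batches):
            result.update(partial)
    return result
```

Because no worker keeps state between calls, the result is the same whether one worker or many are used; only the throughput changes.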

Consistency is addressed with Databus, a high‑availability, low‑latency, high‑concurrency system that reliably propagates database changes in real time, ensuring cache and database stay synchronized.
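The consumer side of such change propagation might look like the sketch below. It is not Databus's API: `ChangeEvent`, its `version` field, and `CacheSync` are hypothetical names, and the in-memory dictionary stands in for a real cache. The one idea it illustrates is that applying changes by version keeps the cache correct even when events arrive late or more than once.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ChangeEvent:
    key: str
    version: int          # hypothetical, increases with each committed change
    row: Optional[dict]   # None means the row was deleted


class CacheSync:
    """Applies database change events to a cache, skipping stale ones."""

    def __init__(self) -> None:
        self.cache: Dict[str, dict] = {}
        self.versions: Dict[str, int] = {}

    def apply(self, event: ChangeEvent) -> bool:
        """Return True if the event changed the cache, False if it was stale."""
        if event.version <= self.versions.get(event.key, -1):
            return False  # duplicate or out-of-order delivery
        self.versions[event.key] = event.version
        if event.row is None:
            self.cache.pop(event.key, None)
        else:
            self.cache[event.key] = event.row
        return True
```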

High availability is ensured through pre‑capacity testing, periodic health checks, chaos engineering (random fault injection), real‑time alerting, rapid fault localization (single‑machine, cluster, IDC, component, service), systematic change collection, and post‑incident actions such as rollback, throttling, circuit breaking, degradation, and fallback mechanisms.
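Of the mechanisms listed above, circuit breaking combined with fallback is easy to show in a few lines. The following is a minimal generic sketch, not Meituan's or OCTO's implementation; the class name, thresholds, and injectable `clock` parameter are assumptions chosen for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stops calling a failing dependency and serves a fallback instead."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: do not touch the dependency
            # Half-open: allow one trial call; a single failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # success closes the breaker fully
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_eta, lambda: DEFAULT_ETA)`, where `fetch_eta` and `DEFAULT_ETA` are placeholders for a real dependency and its degraded answer.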

Single IDC Rapid Deployment & Disaster Recovery

After a single IDC fails, the entrance services detect the fault and switch traffic automatically. To scale out to a new IDC quickly, data is synchronized and services are deployed in advance; traffic is opened only after the services are ready. All data‑sync and traffic‑distribution services must be able to detect faults automatically and be removed from rotation when they fail. Capacity is expanded and reduced one IDC at a time.
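The switching rule described here can be sketched as a small routing function at the entrance layer. This is an illustrative assumption rather than the behavior of HLB or any Meituan component: the `Idc` fields and the `route` function are hypothetical, and health is a plain flag instead of a real probe.

```python
import zlib
from dataclasses import dataclass
from typing import List


@dataclass
class Idc:
    name: str
    healthy: bool = True
    data_synced: bool = False        # data has been pre-synchronized
    services_deployed: bool = False  # services have been pre-deployed

    @property
    def ready(self) -> bool:
        """An IDC takes traffic only when healthy and fully prepared."""
        return self.healthy and self.data_synced and self.services_deployed


def route(idcs: List[Idc], request_key: str) -> str:
    """Pick a ready IDC for the request; failed or unprepared IDCs are skipped."""
    ready = [idc for idc in idcs if idc.ready]
    if not ready:
        raise RuntimeError("no IDC is ready to serve traffic")
    # crc32 gives a stable hash across processes, unlike built-in hash().
    index = zlib.crc32(request_key.encode("utf-8")) % len(ready)
    return ready[index].name
```

Marking an IDC as unhealthy removes it from the candidate list on the next request, and a newly added IDC stays out of rotation until both preparation flags are set.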

Multi‑Center Attempts

Meituan groups multiple IDC resources into a virtual center, treating the center as a partition unit. Services are deployed uniformly across the center. When center capacity is insufficient, new IDC resources are added to expand capacity.

Unitization Attempts

Compared with multi‑center, unitization offers a superior solution for partition disaster recovery and scaling. Traffic routing is based on business characteristics, using regional or city routing. Data synchronization across locations may experience latency. SET disaster recovery ensures that if a local or remote SET fails, another SET can quickly take over the traffic.
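City-based routing with SET takeover might be expressed as below. The city names, SET names, and the backup table are invented for the example, and the set of healthy SETs is passed in rather than discovered; none of this reflects Meituan's actual routing tables.

```python
from typing import Dict, Set

# Hypothetical routing tables: each city is served by one primary SET,
# and each SET names the SET that takes over if it fails.
SET_OF_CITY: Dict[str, str] = {"beijing": "set-north", "shanghai": "set-east"}
BACKUP_OF: Dict[str, str] = {"set-north": "set-east", "set-east": "set-north"}


def route_to_set(city: str, healthy_sets: Set[str]) -> str:
    """Route a request to its city's SET, or to the backup SET on failure."""
    primary = SET_OF_CITY.get(city)
    if primary is None:
        raise KeyError(f"no SET configured for city {city!r}")
    if primary in healthy_sets:
        return primary
    backup = BACKUP_OF.get(primary)
    if backup is not None and backup in healthy_sets:
        return backup
    raise RuntimeError(f"no healthy SET available for city {city!r}")
```

Because cross-location data synchronization may lag, a real takeover would also have to decide how to treat writes that had not yet reached the backup SET; the sketch covers only the routing decision.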

Core Intelligent Logistics Technology and Platform Accumulation

The machine‑learning platform is an end‑to‑end solution for model training and algorithm deployment, addressing the challenges of numerous algorithmic scenarios, redundant development, and inconsistent online/offline data quality.

JARVIS is an AIOps platform focused on stability, reducing alarm noise, and automating fault analysis to improve efficiency and reliability in distributed cluster operations.
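One simple form of alarm noise reduction is suppressing repeats of the same alert within a time window. The sketch below shows only that generic technique; it is not a description of how JARVIS works, and the tuple layout and the 300-second default window are assumptions.

```python
from typing import Dict, List, Tuple

Alert = Tuple[int, str, str]  # (timestamp in seconds, service, alert kind)


def suppress_duplicates(alerts: List[Alert], window_seconds: int = 300) -> List[Alert]:
    """Keep the first alert per (service, kind) and drop repeats inside the window.

    The window is measured from the last alert that was kept, so a
    long-running incident still produces a reminder once per window.
    """
    last_kept: Dict[Tuple[str, str], int] = {}
    kept: List[Alert] = []
    for ts, service, kind in sorted(alerts):
        key = (service, kind)
        if key in last_kept and ts - last_kept[key] < window_seconds:
            continue
        last_kept[key] = ts
        kept.append((ts, service, kind))
    return kept
```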

Future Challenges

Looking back, several major challenges lie ahead. Microservices are no longer “micro”: as business complexity grows, services bloat. In mesh‑style service clusters, even slight latency is amplified along the call chain. Complex topologies make rapid fault localization difficult, which is a key focus for AIOps. Finally, unitization shifts operations from the cluster level to the unit level, which poses significant deployment challenges for Meituan’s business.


