How Meituan Scaled Instant Logistics with Distributed Architecture and AI
This article details Meituan's five‑year journey building a high‑availability, low‑latency instant logistics platform, covering distributed system evolution, AI‑driven optimization, fault‑tolerance techniques, and future challenges in scaling micro‑services and AIOps.
Background
Meituan Waimai has been operating for five years, and its instant-logistics business for more than three. As the business grew from zero to massive scale, the team accumulated experience building distributed, high-concurrency systems. Two main takeaways are:
Instant logistics tolerates almost no failure or high latency; as complexity grew, the system had to become distributed, scalable, and fault-tolerant, which progressively drove downtime toward zero.
Focusing on cost, efficiency, and experience, the instant-logistics platform applies AI heavily across pricing, ETA, dispatch, capacity planning, subsidies, accounting, voice interaction, LBS mining, operations, and monitoring, achieving growth in scale, better experience, and lower cost.
The article introduces technical obstacles and challenges encountered during the layered evolution of Meituan’s instant‑logistics distributed architecture:
Massive order and rider scale creates ultra‑large‑scale matching computation.
Holiday or severe weather causes order spikes many times the normal peak.
Logistics fulfillment links online to offline; failure tolerance is extremely low, requiring high availability and no lost orders.
Real‑time, accurate data demands low latency and high sensitivity to anomalies.
Meituan Instant‑Logistics Architecture
The platform focuses on three aspects: (1) providing SLA for users, including ETA calculation and delivery‑fee pricing; (2) matching the most suitable rider under multi‑objective (cost, efficiency, experience) optimization; (3) offering riders decision‑support such as intelligent voice, route recommendation, and store‑arrival reminders.
Behind these services is a powerful technical system that relies on a distributed architecture to guarantee high availability and high concurrency.
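The multi-objective matching described above can be sketched as a weighted scoring of candidate riders. The weights, field names, and scoring formula below are purely illustrative assumptions, not Meituan's actual dispatch model:

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: score candidate riders on a weighted blend of
// cost, efficiency (ETA), and experience (rating). Lower cost and ETA
// are better; higher rating is better. Weights are illustrative.
public class RiderScoring {
    record Candidate(String riderId, double deliveryCostYuan,
                     double etaMinutes, double customerRating) {}

    static double score(Candidate c) {
        return 0.4 * (1.0 / (1.0 + c.deliveryCostYuan()))
             + 0.4 * (1.0 / (1.0 + c.etaMinutes()))
             + 0.2 * (c.customerRating() / 5.0);
    }

    static Candidate best(List<Candidate> candidates) {
        return candidates.stream()
                .max(Comparator.comparingDouble(RiderScoring::score))
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<Candidate> cs = List.of(
            new Candidate("r1", 6.0, 25.0, 4.8),
            new Candidate("r2", 5.0, 12.0, 4.6),
            new Candidate("r3", 8.0, 30.0, 5.0));
        System.out.println(best(cs).riderId()); // prints "r2"
    }
}
```

In production, a formulation like this becomes a large-scale assignment problem over many riders and orders rather than a per-order argmax.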
Distributed design is constrained by the CAP theorem (Consistency, Availability, Partition tolerance): under a network partition, a system must trade consistency against availability. Services are deployed on multiple peer nodes that communicate over the network, forming a cluster that provides highly available service while accepting the consistency trade-offs CAP implies.
Initially Meituan used vertical services per business domain; later, to improve availability, a layered service architecture was introduced, and eventually a micro‑service architecture emerged, following the principle that good architecture evolves rather than being designed prematurely.
Distributed System Practice
The typical Meituan distributed system structure relies on public components and services to achieve partition scaling, disaster recovery, and monitoring. Front‑end traffic is balanced by HLB; services within a partition communicate via OCTO for registration, discovery, load balancing, fault tolerance, and gray release. Message queues such as Kafka or RabbitMQ can also be used. Storage accesses distributed databases via Zebra. Monitoring and logging are handled by the open‑source CAT system. Distributed cache uses a Squirrel+Cellar combo, and task scheduling is performed by Crane.
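The registration, discovery, and load-balancing duties handled by a framework like OCTO can be illustrated with a minimal in-memory registry. This is a sketch under assumed names, not OCTO's actual API:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal registration/discovery sketch with round-robin load balancing,
// in the spirit of what a service-governance framework provides.
// All class and method names here are illustrative assumptions.
public class ServiceRegistry {
    private final Map<String, List<String>> instances = new ConcurrentHashMap<>();
    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    public void register(String service, String address) {
        instances.computeIfAbsent(service, k -> new CopyOnWriteArrayList<>())
                 .add(address);
    }

    public void deregister(String service, String address) {
        List<String> addrs = instances.get(service);
        if (addrs != null) addrs.remove(address); // fault-tolerant removal
    }

    // Round-robin pick among registered instances.
    public String discover(String service) {
        List<String> addrs = instances.get(service);
        if (addrs == null || addrs.isEmpty()) {
            throw new IllegalStateException("no instance for " + service);
        }
        int i = counters.computeIfAbsent(service, k -> new AtomicInteger())
                        .getAndIncrement();
        return addrs.get(Math.floorMod(i, addrs.size()));
    }
}
```

A real registry adds health checks, weighted routing, and gray-release grouping on top of this core loop.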
Key challenges include cluster scalability—stateful clusters scale poorly—and node hotspot issues such as uneven resource or CPU usage.
To address scalability, the backend team turned stateful nodes into stateless ones and leveraged parallel computation, allowing small service nodes to share load and enabling rapid expansion.
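The stateless-node idea can be sketched as deterministic sharding: because workers hold no session state, any node can process any shard, and adding nodes simply re-partitions the work. The hashing scheme below is an illustrative assumption:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch: a stateless worker holds no session data, so orders can be
// partitioned across any number of interchangeable nodes and processed
// in parallel. The shard function is illustrative.
public class StatelessSharding {
    // Deterministically map an order to one of n stateless workers.
    static int shardOf(String orderId, int workers) {
        return Math.floorMod(orderId.hashCode(), workers);
    }

    // Group orders by shard; each group can be handled by a separate node.
    static Map<Integer, List<String>> partition(List<String> orders, int workers) {
        return orders.stream()
                .collect(Collectors.groupingBy(o -> shardOf(o, workers)));
    }
}
```

Rapid expansion then means raising `workers` and redistributing shards, with no per-node state to migrate.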
Consistency between database and cache writes is ensured by Databus, a high‑availability, low‑latency, high‑concurrency system that streams binlog changes to downstream stores (ES, other DBs, KV systems), guaranteeing eventual data consistency.
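The binlog-streaming idea behind such a pipeline can be sketched as an ordered replay of change events into a cache: the database stays the source of truth, and downstream stores converge to it. The event shape and method names below are illustrative assumptions, not Databus's API:

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of eventual consistency via change-event replay: binlog events
// are applied in order to a downstream cache, which therefore converges
// to the database state. Event fields are illustrative.
public class ChangeStreamApplier {
    record ChangeEvent(String key, String value, boolean delete) {}

    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Drain the event stream, applying each change to the cache.
    public void apply(Queue<ChangeEvent> binlog) {
        ChangeEvent e;
        while ((e = binlog.poll()) != null) {
            if (e.delete()) cache.remove(e.key());
            else cache.put(e.key(), e.value());
        }
    }

    public String get(String key) { return cache.get(key); }
}
```

The essential property is ordering per key: as long as events for the same key are applied in commit order, the cache is eventually consistent with the database.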
High availability is pursued in three stages: before an incident, full‑link stress testing, capacity estimation, periodic health checks, and random fault drills (service, machine, component); during an incident, real‑time alerts on performance, business metrics, and availability, together with rapid fault localization (single machine, cluster, IDC, component, service); and once a fault is identified, recovery actions such as rollback, throttling, circuit breaking, degradation, and fallback.
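Of these recovery mechanisms, circuit breaking is the easiest to sketch: after enough consecutive failures the breaker opens and requests go straight to the fallback. This is a minimal illustration (no half-open timer, assumed threshold semantics), not Meituan's implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Minimal circuit-breaker sketch: after `threshold` consecutive
// failures the breaker opens and the fallback is served directly.
// The missing half-open/reset timer is a deliberate simplification.
public class CircuitBreaker<T> {
    private final int threshold;
    private final AtomicInteger failures = new AtomicInteger();

    public CircuitBreaker(int threshold) { this.threshold = threshold; }

    public T call(Supplier<T> primary, Supplier<T> fallback) {
        if (failures.get() >= threshold) {
            return fallback.get();            // open: short-circuit the call
        }
        try {
            T result = primary.get();
            failures.set(0);                  // success closes the breaker
            return result;
        } catch (RuntimeException e) {
            failures.incrementAndGet();       // count the failure
            return fallback.get();
        }
    }
}
```

Production-grade breakers (e.g. the pattern popularized by Hystrix) add sliding failure windows and a half-open probe state before fully closing again.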
Single‑IDC Rapid Deployment & Disaster Recovery
After a single‑IDC failure, entry services detect the fault and automatically switch traffic. Rapid scaling synchronizes data and deploys services ahead of time, opening traffic only when ready. All data‑sync and traffic‑distribution services must support automatic fault detection and removal, and scaling is performed per IDC.
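The automatic traffic switch at the entry layer can be sketched as priority routing over health-checked IDCs. Health states are injected booleans here; real probes are RPC or HTTP checks, and all names are illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of entry-layer IDC failover: traffic follows the first IDC
// (in priority order) whose health probe passes; when the primary
// fails, the switch happens automatically on the next routing decision.
public class IdcFailover {
    // LinkedHashMap preserves priority order of IDCs as reported.
    private final Map<String, Boolean> idcHealthy = new LinkedHashMap<>();

    public void report(String idc, boolean healthy) {
        idcHealthy.put(idc, healthy);
    }

    // Route to the first healthy IDC; fail loudly if none remains.
    public String route() {
        return idcHealthy.entrySet().stream()
                .filter(Map.Entry::getValue)
                .map(Map.Entry::getKey)
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no healthy IDC"));
    }
}
```

The "open traffic only when ready" rule from the text corresponds to reporting an IDC healthy only after its data sync and deployment have completed.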
Multi‑Center Attempts
When a partition cannot expand due to resource saturation, multiple IDC units form a virtual center; services are deployed uniformly across the center. If capacity is insufficient, new IDC units are added to expand.
Unitization Attempts
Compared with multi‑center, unitization offers a finer‑grained disaster‑recovery and scaling solution. Traffic routing is based on region or city. Data synchronization may experience latency across locations. SET disaster recovery ensures that if a local or remote SET fails, traffic can be quickly shifted to another SET.
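Region-based SET routing can be sketched as a city-to-unit mapping with a failover table. The mapping structure and failover policy below are illustrative assumptions:

```java
import java.util.Map;
import java.util.Set;

// Sketch of unitized (SET) routing: each request is pinned to a unit
// by city, and a failed SET's cities are remapped to its designated
// backup SET. Mapping and policy are illustrative.
public class SetRouter {
    private final Map<String, String> cityToSet;   // city -> home SET
    private final Map<String, String> failoverSet; // SET  -> backup SET

    public SetRouter(Map<String, String> cityToSet,
                     Map<String, String> failoverSet) {
        this.cityToSet = cityToSet;
        this.failoverSet = failoverSet;
    }

    public String route(String city, Set<String> downSets) {
        String set = cityToSet.get(city);
        if (set == null) {
            throw new IllegalArgumentException("unknown city " + city);
        }
        return downSets.contains(set) ? failoverSet.get(set) : set;
    }
}
```

Because routing is keyed by city, a SET failure moves whole cities at once, which keeps each order's data confined to a single unit.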
Core Intelligent‑Logistics Capabilities and Platform Accumulation
The machine‑learning platform provides an end‑to‑end environment for model training and algorithm deployment, solving the problems of diverse algorithm scenarios, duplicated effort, and inconsistent data quality between online and offline.
JARVIS is an AIOps platform focused on stability, handling massive duplicate alerts, and improving fault‑analysis efficiency. It replaces manual, experience‑based troubleshooting with automated, reliable incident detection and resolution.
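One concrete piece of the duplicate-alert problem is window-based suppression: identical alerts inside a time window collapse into a single notification. The fingerprint scheme and window length below are assumptions, not JARVIS internals:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of duplicate-alert suppression: an alert with the same
// fingerprint is delivered at most once per time window. Window
// length and fingerprint format are illustrative.
public class AlertDeduplicator {
    private final long windowMillis;
    private final Map<String, Long> lastSent = new ConcurrentHashMap<>();

    public AlertDeduplicator(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Returns true if the alert should be delivered, false if suppressed.
    public boolean shouldSend(String fingerprint, long nowMillis) {
        Long prev = lastSent.get(fingerprint);
        if (prev != null && nowMillis - prev < windowMillis) {
            return false; // duplicate within the window: suppress
        }
        lastSent.put(fingerprint, nowMillis);
        return true;
    }
}
```

A full AIOps pipeline would layer correlation and root-cause analysis on top of this, grouping related alerts by topology rather than only by exact fingerprint.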
Future Challenges
Reviewing the current system reveals upcoming challenges:
Micro‑services are no longer "micro" as business complexity grows, leading to service bloat.
Mesh‑style service clusters amplify even slight latency.
Complex topologies make rapid fault localization difficult, a key focus for AIOps.
Unit‑based operations shift maintenance from cluster level to unit level, demanding new deployment capabilities.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Java High-Performance Architecture
Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.