Design and Scaling of Meituan Delivery Real‑Time Feature Platform
This article details how Meituan built a minute‑level, high‑throughput real‑time feature platform for its delivery business, covering the business model, the six‑layer architecture, data‑processing challenges, stability measures, scaling milestones, and the future roadmap for a system that produces over ten million features per minute with sub‑50 ms response latency.
In May 2019, Meituan launched the "Meituan Delivery" brand and upgraded its open delivery platform. To support intelligent decision‑making during order fulfillment, it built a real‑time feature platform that generates millions of features per minute and sustains a load of over 700,000 QPS, with 99.99% of responses completing within 50 ms.
The delivery business links users, merchants, and riders, forming a closed‑loop model that aims to improve efficiency, experience, and cost. The fulfillment process involves multiple physical steps, requiring an intelligent decision system for dispatch, ETA estimation, pricing, and load balancing.
Two main problems drove the platform's construction in 2017: (1) the shift from rule‑based to algorithm‑driven decision‑making, which required minute‑level real‑time data; and (2) fragmented, duplicated development across four teams, which caused low efficiency and high risk.
The platform’s goals were to provide minute‑level real‑time features, improve development efficiency, and reduce costs. A three‑stage evolution plan was defined: systematization, scaling, and platformization.
Systematization introduced a standardized architecture with six layers: data source, data (ODS/DWD), compute, storage, service, and application. Standards included process standardization, data layering, and feature fallback to reduce risk.
Key challenges in real‑time data processing were out‑of‑order event streams and guaranteeing exactly‑once semantics. The solutions were to pre‑build wide‑table templates that are filled in as events arrive, to merge upstream streams without message loss, and to deduplicate downstream.
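The idea above can be sketched briefly. In this minimal illustration (the event types, field names, and schema are assumptions for the example, not Meituan's actual design), each order gets a pre‑built wide‑table row from a template; events fill in their columns regardless of arrival order, and a seen‑set makes processing idempotent, so at‑least‑once delivery plus downstream deduplication yields effectively‑once results:

```python
# Hypothetical wide-table template: one row per order, columns pre-declared
# so out-of-order events simply fill in their slot when they arrive.
WIDE_TABLE_TEMPLATE = {"order_id": None, "created_at": None,
                       "accepted_at": None, "delivered_at": None}
FIELD_FOR_EVENT = {"create": "created_at", "accept": "accepted_at",
                   "deliver": "delivered_at"}

class WideTableFiller:
    def __init__(self):
        self.rows = {}   # order_id -> wide-table row
        self.seen = set()  # (order_id, event_type) keys already applied

    def apply(self, event):
        key = (event["order_id"], event["type"])
        if key in self.seen:      # downstream deduplication: ignore replays
            return
        self.seen.add(key)
        row = self.rows.setdefault(
            event["order_id"],
            dict(WIDE_TABLE_TEMPLATE, order_id=event["order_id"]))
        # Out-of-order events still land in the correct pre-built column.
        row[FIELD_FOR_EVENT[event["type"]]] = event["ts"]
```

Because each event writes only its own column, a late "create" arriving after "deliver" still produces a correct row, which is why the template approach tolerates out‑of‑order streams.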
The compute layer moved from heavy Storm/RPC solutions to lightweight in‑memory SQL processing with region‑based sharding, using H2 for fast feature calculation.
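To illustrate region‑based sharding over an embedded in‑memory SQL engine: the sketch below uses Python's built‑in SQLite in place of H2 (a stand‑in, since H2 is a Java library), with one in‑memory database per delivery region; the class, table, and feature query are hypothetical examples:

```python
import sqlite3

class RegionShardedFeatureStore:
    """One embedded in-memory SQL database per delivery region,
    so feature queries only scan that region's small working set."""
    def __init__(self):
        self.shards = {}  # region_id -> sqlite3 connection

    def _shard(self, region_id):
        if region_id not in self.shards:
            conn = sqlite3.connect(":memory:")
            conn.execute(
                "CREATE TABLE orders (order_id TEXT, rider_id TEXT, status TEXT)")
            self.shards[region_id] = conn
        return self.shards[region_id]

    def upsert(self, region_id, order_id, rider_id, status):
        self._shard(region_id).execute(
            "INSERT INTO orders VALUES (?, ?, ?)", (order_id, rider_id, status))

    def feature(self, region_id, rider_id):
        # Example real-time feature: in-flight orders for a rider in a region.
        cur = self._shard(region_id).execute(
            "SELECT COUNT(*) FROM orders WHERE rider_id = ? AND status = 'dispatched'",
            (rider_id,))
        return cur.fetchone()[0]
```

Keeping each shard's data resident in memory and scoped to one region is what lets plain SQL run fast enough for per‑request feature calculation.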
Stability was ensured through a four‑layer monitoring system (hardware, components, services, data quality) and a disaster‑recovery framework with isolation, double‑cache design, capacity planning, circuit‑breakers, and multi‑level degradation. No S‑level incidents occurred.
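A circuit breaker with a fallback path is the core of the degradation design described above. The following is a minimal count‑based sketch (thresholds, timings, and names are illustrative assumptions, not Meituan's implementation): after repeated failures the circuit opens and requests go straight to a degraded path (e.g. a cached or default feature value), and after a cool‑down one trial call is allowed through:

```python
import time

class CircuitBreaker:
    """Minimal count-based circuit breaker: after `threshold` consecutive
    failures the circuit opens and calls fall back immediately; after
    `reset_after` seconds one trial call is allowed (half-open)."""
    def __init__(self, threshold=5, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()      # degraded path, backend not touched
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0              # success closes the loop
        return result
```

Layering several such fallbacks (fresh feature, cached feature, static default) gives the multi‑level degradation the article describes.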
During the scaling stage, the platform handled 100+ algorithm versions and 200+ real‑time features, achieving over 600,000 QPS with responses within 50 ms, and processing over 10 million features per minute. Performance optimizations targeted I/O (request batching, Protocol Buffers serialization), CPU (the G1 garbage collector), and memory usage.
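The I/O batching mentioned above can be shown in a few lines. This is a generic micro‑batching sketch (the class and parameter names are assumptions for illustration): updates are buffered and flushed in one call once the batch fills, trading a little latency for far fewer network round‑trips:

```python
class BatchingWriter:
    """Buffers feature updates and flushes them in one I/O call when the
    batch is full, amortizing per-request overhead across many records."""
    def __init__(self, sink, batch_size=100):
        self.sink = sink          # callable taking a list of records
        self.batch_size = batch_size
        self.buffer = []

    def write(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)  # one round-trip for many records
            self.buffer = []
```

In a real pipeline the flush would also be triggered by a timer so that a half‑full batch never waits indefinitely; compact binary serialization such as Protocol Buffers then shrinks each batch on the wire.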
Platformization added SDKs for third‑party feature collection, introduced Flink and Storm behind an engine‑routing layer, and expanded data granularity with GeoHash, AOI, weather, and trajectory features. The upgraded architecture integrated new data sources and services while keeping a unified code base.
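GeoHash, one of the spatial granularities mentioned above, is a standard public algorithm: it interleaves longitude and latitude bits and encodes them in base‑32, so nearby points share a common prefix. The sketch below is a straightforward textbook implementation, not Meituan's code:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard GeoHash alphabet

def geohash_encode(lat, lon, precision=7):
    """Encode a (lat, lon) pair as a GeoHash string of `precision` chars."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    result, ch, bit = [], 0, 0
    even = True  # bits alternate, starting with longitude
    while len(result) < precision:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                ch, lon_lo = (ch << 1) | 1, mid
            else:
                ch, lon_hi = ch << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                ch, lat_lo = (ch << 1) | 1, mid
            else:
                ch, lat_hi = ch << 1, mid
        even = not even
        bit += 1
        if bit == 5:               # every 5 bits become one base-32 char
            result.append(BASE32[ch])
            ch, bit = 0, 0
    return "".join(result)
```

Because prefixes identify nested cells, aggregating orders or rider positions by a GeoHash prefix gives cheap region‑level spatial features.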
Future plans focus on data governance (integrating real‑time features with other real‑time data warehouses), full‑chain data quality, and consolidating compute engines to lower operational costs.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.