Building a Real-Time Data Warehouse with Flink at Meituan
Meituan replaced its Storm‑based pipeline with a four‑layer real‑time data warehouse powered by Flink, using hybrid storage (Cellar KV, Elasticsearch, Druid, MySQL) to deliver low‑latency, high‑throughput services, dramatically simplifying SQL‑driven development, unifying metrics, cutting compute costs, and paving the way for offline‑grade accuracy and reliability.
In recent years, enterprises have increasingly demanded real‑time data services. This article summarizes the performance characteristics and applicable scenarios of common real‑time data components, and describes how Meituan built a real‑time data warehouse using the Flink engine to provide efficient and robust real‑time data services.
Figure 1 – Early real‑time data architecture (Storm‑based).
Initially, the real‑time system followed a straight‑through (“one path to the end”) development model: Storm jobs consumed real‑time queues, computed metrics, and pushed them directly to real‑time applications. As product and business demands grew, several challenges emerged:
As the number of metrics grew, the “siloed” development approach led to severe code coupling.
Diverse requirements (detail data vs. OLAP analysis) made a single development model insufficient.
The lack of a comprehensive monitoring system prevented early detection and remediation of issues.
Construction of the Real‑Time Data Warehouse
To address these problems, Meituan adopted a layered design inspired by offline data warehouse practices. The architecture consists of four layers:
1. ODS layer – ingesting binlog, traffic logs, and business real‑time queues.
2. Detail layer – integrating business facts, building real‑time dimension data from both full offline loads and incremental streams.
3. Summary layer – using wide‑table models to enrich detail data with dimensions and aggregate common metrics.
4. App layer – providing services via RPC for specific business needs.
Figure 2 – Layered architecture of the real‑time data warehouse.
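To make the four layers concrete, here is a minimal sketch in plain Python rather than Flink. The event fields, the shop dimension table, and the function names are all hypothetical; they only illustrate how each layer hands a cleaner, more reusable dataset to the next one.

```python
from collections import defaultdict

# Hypothetical raw events as they might land in the ODS layer (e.g. parsed binlog rows).
ODS_EVENTS = [
    {"order_id": 1, "shop_id": "s1", "amount": 30.0, "raw_source": "binlog"},
    {"order_id": 2, "shop_id": "s2", "amount": 12.5, "raw_source": "binlog"},
    {"order_id": 3, "shop_id": "s1", "amount": 7.5, "raw_source": "binlog"},
]

# Hypothetical dimension table; in the article this role is played by a KV store.
SHOP_DIM = {"s1": {"city": "Beijing"}, "s2": {"city": "Shanghai"}}


def detail_layer(events):
    """Turn raw events into business facts by keeping only the modeled fields."""
    for e in events:
        yield {"order_id": e["order_id"], "shop_id": e["shop_id"], "amount": float(e["amount"])}


def summary_layer(facts, shop_dim):
    """Widen each fact with a dimension attribute, then aggregate a common metric."""
    revenue_by_city = defaultdict(float)
    for f in facts:
        city = shop_dim.get(f["shop_id"], {}).get("city", "unknown")
        revenue_by_city[city] += f["amount"]
    return dict(revenue_by_city)


def app_layer(summary, city):
    """Serve one metric to a caller; stands in for an RPC endpoint."""
    return summary.get(city, 0.0)


summary = summary_layer(detail_layer(ODS_EVENTS), SHOP_DIM)
print(app_layer(summary, "Beijing"))
```

Because the application only calls `app_layer`, a change to how revenue is defined stays inside `summary_layer`, which is the decoupling the layered design is after.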
Technology Selection – Storage Engine Research
The design uses a hybrid approach: structured data is stored in message queues for streaming consumption, while high‑speed KV stores (Cellar, an internal Tair‑based KV) serve dimension data. For the application layer, storage choices depend on read/write frequency and query complexity:
High‑frequency, simple queries (≈1000 QPS) – Cellar.
Complex queries or need for full‑text search – Elasticsearch.
Low‑frequency OLAP analysis – Druid.
Iterative product development – MySQL.
Figure 3 – Storage tiering in the real‑time warehouse.
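The selection rules above can be written down as a small decision function. This is only an illustrative sketch: the `QueryProfile` fields, the order in which the rules are checked, the use of 1000 QPS as a hard threshold, and MySQL as the fallback are assumptions made for the example, not rules stated by the article.

```python
from dataclasses import dataclass

# The article's "≈1000 QPS" figure, treated here as a hard threshold (assumption).
HIGH_QPS = 1000


@dataclass
class QueryProfile:
    """Hypothetical description of how an application reads its data."""
    qps: int
    complex_or_full_text: bool = False
    olap_analysis: bool = False


def choose_store(profile):
    """Map a query profile to one of the four stores named in the article."""
    if profile.complex_or_full_text:
        return "Elasticsearch"   # complex queries or full-text search
    if profile.olap_analysis:
        return "Druid"           # low-frequency OLAP analysis
    if profile.qps >= HIGH_QPS:
        return "Cellar"          # high-frequency, simple key lookups
    return "MySQL"               # assumed fallback: products still iterating


print(choose_store(QueryProfile(qps=1500)))
```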
Technology Selection – Compute Engine Research
Initially, Storm was used for real‑time processing, but its low‑level API required extensive custom code for common operations (joins, aggregations) and introduced extra dependencies. After evaluating Storm, Flink, and Spark Streaming, Flink was chosen because:
It offers a high‑level API with SQL support.
State management and fault‑tolerance mechanisms meet the required reliability.
Performance tests showed Flink achieving roughly ten times the throughput of Storm.
Figure 4 – Flink vs. Storm comparison.
Flink Usage Insights
Dimension Expansion: Dimension data is fetched from Cellar through asynchronous calls with sub‑millisecond latency. Combining asynchronous calls with caching reduces RPC overhead and improves throughput.
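The effect of combining asynchronous lookups with a cache can be sketched with Python's `asyncio` instead of Flink's own async operators. `fake_kv_fetch` is a hypothetical stand‑in for a Cellar call, and the cache here has no eviction or expiry, which a real job would need.

```python
import asyncio


class CachedDimensionClient:
    """Cache in front of an asynchronous dimension lookup.

    Lookups are cached as tasks, so concurrent requests for the same key
    share one remote call instead of each issuing their own.
    """

    def __init__(self, fetch):
        self._fetch = fetch
        self._tasks = {}
        self.remote_calls = 0

    async def get(self, key):
        task = self._tasks.get(key)
        if task is None:
            self.remote_calls += 1
            task = asyncio.ensure_future(self._fetch(key))
            self._tasks[key] = task
        return await task


async def fake_kv_fetch(shop_id):
    """Hypothetical remote KV lookup; the sleep simulates network latency."""
    await asyncio.sleep(0.001)
    return {"shop_id": shop_id, "city": "city-" + shop_id}


async def enrich(events, client):
    """Look up all dimensions concurrently, then attach them to the events."""
    dims = await asyncio.gather(*(client.get(e["shop_id"]) for e in events))
    return [{**e, "city": d["city"]} for e, d in zip(events, dims)]


EVENTS = [{"shop_id": "s1"}, {"shop_id": "s2"}, {"shop_id": "s1"}]
client = CachedDimensionClient(fake_kv_fetch)
enriched = asyncio.run(enrich(EVENTS, client))
print(client.remote_calls)
```

Three events are enriched with only two remote calls, because the second lookup for `s1` reuses the first one.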
Data Joining: Flink’s Table API performs windowed joins, which require at least one equality condition so that both streams can be partitioned by the join key. Window size directly affects memory usage and checkpoint cost; for large windows, the RocksDB state backend with incremental checkpoints is recommended.
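Why window size drives state size is easier to see in a toy version of a windowed join. The sketch below is plain Python, not Flink code; the record layout `(key, timestamp, value)` and the function name are assumptions for illustration. Rows are buffered per key and dropped once they are older than the window, so a larger window means more buffered rows.

```python
from collections import defaultdict, deque


def window_join(left, right, window):
    """Join two streams on equal keys when timestamps differ by at most `window`.

    Each record is (key, timestamp, value). Returns (key, left_value, right_value).
    """
    # Merge both streams in timestamp order, tagging the side of each record.
    merged = sorted(
        [(ts, "L", key, val) for key, ts, val in left]
        + [(ts, "R", key, val) for key, ts, val in right],
        key=lambda rec: rec[0],
    )
    # Buffered rows per side and per key: this is the join state.
    state = {"L": defaultdict(deque), "R": defaultdict(deque)}
    out = []
    for ts, side, key, val in merged:
        other = "R" if side == "L" else "L"
        buf = state[other][key]
        # Lazily evict rows from the other side that are too old to match.
        while buf and buf[0][0] < ts - window:
            buf.popleft()
        for _other_ts, other_val in buf:
            out.append((key, val, other_val) if side == "L" else (key, other_val, val))
        state[side][key].append((ts, val))
    return out


orders = [("o1", 0, "order-o1"), ("o2", 5, "order-o2")]
payments = [("o1", 3, "pay-o1"), ("o2", 30, "pay-o2")]
joined = window_join(orders, payments, window=10)
print(joined)
```

With a window of 10, the payment for `o2` arrives 25 time units after its order and finds nothing to join with, because that order has already been evicted from state.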
Aggregation: Built‑in aggregations (sum, min, max, avg) are supported. For distinct aggregations, custom UDAFs were developed using three techniques: MapView (exact), BloomFilter (approximate), and HyperLogLog (approximate, very low memory). The choice of state backend (RocksDB vs. FsStateBackend) affects serialization overhead for large keys.
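The trade‑off between exact and approximate distinct counting can be illustrated outside Flink. The sketch below compares a set‑based exact counter with a basic HyperLogLog in plain Python. It is not the UDAF code from the article: the class names, the SHA‑1 based hash, and the choice of 4096 registers are assumptions for the example.

```python
import hashlib
import math


class ExactDistinct:
    """Exact distinct count: memory grows with the number of distinct items."""

    def __init__(self):
        self._seen = set()

    def add(self, item):
        self._seen.add(item)

    def count(self):
        return len(self._seen)


class HyperLogLog:
    """Approximate distinct count with a fixed number of small registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p              # number of registers (4096 for p=12)
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")      # 64-bit hash value
        idx = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return round(self.m * math.log(self.m / zeros))
        return round(raw)


exact, approx = ExactDistinct(), HyperLogLog()
for i in range(20000):
    for _ in range(2):               # every user id is seen twice
        exact.add("user-%d" % i)
        approx.add("user-%d" % i)
print(exact.count(), approx.count())
```

The exact counter keeps every key, while the HyperLogLog state stays at 4096 registers regardless of cardinality, at the cost of a small estimation error (a standard error of roughly 1.6% for this register count).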
Figure 5 – End‑to‑end Flink real‑time table production flow.
Results of the Real‑Time Data Warehouse
By abstracting all processing steps into the layered warehouse, Meituan achieved unified data sources, consistent metric definitions, and the ability to change data definitions without modifying application code. Development effort dropped from ~300 lines of Java per job to a few dozen lines of SQL, shortening development cycles dramatically. Resource usage was optimized per layer (e.g., 1 GB RAM for ODS parsing, larger memory for aggregation layers), leading to lower overall compute costs while maintaining low latency.
Outlook
The goal is to make the real‑time warehouse match offline warehouses in accuracy and consistency, providing reliable data services to merchants, business users, and Meituan customers. Future work includes enhancing data reliability, monitoring, lineage detection, and reducing the learning curve for real‑time data development.
About the Author
Wei‑lun Wei, Real‑Time Data Lead at Meituan’s In‑Store Dining Technology Department, has focused on data platforms, real‑time computation, and data architecture since joining Meituan in 2017. He actively promotes the practical use of Flink in real‑time data processing.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.