How Meituan Built a Scalable Real‑Time Data Warehouse with Flink
This article explains Meituan's real‑time data warehouse architecture: typical business scenarios, the evolution of its streaming platform, key design challenges, the solutions adopted (unified data models, SQL‑based development, UDF hosting, and operator optimizations), and future plans for incremental processing and unified batch‑stream semantics.
Background and Real‑time Use Cases
Meituan’s local‑life services (food delivery, hotel booking, Meituan Select, etc.) depend on sub‑second data for a variety of internal applications. Typical scenarios include:
Metric monitoring – real‑time dashboards that show daily health metrics.
Real‑time feature generation – search, ad‑CTR prediction, rider scheduling, where feature freshness is critical.
Event‑driven processing – risk control, promotional coupon distribution, and other trigger‑based workflows.
Data reconciliation – aligning payment records between finance and business units to avoid financial loss.
Platform Evolution
The real‑time computing platform was launched in 2014 with Storm and Spark Streaming and a first‑generation job‑hosting service. In 2017 Flink was introduced, and by 2019 Flink SQL became the primary programming interface, shifting development from a task‑centric to a data‑centric model. Current focus areas are incremental warehouse production, unified stream‑batch semantics, and a common modeling approach.
Key Challenges in Real‑time Warehouse Construction
High development and O&M cost caused by frequent framework upgrades.
Mismatch between local development and online debugging.
Inconsistent data contracts across business units.
Absence of unified warehouse construction standards, leading to redundancy and resource waste.
To mitigate these issues, the platform provides standard ETL templates, a web‑based integrated development environment (IDE), extended SQL capabilities, data‑quality tooling, and strict latency monitoring.
Architecture Overview
The system is organized into three logical layers:
Base services – storage (Hive, Kafka, Redis), compute (Flink), scheduling, and logging.
Middleware – job templates, UDF hosting, metadata management, monitoring, and data‑quality services.
Upper layer – composable micro‑services (job‑template service, UDF service, metadata service) that can be used directly or through an integrated development platform.
Flink was selected over Storm because it offers sub‑second latency, exactly‑once semantics, higher throughput in benchmark tests, and a mature SQL API.
Design Principles and Practices
Unified data model – Hive tables, Kafka topics, and Redis domains are exposed as logical tables, allowing developers to switch between batch and streaming contexts without changing the schema.
SQL‑first development – the same SQL statement can be executed in batch or streaming mode, eliminating code duplication.
Adapter module – abstracts heterogeneous source formats (MySQL binlog, nginx logs, custom SDK logs) into a common schema, reducing integration effort.
UDF hosting service – central repository for user‑defined functions; code is compiled, pre‑checked for safety, and shared across projects.
Release pipeline – each job passes an automated TestCase suite before publishing; failures block deployment to ensure data quality.
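The adapter idea above can be illustrated with a minimal sketch. It is not Meituan's implementation: the `Event` schema, the field names (`ts_ms`, `pk`, `after`, `device_id`, `props`), and the simplified nginx line layout are all assumptions chosen for the example. It shows only the general pattern of one parser per source format, all producing the same record type.

```python
import json
import re
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Event:
    """Hypothetical common schema; field names are illustrative only."""
    source: str
    event_time_ms: int
    key: str
    payload: Dict[str, Any]


# Assumed, simplified nginx-style layout: ip [epoch_ms] "METHOD path" status
_NGINX_RE = re.compile(
    r'^(?P<ip>\S+) \[(?P<ts_ms>\d+)\] "(?P<method>\S+) (?P<path>\S+)" (?P<status>\d{3})$'
)


def from_binlog(raw: str) -> Event:
    # Assumes the binlog row was already decoded to JSON upstream.
    row = json.loads(raw)
    return Event("mysql_binlog", row["ts_ms"], str(row["pk"]), row["after"])


def from_nginx(raw: str) -> Event:
    m = _NGINX_RE.match(raw)
    if m is None:
        raise ValueError(f"unparseable nginx line: {raw!r}")
    payload = {"method": m["method"], "path": m["path"], "status": int(m["status"])}
    return Event("nginx", int(m["ts_ms"]), m["ip"], payload)


def from_sdk(raw: str) -> Event:
    rec = json.loads(raw)
    return Event("sdk", rec["time"], rec["device_id"], rec.get("props", {}))


# Registry mapping a source type to its parser; adding a format means adding one entry.
ADAPTERS: Dict[str, Callable[[str], Event]] = {
    "mysql_binlog": from_binlog,
    "nginx": from_nginx,
    "sdk": from_sdk,
}


def adapt(source_type: str, raw: str) -> Event:
    """Convert a raw record of the given source type into the common schema."""
    try:
        adapter = ADAPTERS[source_type]
    except KeyError:
        raise ValueError(f"no adapter registered for {source_type!r}") from None
    return adapter(raw)
```

With a registry like this, downstream SQL or operators depend only on the common schema, so supporting a new log format does not touch existing jobs.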
Operator‑level Optimizations
Massive join and aggregation workloads generate high I/O to external KV stores (Redis, HBase). The platform introduces a multi‑level local cache that:
Pre‑processes input streams to drop superseded events.
Caches state updates within the computation stage, merging rapid, duplicate updates into a single write.
Deduplicates downstream messages before emission, reducing network traffic and downstream processing load.
This three‑stage optimization (input preprocessing → computation‑stage caching → event emission deduplication) cuts external I/O by an order of magnitude in high‑throughput scenarios.
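The three stages can be sketched as follows. This is a toy model under stated assumptions, not the platform's actual operator code: updates are `(key, version, value)` tuples, `CountingStore` stands in for an external KV store such as Redis, and flushing is triggered manually rather than by a timer or checkpoint.

```python
from typing import Dict, Iterable, List, Tuple

Update = Tuple[str, int, int]  # (key, version, value): hypothetical event shape


class CountingStore:
    """Stand-in for an external KV store; counts how many writes it receives."""

    def __init__(self) -> None:
        self.data: Dict[str, int] = {}
        self.writes = 0

    def put(self, key: str, value: int) -> None:
        self.data[key] = value
        self.writes += 1


def drop_superseded(batch: Iterable[Update]) -> List[Update]:
    """Stage 1: within a micro-batch, keep only the newest version per key."""
    latest: Dict[str, Update] = {}
    for key, version, value in batch:
        if key not in latest or version >= latest[key][1]:
            latest[key] = (key, version, value)
    return list(latest.values())


class WriteCoalescingCache:
    """Stage 2: buffer updates locally; a flush issues one write per dirty key."""

    def __init__(self, store: CountingStore) -> None:
        self._store = store
        self._dirty: Dict[str, int] = {}

    def update(self, key: str, value: int) -> None:
        self._dirty[key] = value  # later updates overwrite earlier ones

    def flush(self) -> Dict[str, int]:
        flushed = dict(self._dirty)
        for key, value in flushed.items():
            self._store.put(key, value)
        self._dirty.clear()
        return flushed


class EmitDeduplicator:
    """Stage 3: suppress downstream messages whose value did not change."""

    def __init__(self) -> None:
        self._last: Dict[str, int] = {}

    def filter(self, flushed: Dict[str, int]) -> List[Tuple[str, int]]:
        out = [(k, v) for k, v in flushed.items()
               if k not in self._last or self._last[k] != v]
        self._last.update(flushed)
        return out


def ingest(batch: Iterable[Update], cache: WriteCoalescingCache) -> None:
    for key, _version, value in drop_superseded(batch):
        cache.update(key, value)


def flush_and_emit(cache: WriteCoalescingCache, dedup: EmitDeduplicator) -> List[Tuple[str, int]]:
    return dedup.filter(cache.flush())
```

In this model, four updates to two keys spread over two batches result in two external writes at flush time, and a later update that leaves a value unchanged produces no downstream message. A real operator would also have to tie flushing to checkpoints so that buffered updates are not lost on failure.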
Data Quality and Latency Monitoring
A dedicated pipeline executes a MiniCluster‑based Flink job for each release. Workers run the job, store results in a database, and expose logs and metrics through the UI. TestCases act like unit tests; only when all pass is the job promoted.
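The gating logic described above can be sketched in a few lines. This is a simplified illustration, not the platform's pipeline: the job is modeled as a plain Python function instead of a Flink job on a MiniCluster, and `JobTestCase`, `run_cases`, `can_promote`, and the sample job `count_orders_per_city` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple
Job = Callable[[Sequence[Row]], List[Row]]


@dataclass
class JobTestCase:
    """Fixture input plus the output the job is expected to produce."""
    name: str
    input_rows: Sequence[Row]
    expected_rows: Sequence[Row]


@dataclass
class CaseResult:
    name: str
    passed: bool
    actual: List[Row]


def run_cases(job: Job, cases: Sequence[JobTestCase]) -> List[CaseResult]:
    """Run the job once per case and compare results ignoring row order."""
    results = []
    for case in cases:
        try:
            actual = job(case.input_rows)
        except Exception:
            # A crashing job counts as a failed case rather than aborting the run.
            results.append(CaseResult(case.name, False, []))
            continue
        passed = sorted(actual) == sorted(case.expected_rows)
        results.append(CaseResult(case.name, passed, actual))
    return results


def can_promote(results: Sequence[CaseResult]) -> bool:
    """Promote only if at least one case ran and every case passed."""
    return bool(results) and all(r.passed for r in results)


def count_orders_per_city(rows: Sequence[Row]) -> List[Row]:
    """Hypothetical job under test: count (city, order_id) rows per city."""
    counts: Dict[str, int] = {}
    for city, _order_id in rows:
        counts[city] = counts.get(city, 0) + 1
    return list(counts.items())
```

Requiring a non-empty result list is a deliberate choice in this sketch: a job with no TestCases is treated as unverified and therefore blocked, rather than passing by default.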
End‑to‑end latency is measured with Flink’s LatencyMarker, a special record that, like a Watermark, travels through the pipeline alongside regular data. The platform controls how often markers are emitted based on traffic and the required latency granularity, forwards markers across tasks, and reports the resulting latency metrics to the internal Raptor monitoring system.
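The marker mechanism can be modeled with plain Python generators. This is a conceptual sketch and does not use Flink's API: `Marker`, `marker_interval`, and the `report` callback (a stand-in for a metrics system such as Raptor) are hypothetical, and the emission policy of roughly a fixed number of markers per second is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Union


@dataclass(frozen=True)
class Marker:
    """Carries the time at which it left the source."""
    emitted_at_ms: int
    source_id: str


Record = Union[Marker, dict]


def marker_interval(records_per_sec: float, markers_per_sec: float) -> int:
    """Emit one marker every N records so the marker rate stays roughly constant."""
    return max(1, round(records_per_sec / markers_per_sec))


def source(rows: Iterable[dict], clock: Callable[[], int], every_n: int,
           source_id: str = "src-0") -> Iterator[Record]:
    """Yield data rows and inject a timestamped marker after every N rows."""
    for i, row in enumerate(rows, start=1):
        yield row
        if i % every_n == 0:
            yield Marker(clock(), source_id)


def map_task(stream: Iterable[Record], fn: Callable[[dict], dict]) -> Iterator[Record]:
    """Apply fn to data rows; forward markers untouched."""
    for rec in stream:
        yield rec if isinstance(rec, Marker) else fn(rec)


def sink(stream: Iterable[Record], clock: Callable[[], int],
         report: Callable[[int], None]) -> List[dict]:
    """Collect data rows; for each marker, report the elapsed time since emission."""
    out = []
    for rec in stream:
        if isinstance(rec, Marker):
            report(clock() - rec.emitted_at_ms)
        else:
            out.append(rec)
    return out
```

Passing the clock in as a parameter keeps the sketch deterministic under test. In a real deployment the source and sink run on different machines, so the measurement is only as accurate as their clock synchronization.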
Future Directions
Improve runtime scalability for ultra‑large jobs (higher scheduling success rate, faster state access).
Support high‑availability requirements for consumer‑facing (ToC) scenarios.
Achieve true stream‑batch unification with a single semantics, execution layer, and storage backend.
Develop incremental warehouse production to maximize resource efficiency and enable deterministic, low‑cost processing.
The platform continues to evolve to meet the growing demands of sub‑second data processing across Meituan’s diverse services.
dbaplus Community
