Building a General Real‑Time Data Warehouse: Methods and Practices at Meituan Waimai
This article introduces Meituan Waimai's approach to constructing a universal real‑time data warehouse, covering streaming technology choices, Lambda/Kappa architectures, layered design, platformization, SLA management, and a practical Lambda‑style use case for real‑time analytics.
Introduction
This article presents a generic method and practice for building a real‑time data warehouse, aiming for end‑to‑end low latency, SQL standardization, rapid response to changes, and unified data.
Real‑time Scenarios
At Meituan Waimai, real‑time data is used in four areas: operations (business trend analysis, marketing effectiveness), production (system health monitoring), consumer‑facing services (search and recommendation), and risk control (fraud detection).
1. Real‑time Computing Technology Selection
Open‑source streaming engines such as Storm, Spark Streaming, and Flink are compared. Meituan Waimai originally used Storm for its stability and scalability, but is gradually migrating to Flink as it matures.
2. Real‑time Architecture
① Lambda Architecture
Lambda architecture adds a real‑time processing lane to a primarily batch system. The result is two independent production paths, so the same business logic is implemented twice and both paths consume resources.
② Kappa Architecture
Kappa unifies batch and real‑time processing into a single pipeline, but its applicability is limited and few industry cases exist.
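The single‑pipeline idea can be sketched as follows. This is a minimal illustration, not Meituan Waimai's implementation: the in‑memory `EVENT_LOG` list stands in for a retained message‑queue topic, and the function names and sample orders are hypothetical. The point is that when the logic changes, the same log is replayed through the revised code instead of maintaining a separate batch job.

```python
from collections import defaultdict

# Hypothetical append-only event log standing in for a retained queue topic.
EVENT_LOG = [
    {"order_id": 1, "city": "Beijing", "amount": 30.0},
    {"order_id": 2, "city": "Shanghai", "amount": 45.5},
    {"order_id": 3, "city": "Beijing", "amount": 12.5},
]


def revenue_by_city(events):
    """Single processing logic, used for live data and for replays alike."""
    totals = defaultdict(float)
    for event in events:
        totals[event["city"]] += event["amount"]
    return dict(totals)


def revenue_by_city_v2(events, min_amount):
    """Revised logic (assumed for illustration): ignore small orders."""
    return revenue_by_city(e for e in events if e["amount"] >= min_amount)


# Version 1 consumes the log once.
live_view = revenue_by_city(EVENT_LOG)

# After a logic change, the same log is replayed from the start
# through the new code; no separate batch path is needed.
replayed_view = revenue_by_city_v2(EVENT_LOG, min_amount=20.0)
```

The trade‑off is that replay is only possible for as long as the log is retained, which is one reason the approach does not fit every workload.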
3. Business Pain Points
Early development often embedded business logic directly into streaming jobs, leading to duplicated data reads, maintenance overhead, and resource explosion as the number of use cases grows.
4. Data Characteristics and Application Scenarios
Log‑type data: massive, semi‑structured, used for monitoring and real‑time feature extraction.
Business‑type data: structured transactional data (binlog) requiring multi‑table joins and stateful processing.
Challenges include multi‑state business processes, complex joins, and the need to batch‑process for analytics.
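To make the multi‑state challenge concrete, here is a small sketch of keeping the latest state per order from a stream of binlog‑style change rows. The field names (`order_id`, `status`, `updated_at`) and the sample rows are assumptions for illustration; a real job would hold this state inside the streaming engine rather than in a Python dict.

```python
from collections import Counter


def apply_change(state, change):
    """Upsert a change row, keeping the newest version per order_id."""
    current = state.get(change["order_id"])
    if current is None or change["updated_at"] >= current["updated_at"]:
        state[change["order_id"]] = change
    return state


# Hypothetical change rows; the "paid" update arrives after "delivered".
changes = [
    {"order_id": 100, "status": "created", "updated_at": 1},
    {"order_id": 100, "status": "delivered", "updated_at": 3},
    {"order_id": 100, "status": "paid", "updated_at": 2},  # late arrival
    {"order_id": 101, "status": "created", "updated_at": 2},
]

state = {}
for change in changes:
    apply_change(state, change)

# Metrics are computed from the current state, not from raw change rows,
# so an order that moved through several statuses is counted once.
status_counts = Counter(row["status"] for row in state.values())
```

Counting raw change rows instead would count order 100 three times, which is the kind of error that stateful processing is meant to prevent.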
5. Real‑time Data Warehouse Design
① Stream‑Batch Combined Architecture
Data flows from unified log collection to a message queue, then through an ETL layer. Log‑type data feeds real‑time dashboards; binlog data feeds real‑time OLAP.
② Layered Real‑time Warehouse
Source layer: logs and business data.
Real‑time detail layer: unified cleaning, filtering, enrichment, providing ready‑to‑use streams.
Summary layer: lightweight Flink/Storm operators produce aggregated metrics stored centrally.
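The three layers above can be sketched as a chain of small functions. This is an illustrative sketch only: plain Python generators stand in for Flink/Storm operators, and the record fields and the `SHOP_CITY` dimension table are hypothetical.

```python
from collections import Counter

# Source layer: raw records as they arrive (hypothetical sample data).
RAW_LOGS = [
    {"user_id": "u1", "event": "click", "shop_id": "s1"},
    {"user_id": None, "event": "click", "shop_id": "s1"},  # invalid record
    {"user_id": "u2", "event": "order", "shop_id": "s2"},
    {"user_id": "u3", "event": "click", "shop_id": "s2"},
]

# Hypothetical dimension table used for enrichment.
SHOP_CITY = {"s1": "Beijing", "s2": "Shanghai"}


def detail_layer(records):
    """Clean, filter, and enrich once, so downstream jobs share the result."""
    for record in records:
        if record["user_id"] is None:
            continue  # filter out records without a user
        city = SHOP_CITY.get(record["shop_id"], "unknown")
        yield {**record, "city": city}


def summary_layer(detail_records):
    """Lightweight aggregation over the ready-to-use detail stream."""
    return Counter((r["city"], r["event"]) for r in detail_records)


metrics = summary_layer(detail_layer(RAW_LOGS))
```

Because cleaning and enrichment happen once in the detail layer, a new metric only needs another small aggregation over the same detail stream rather than a new read of the source.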
6. Platform‑level Construction
Functions are abstracted into reusable components (cleaning, filtering, enrichment, encryption, etc.). Users can write custom Java or Python scripts for bespoke transformations.
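One way to picture the component approach is a registry of small functions that a pipeline assembles from a list of step names. Everything here is a hypothetical sketch: the component names, the `build_pipeline` helper, and the record fields are not from the platform itself, and one‑way SHA‑256 hashing stands in for the encryption component.

```python
import hashlib


def clean(record):
    """Strip surrounding whitespace from string fields."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def drop_test_users(record):
    """Filter component: returning None drops the record."""
    user_id = str(record.get("user_id", ""))
    return None if user_id.startswith("test_") else record


def mask_phone(record):
    """Stand-in for encryption: replace the phone with a SHA-256 digest."""
    phone = record.get("phone")
    if phone is None:
        return record
    digest = hashlib.sha256(phone.encode("utf-8")).hexdigest()
    return {**record, "phone": digest}


# Registry of reusable components; a user-supplied script would be
# registered here in the same way.
COMPONENTS = {
    "clean": clean,
    "drop_test_users": drop_test_users,
    "mask_phone": mask_phone,
}


def build_pipeline(step_names):
    """Assemble a record-level pipeline from configured component names."""
    steps = [COMPONENTS[name] for name in step_names]

    def run(record):
        for step in steps:
            record = step(record)
            if record is None:
                return None  # dropped by a filter component
        return record

    return run


pipeline = build_pipeline(["clean", "drop_test_users", "mask_phone"])
kept = pipeline({"user_id": " u1 ", "phone": "13800000000"})
dropped = pipeline({"user_id": "test_9", "phone": "13900000000"})
```

With this shape, adding a new job is mostly a matter of choosing and ordering step names, which is the configuration‑over‑code benefit the section describes.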
7. SLA Construction
Both end‑to‑end SLA and job‑level SLA are monitored via lightweight instrumentation and unified reporting.
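A rough sketch of the two SLA levels follows. It assumes each record is stamped at three points (event creation, job input, job output), then checks a latency percentile against a threshold. The timestamp field names, the thresholds, and the chosen percentiles are all assumptions for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


def sla_report(latencies_ms, threshold_ms, pct):
    """Compare the observed latency percentile with the SLA threshold."""
    observed = percentile(latencies_ms, pct)
    return {"pct": pct, "observed_ms": observed, "met": observed <= threshold_ms}


# Hypothetical instrumentation: timestamps in milliseconds per record.
records = [
    {"event_ts": 0, "job_in_ts": 40, "job_out_ts": 90},
    {"event_ts": 100, "job_in_ts": 130, "job_out_ts": 200},
    {"event_ts": 200, "job_in_ts": 260, "job_out_ts": 320},
    {"event_ts": 300, "job_in_ts": 350, "job_out_ts": 1500},  # slow record
    {"event_ts": 400, "job_in_ts": 420, "job_out_ts": 480},
]

# End-to-end latency: from event creation to final output.
end_to_end = [r["job_out_ts"] - r["event_ts"] for r in records]
# Job-level latency: time spent inside one job only.
job_level = [r["job_out_ts"] - r["job_in_ts"] for r in records]

end_to_end_sla = sla_report(end_to_end, threshold_ms=500, pct=95)
job_sla = sla_report(job_level, threshold_ms=100, pct=50)
```

Reporting both levels helps locate a breach: if the end‑to‑end SLA fails while every job‑level SLA passes, the delay is likely in transport or queuing rather than in a job.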
8. Real‑time OLAP Solution
Meituan Waimai adopts Apache Doris as its high‑performance OLAP engine. Doris enables fast roll‑up calculations, and a single table can hold both historical and current‑day partitions, which supports a Lambda‑style workflow.
Real‑time Application Example
A merchant wants to give discounts based on a user’s historical order count. By partitioning a Doris table into historical and today’s data, offline jobs populate the historical partition while real‑time jobs write today’s metrics, allowing a simple query to combine both.
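The query pattern can be sketched with an in‑memory SQLite database standing in for Doris. The table name `user_order_cnt`, its columns, and the sample rows are hypothetical, and a plain `part` column simulates the two partitions; this shows only the shape of the combined query, not Doris syntax or the production schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_order_cnt (user_id TEXT, part TEXT, order_cnt INTEGER)"
)

# The offline job fills the historical partition; the real-time job
# writes today's counts. Both land in the same table.
conn.executemany(
    "INSERT INTO user_order_cnt VALUES (?, ?, ?)",
    [
        ("u1", "history", 12),
        ("u2", "history", 3),
        ("u1", "today", 2),
        ("u3", "today", 1),
    ],
)


def total_orders(connection, user_id):
    """Sum historical and current-day counts for one user in one query."""
    row = connection.execute(
        "SELECT COALESCE(SUM(order_cnt), 0) FROM user_order_cnt WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    return row[0]


u1_total = total_orders(conn, "u1")  # history + today
u4_total = total_orders(conn, "u4")  # unknown user, no rows
```

The caller never needs to know which path produced a row, so the batch and real‑time results merge at query time instead of in application code.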
In summary, a layered, component‑driven real‑time data warehouse, combined with a unified SLA framework and Doris‑based OLAP, can efficiently support diverse, low‑latency business needs while keeping resource consumption under control.
DataFunTalk