
Building a General Real‑Time Data Warehouse: Methods and Practices at Meituan Waimai

This article introduces Meituan Waimai's approach to constructing a universal real‑time data warehouse, covering streaming technology choices, Lambda/Kappa architectures, layered design, platformization, SLA management, and a practical Lambda‑style use case for real‑time analytics.

DataFunTalk

Introduction

This article presents a generic method and practice for building a real‑time data warehouse, aiming for end‑to‑end low latency, SQL standardization, rapid response to changes, and unified data.

Real‑time Scenarios

In Meituan Waimai, real‑time data is used in operations (business trend analysis, marketing effect), production (system health monitoring), C‑end user services (search and recommendation), and risk control (fraud detection).

1. Real‑time Computing Technology Selection

Open‑source streaming engines such as Storm, Spark Streaming, and Flink are compared. Meituan Waimai originally chose Storm for its stability and scalability, and is gradually migrating to Flink as Flink matures.

2. Real‑time Architecture

① Lambda Architecture

Lambda architecture adds a real‑time processing lane to a primarily batch system, resulting in two independent production paths that can cause duplicated logic and resource consumption.

② Kappa Architecture

Kappa architecture unifies batch and real‑time processing into a single streaming pipeline, but its applicability is limited and few production cases exist in the industry.

3. Business Pain Points

Early development often embedded business logic directly into streaming jobs, leading to duplicated data reads, maintenance overhead, and resource explosion as the number of use cases grows.

4. Data Characteristics and Application Scenarios

Log‑type data: massive, semi‑structured, used for monitoring and real‑time feature extraction.

Business‑type data: structured transactional data (binlog) requiring multi‑table joins and stateful processing.

Challenges include multi‑state business processes, complex joins, and the need to batch‑process for analytics.
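
To make the multi‑state and join challenges concrete, here is a minimal sketch in plain Python (not Flink or Storm code). It assumes two hypothetical binlog sources, `orders` and `payments`, whose change events share an `order_id` and carry a monotonically increasing `ts`. These table names, field names, and the "drop older events" rule are illustrative assumptions, not details from the article.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, Optional


@dataclass
class OrderState:
    """Latest known view of one order, assembled from several binlog tables."""
    status: Optional[str] = None      # from the hypothetical `orders` table
    amount: Optional[float] = None    # from the hypothetical `payments` table
    version: Dict[str, int] = field(default_factory=dict)  # last applied ts per table


def apply_binlog(events: Iterable[dict]) -> Iterator[dict]:
    """Fold change events into keyed state; emit a joined row once both sides exist."""
    state: Dict[str, OrderState] = {}
    for ev in events:
        cur = state.setdefault(ev["order_id"], OrderState())
        # Skip out-of-order updates: only apply events newer than the last one
        # already applied from the same table.
        if ev["ts"] <= cur.version.get(ev["table"], -1):
            continue
        cur.version[ev["table"]] = ev["ts"]
        if ev["table"] == "orders":
            cur.status = ev["status"]
        elif ev["table"] == "payments":
            cur.amount = ev["amount"]
        if cur.status is not None and cur.amount is not None:
            yield {"order_id": ev["order_id"], "status": cur.status, "amount": cur.amount}


events = [
    {"table": "orders", "order_id": "o1", "ts": 1, "status": "created"},
    {"table": "payments", "order_id": "o1", "ts": 2, "amount": 25.0},
    {"table": "orders", "order_id": "o1", "ts": 3, "status": "delivered"},
    {"table": "orders", "order_id": "o1", "ts": 2, "status": "paid"},  # arrives late, dropped
]
joined = list(apply_binlog(events))
```

The point of the sketch is that the job must hold per‑order state across events and across tables. That is why binlog‑type data is harder to process than append‑only logs.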

5. Real‑time Data Warehouse Design

① Stream‑Batch Combined Architecture

Data flows from unified log collection to a message queue, then through an ETL layer. Log‑type data feeds real‑time dashboards; binlog data feeds real‑time OLAP.

② Layered Real‑time Warehouse

Source layer: logs and business data.

Real‑time detail layer: unified cleaning, filtering, enrichment, providing ready‑to‑use streams.

Summary layer: lightweight Flink/Storm operators produce aggregated metrics stored centrally.
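
As a rough illustration of what a lightweight summary‑layer operator computes, the sketch below aggregates detail‑layer records into fixed one‑minute (tumbling) windows keyed by city. It is plain Python rather than the Flink or Storm API, and the field names (`city`, `event_time`, `amount`) and the 60‑second window are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 60  # assumed window size, not taken from the article


def tumbling_window_sum(records: Iterable[dict]) -> Dict[Tuple[str, int], float]:
    """Sum `amount` per (city, window start) over fixed, non-overlapping windows."""
    totals: Dict[Tuple[str, int], float] = defaultdict(float)
    for r in records:
        # Align the event time down to the start of its window.
        window_start = r["event_time"] - r["event_time"] % WINDOW_SECONDS
        totals[(r["city"], window_start)] += r["amount"]
    return dict(totals)


# Hypothetical detail-layer records (event_time in seconds).
detail_stream = [
    {"city": "beijing", "event_time": 5, "amount": 10.0},
    {"city": "beijing", "event_time": 59, "amount": 5.0},
    {"city": "beijing", "event_time": 60, "amount": 7.0},
    {"city": "shanghai", "event_time": 30, "amount": 3.0},
]
summary = tumbling_window_sum(detail_stream)
```

Downstream consumers then read the small `summary` result instead of re‑scanning the detail stream, which is the reuse that the layering is meant to provide.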

6. Platform‑level Construction

Functions are abstracted into reusable components (cleaning, filtering, enrichment, encryption, etc.). Users can write custom Java or Python scripts for bespoke transformations.
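
One way to picture this component model is as a chain of small functions that each take a record and either return a transformed record or drop it. The sketch below is an assumption about how such a platform could be shaped, not Meituan's actual implementation. All component names are hypothetical, and a one‑way SHA‑256 hash stands in for the encryption component.

```python
import hashlib
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Record = dict
Component = Callable[[Record], Optional[Record]]  # returning None drops the record


def clean(record: Record) -> Optional[Record]:
    """Cleaning: strip surrounding whitespace from string fields."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def drop_missing_user(record: Record) -> Optional[Record]:
    """Filtering: discard records without a user id."""
    return record if record.get("user_id") else None


def make_enricher(city_names: Dict[int, str]) -> Component:
    """Enrichment: attach a city name from a (hypothetical) dimension lookup."""
    def enrich(record: Record) -> Optional[Record]:
        return {**record, "city_name": city_names.get(record.get("city_id"), "unknown")}
    return enrich


def mask_phone(record: Record) -> Optional[Record]:
    """Stand-in for the encryption component: replace the phone with a SHA-256 digest."""
    phone = record.get("phone")
    if phone is None:
        return record
    return {**record, "phone": hashlib.sha256(phone.encode("utf-8")).hexdigest()}


def tag_large_order(record: Record) -> Optional[Record]:
    """Example of a user-supplied custom transformation."""
    return {**record, "large_order": record.get("amount", 0) >= 100}


def run_pipeline(records: Iterable[Record], components: List[Component]) -> Iterator[Record]:
    """Apply components in order; stop processing a record as soon as one drops it."""
    for record in records:
        current: Optional[Record] = record
        for component in components:
            current = component(current)
            if current is None:
                break
        else:
            yield current


raw = [
    {"user_id": " u1 ", "city_id": 1, "phone": "13800000000", "amount": 120.0},
    {"user_id": "", "city_id": 2, "amount": 8.0},  # dropped by the filter
]
pipeline = [clean, drop_missing_user, make_enricher({1: "Beijing"}), mask_phone, tag_large_order]
out = list(run_pipeline(raw, pipeline))
```

Because every component shares the same signature, a new use case becomes mostly a matter of choosing and ordering existing components and adding a custom function only where needed.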

7. SLA Construction

Both end‑to‑end SLA and job‑level SLA are monitored via lightweight instrumentation and unified reporting.
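
A minimal sketch of such instrumentation follows. It assumes each record carries a timestamp for every stage it passes through, so job‑level latency is the gap between adjacent stages and end‑to‑end latency is the gap between the first and last. The stage names and the 5‑second target are illustrative assumptions, not figures from the article.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Trace:
    """Timestamps (in seconds) recorded as one record moves through the pipeline."""
    record_id: str
    stamps: Dict[str, float] = field(default_factory=dict)

    def mark(self, stage: str, now: float) -> None:
        self.stamps[stage] = now


def stage_latencies(trace: Trace, stages: List[str]) -> Dict[str, float]:
    """Job-level view: time spent between each pair of adjacent stages."""
    return {f"{a}->{b}": trace.stamps[b] - trace.stamps[a] for a, b in zip(stages, stages[1:])}


def end_to_end(trace: Trace, stages: List[str]) -> float:
    """End-to-end view: time from the first stage to the last."""
    return trace.stamps[stages[-1]] - trace.stamps[stages[0]]


STAGES = ["collected", "detail_layer", "summary_layer", "served"]  # hypothetical stages
SLA_SECONDS = 5.0  # assumed end-to-end target

trace = Trace("r1")
for stage, t in zip(STAGES, [100.0, 100.5, 102.0, 103.0]):
    trace.mark(stage, t)

per_job = stage_latencies(trace, STAGES)
total = end_to_end(trace, STAGES)
breached = total > SLA_SECONDS
```

Reporting both views together shows not only that a target was missed but also which job contributed the delay.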

8. Real‑time OLAP Solution

Meituan Waimai adopts Apache Doris as a high‑performance OLAP engine, enabling fast roll‑up calculations and supporting both historical and today’s partitions in a Lambda‑style workflow.

Real‑time Application Example

A merchant wants to give discounts based on a user’s historical order count. By partitioning a Doris table into historical and today’s data, offline jobs populate the historical partition while real‑time jobs write today’s metrics, allowing a simple query to combine both.
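
The pattern can be sketched as follows. An in‑memory SQLite database stands in for Doris so that the example is runnable, and a plain `dt` column stands in for Doris partitions. The table name, column names, and dates are all hypothetical.

```python
import sqlite3


def create_table(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE user_order_cnt ("
        " dt TEXT, user_id TEXT, order_cnt INTEGER,"
        " PRIMARY KEY (dt, user_id))"
    )


def load_history(conn: sqlite3.Connection, rows) -> None:
    """Offline job: (re)write historical daily counts."""
    conn.executemany("INSERT OR REPLACE INTO user_order_cnt VALUES (?, ?, ?)", rows)


def record_order_today(conn: sqlite3.Connection, today: str, user_id: str) -> None:
    """Real-time job: increment today's count for the user, creating the row if needed."""
    cur = conn.execute(
        "UPDATE user_order_cnt SET order_cnt = order_cnt + 1 WHERE dt = ? AND user_id = ?",
        (today, user_id),
    )
    if cur.rowcount == 0:
        conn.execute("INSERT INTO user_order_cnt VALUES (?, ?, 1)", (today, user_id))


def total_orders(conn: sqlite3.Connection, user_id: str) -> int:
    """One query covers both the historical rows and today's row."""
    row = conn.execute(
        "SELECT COALESCE(SUM(order_cnt), 0) FROM user_order_cnt WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
create_table(conn)
load_history(conn, [("2020-01-01", "u1", 3), ("2020-01-02", "u1", 2), ("2020-01-02", "u2", 1)])
record_order_today(conn, "2020-01-03", "u1")
record_order_today(conn, "2020-01-03", "u1")

u1_total = total_orders(conn, "u1")  # 3 + 2 historical, plus 2 today
u2_total = total_orders(conn, "u2")
u3_total = total_orders(conn, "u3")  # unknown user
```

Since the offline and real‑time jobs write disjoint date ranges, the serving query needs no merge logic beyond a `SUM`. A discount rule can then compare the total against whatever threshold the merchant sets.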

In summary, a layered, component‑driven real‑time data warehouse, combined with a unified SLA framework and Doris‑based OLAP, can efficiently support diverse, low‑latency business needs while keeping resource consumption under control.
