Big Data 29 min read

How to Build a Real‑Time Data Warehouse with Flink: Principles, Architecture, and Best Practices

This article explains why real‑time data warehouses are needed, outlines their core principles, compares them with offline warehouses, describes typical use cases such as real‑time OLAP, dashboards, feature generation and monitoring, and provides a step‑by‑step guide to designing, implementing, and operating a Flink‑based streaming warehouse with Kafka, HBase, and metadata management.

dbaplus Community
dbaplus Community
dbaplus Community
How to Build a Real‑Time Data Warehouse with Flink: Principles, Architecture, and Best Practices

Purpose and Principles of a Real‑Time Data Warehouse

Real‑time warehouses complement offline warehouses by providing low‑latency access to the most recent data (typically a three‑day window). They do not replace historical batch processing; instead they solve problems that offline warehouses cannot meet due to timeliness constraints.

Two guiding principles are applied:

Do not duplicate functionality already solved by the offline warehouse (e.g., month‑level historical aggregates).

Avoid forcing a warehouse model onto workloads that are inherently unsuitable for warehousing, such as ultra‑high‑frequency transactional processing.

Like offline warehouses, a real‑time warehouse must be topic‑oriented, integrated, and relatively stable.

Typical Real‑Time Use Cases

Real‑time OLAP : Extend existing OLAP tools to query fresh data without redesign.

Live dashboards : Power screens for flash‑sale monitoring or daily store performance.

Real‑time feature generation : Tag users or merchants with up‑to‑date attributes (e.g., “high‑value user”).

Business KPI monitoring : Detect drops instantly to reduce loss.

Real‑time data warehouse application scenarios
Real‑time data warehouse application scenarios

Concept Mapping Between Offline and Real‑Time Warehouses

Offline development relies on Hive SQL and batch jobs (MapReduce/Spark). The real‑time counterpart uses Flink SQL and continuously running Flink streaming jobs. Offline tables are Hive tables; real‑time tables are abstracted as Stream Tables . Storage shifts from HDFS to Kafka for streams and KV stores (e.g., Tair, HBase) for dimension data.

Concept mapping between offline and real‑time warehouses
Concept mapping between offline and real‑time warehouses

Overall Architecture

The architecture mirrors the classic ODS → DW → Application layers but reduces the number of intermediate layers to keep latency low. Data is retained for roughly three days, guaranteeing two‑day completeness during batch gaps.

Real‑time data warehouse architecture diagram
Real‑time data warehouse architecture diagram

ODS Layer Construction

Unify data sources as much as possible (e.g., all binlog, traffic logs, and system logs are ingested via Kafka).

Use partitioning to keep data locally ordered and to enable efficient pruning.

Primary sources are Kafka topics fed by CDC (binlog), web traffic logs, and system logs. Consistent source selection avoids out‑of‑order issues.

ODS layer design
ODS layer design

DW Layer Construction

Standardize and clean raw data, align schemas with the offline warehouse, and optionally use configuration‑driven ingestion rules to keep real‑time and offline DW in sync.

DW layer processing flow
DW layer processing flow

Dimension Data Handling

Dimensions are split by update frequency:

Low‑frequency dimensions (e.g., geographic codes) are loaded from offline dimension tables into an in‑memory cache.

High‑frequency dimensions (e.g., price, user status) are captured via change streams and stored as “link tables” in HBase. HBase’s multi‑version capability and TTL keep storage bounded.

Low‑frequency dimension storage
Low‑frequency dimension storage
High‑frequency dimension link table
High‑frequency dimension link table

Summary Layer Construction

Common metrics (PV, UV, revenue) are aggregated here. To avoid excessive state usage, approximate algorithms such as BloomFilter for distinct counting and HyperLogLog for cardinality estimation are recommended.

Flink’s built‑in windows (tumbling, sliding, session) enable fine‑grained time‑based aggregations. Configure TTL on state to prevent memory blow‑up.

When writing aggregated results to Kafka, convert retract streams (which contain both insert and delete records) to append‑only streams, typically by filtering out retraction (false) records.

Summary layer processing
Summary layer processing

Warehouse Quality Assurance

Metadata and Lineage : A metadata service generates a Flink Catalog that supplies table definitions to jobs. DDL parsing updates the metadata store, and job status is written back to support lineage tracing.

Metadata and lineage architecture
Metadata and lineage architecture

Data Quality Validation : Real‑time results are periodically persisted to Hive. Offline validation tools compare these results against batch outputs. Threshold‑based alerts highlight discrepancies, enabling pre‑production checks and continuous improvement.

Key Implementation Details

Programming Model

Use Flink SQL with user‑defined functions (UDFs) just as you would use Hive SQL offline. Streaming jobs run continuously as Flink streaming programs.

Physical Storage Choices

Event streams → Kafka topics.

Dimension data → KV stores (Tair, HBase) with TTL.

Handling Duplicates and Keys

Each record receives a unique key (identifies the exact change) and a primary key (identifies the logical entity). The primary key is used for Kafka partitioning to preserve order; the unique key enables de‑duplication downstream.

Versioning and Batch IDs

When schema evolves, embed a version field in the record so downstream jobs can apply the correct schema. For re‑processing scenarios (e.g., correcting a day's data), add a batch_id to distinguish normal streaming data from re‑imported data, allowing jobs to consume only the intended batch.

Dimension “Link” Tables (Lattice Tables)

High‑frequency dimensions are stored as time‑variant link tables (often called “拉链表”). Each change creates a new row with a timestamp; queries join on the event time to retrieve the correct dimension value.

Dimension Join Strategies

Use a UDTF with LATERAL TABLE to call an external dimension service at query time.

Alternatively, parse the SQL, replace dimension references with a flatMap that performs the join, and materialize the result as a new table. The UDTF approach is simpler for most cases.

Aggregation and State Management

Aggregations produce retract streams; to write to Kafka, filter out retraction records and keep only the insert (true) records. Example pseudo‑code:

SELECT
    key,
    SUM(metric) AS metric_sum,
    COUNT(*) AS cnt
FROM source_table
GROUP BY key;
-- The result is a retract stream.
-- Convert to append stream:
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY key ORDER BY proc_time DESC) AS rn
    FROM retract_stream
) WHERE rn = 1;

Configure state.ttl (e.g., state.ttl = 2h) on windows to bound memory usage.

Monitoring and Operations

Deploy a unified Flink job management platform that handles job submission, resource allocation, log inspection, and alerting. Integrate the platform with the metadata service so that any schema change automatically updates the Flink catalog.

Metadata & Lineage Workflow

Metadata service generates a Flink Catalog from stored table definitions.

During job compilation, DDL statements are parsed; new or altered tables are written back to the metadata store.

Job runtime status (running, failed, restarted) is persisted to the metadata store, enabling lineage graphs that show upstream‑downstream dependencies.

Continuous Data Quality Loop

1. Real‑time jobs write intermediate tables to Hive (e.g., rt_daily_summary).

2. Offline validation jobs compare rt_daily_summary with the batch‑generated offline_daily_summary.

3. If the difference exceeds a configured threshold, an alert is raised and the real‑time job is investigated.

This loop not only validates real‑time accuracy but also serves as a regression check for the offline pipeline.

Summary

The article presents Meituan‑Dianping’s practical experience building a Flink‑based real‑time data warehouse. It covers design philosophy, technology stack choices, layer‑by‑layer implementation (ODS, DW, dimension handling, summary, QA), and operational safeguards such as metadata management, lineage tracking, state‑size control, and continuous data‑quality validation.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

data engineeringFlinkKafkaOLAPreal-time data warehousedimension modeling
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.