How to Build a Real‑Time Data Warehouse with Flink: Principles, Architecture, and Best Practices
This article explains why real‑time data warehouses are needed, outlines their core principles, compares them with offline warehouses, describes typical use cases such as real‑time OLAP, dashboards, feature generation and monitoring, and provides a step‑by‑step guide to designing, implementing, and operating a Flink‑based streaming warehouse with Kafka, HBase, and metadata management.
Purpose and Principles of a Real‑Time Data Warehouse
Real‑time warehouses complement offline warehouses by providing low‑latency access to the most recent data (typically a three‑day window). They do not replace historical batch processing; instead they solve problems that offline warehouses cannot meet due to timeliness constraints.
Two guiding principles are applied:
Do not duplicate functionality already solved by the offline warehouse (e.g., month‑level historical aggregates).
Avoid forcing a warehouse model onto workloads that are inherently unsuitable for warehousing, such as ultra‑high‑frequency transactional processing.
Like offline warehouses, a real‑time warehouse must be topic‑oriented, integrated, and relatively stable.
Typical Real‑Time Use Cases
Real‑time OLAP : Extend existing OLAP tools to query fresh data without redesign.
Live dashboards : Power screens for flash‑sale monitoring or daily store performance.
Real‑time feature generation : Tag users or merchants with up‑to‑date attributes (e.g., “high‑value user”).
Business KPI monitoring : Detect drops instantly to reduce loss.
Concept Mapping Between Offline and Real‑Time Warehouses
Offline development relies on Hive SQL and batch jobs (MapReduce/Spark). The real‑time counterpart uses Flink SQL and continuously running Flink streaming jobs. Offline tables are Hive tables; real‑time tables are abstracted as Stream Tables . Storage shifts from HDFS to Kafka for streams and KV stores (e.g., Tair, HBase) for dimension data.
Overall Architecture
The architecture mirrors the classic ODS → DW → Application layers but reduces the number of intermediate layers to keep latency low. Data is retained for roughly three days, guaranteeing two‑day completeness during batch gaps.
ODS Layer Construction
Unify data sources as much as possible (e.g., all binlog, traffic logs, and system logs are ingested via Kafka).
Use partitioning to keep data locally ordered and to enable efficient pruning.
Primary sources are Kafka topics fed by CDC (binlog), web traffic logs, and system logs. Consistent source selection avoids out‑of‑order issues.
DW Layer Construction
Standardize and clean raw data, align schemas with the offline warehouse, and optionally use configuration‑driven ingestion rules to keep real‑time and offline DW in sync.
Dimension Data Handling
Dimensions are split by update frequency:
Low‑frequency dimensions (e.g., geographic codes) are loaded from offline dimension tables into an in‑memory cache.
High‑frequency dimensions (e.g., price, user status) are captured via change streams and stored as “link tables” in HBase. HBase’s multi‑version capability and TTL keep storage bounded.
Summary Layer Construction
Common metrics (PV, UV, revenue) are aggregated here. To avoid excessive state usage, approximate algorithms such as BloomFilter for distinct counting and HyperLogLog for cardinality estimation are recommended.
Flink’s built‑in windows (tumbling, sliding, session) enable fine‑grained time‑based aggregations. Configure TTL on state to prevent memory blow‑up.
When writing aggregated results to Kafka, convert retract streams (which contain both insert and delete records) to append‑only streams, typically by filtering out retraction (false) records.
Warehouse Quality Assurance
Metadata and Lineage : A metadata service generates a Flink Catalog that supplies table definitions to jobs. DDL parsing updates the metadata store, and job status is written back to support lineage tracing.
Data Quality Validation : Real‑time results are periodically persisted to Hive. Offline validation tools compare these results against batch outputs. Threshold‑based alerts highlight discrepancies, enabling pre‑production checks and continuous improvement.
Key Implementation Details
Programming Model
Use Flink SQL with user‑defined functions (UDFs) just as you would use Hive SQL offline. Streaming jobs run continuously as Flink streaming programs.
Physical Storage Choices
Event streams → Kafka topics.
Dimension data → KV stores (Tair, HBase) with TTL.
Handling Duplicates and Keys
Each record receives a unique key (identifies the exact change) and a primary key (identifies the logical entity). The primary key is used for Kafka partitioning to preserve order; the unique key enables de‑duplication downstream.
Versioning and Batch IDs
When schema evolves, embed a version field in the record so downstream jobs can apply the correct schema. For re‑processing scenarios (e.g., correcting a day's data), add a batch_id to distinguish normal streaming data from re‑imported data, allowing jobs to consume only the intended batch.
Dimension “Link” Tables (Lattice Tables)
High‑frequency dimensions are stored as time‑variant link tables (often called “拉链表”). Each change creates a new row with a timestamp; queries join on the event time to retrieve the correct dimension value.
Dimension Join Strategies
Use a UDTF with LATERAL TABLE to call an external dimension service at query time.
Alternatively, parse the SQL, replace dimension references with a flatMap that performs the join, and materialize the result as a new table. The UDTF approach is simpler for most cases.
Aggregation and State Management
Aggregations produce retract streams; to write to Kafka, filter out retraction records and keep only the insert (true) records. Example pseudo‑code:
SELECT
key,
SUM(metric) AS metric_sum,
COUNT(*) AS cnt
FROM source_table
GROUP BY key;
-- The result is a retract stream.
-- Convert to append stream:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY key ORDER BY proc_time DESC) AS rn
FROM retract_stream
) WHERE rn = 1;Configure state.ttl (e.g., state.ttl = 2h) on windows to bound memory usage.
Monitoring and Operations
Deploy a unified Flink job management platform that handles job submission, resource allocation, log inspection, and alerting. Integrate the platform with the metadata service so that any schema change automatically updates the Flink catalog.
Metadata & Lineage Workflow
Metadata service generates a Flink Catalog from stored table definitions.
During job compilation, DDL statements are parsed; new or altered tables are written back to the metadata store.
Job runtime status (running, failed, restarted) is persisted to the metadata store, enabling lineage graphs that show upstream‑downstream dependencies.
Continuous Data Quality Loop
1. Real‑time jobs write intermediate tables to Hive (e.g., rt_daily_summary).
2. Offline validation jobs compare rt_daily_summary with the batch‑generated offline_daily_summary.
3. If the difference exceeds a configured threshold, an alert is raised and the real‑time job is investigated.
This loop not only validates real‑time accuracy but also serves as a regression check for the offline pipeline.
Summary
The article presents Meituan‑Dianping’s practical experience building a Flink‑based real‑time data warehouse. It covers design philosophy, technology stack choices, layer‑by‑layer implementation (ODS, DW, dimension handling, summary, QA), and operational safeguards such as metadata management, lineage tracking, state‑size control, and continuous data‑quality validation.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
