Big Data 10 min read

Real-Time Data Warehouse Architecture and Challenges Using Flink, Kafka, and HBase

This article examines the design of a real-time data warehouse built on Flink, Kafka, and HBase, compares it with traditional offline warehouses, and discusses key challenges such as data accuracy, latency, and the complexity of maintaining real-time dimension tables.

Big Data Technology & Architecture

Jan 8, 2020

Real-Time Data Warehouse Architecture and Challenges Using Flink, Kafka, and HBase

1. Requirement Background

With the rapid development of big‑data applications, the need for timely data has become a must‑have; offline analysis is no longer sufficient. Real‑time processing frameworks such as Storm, Spark‑Streaming, and Flink address five essential requirements: state management, exactly‑once semantics, high throughput, elastic scalability, and fault tolerance. Flink satisfies all of these and also offers a high‑level Flink‑SQL API that reduces development cost and improves maintainability.

2. Offline vs. Real‑Time Data Warehouse Comparison

Traditional offline warehouses rely on batch ingestion and periodic ETL, while real‑time warehouses stream data continuously.

Real‑time warehouse architecture diagram:

3. Detailed Real‑Time Warehouse Architecture

3.1 Data Ingestion (Source)

Log data (traffic logs and binlog) is sent to an Nginx server, collected by Flume, and forwarded to Kafka. Kafka retains only the most recent day's data because older logs are not needed for real‑time analysis.

3.2 Data Transformation (Transform)

Flink‑SQL reads raw logs from Kafka, applies a custom UDTF to parse them, and writes structured records to a second Kafka topic (the ODS layer).

To verify accuracy, ODS data is also written to HDFS and mapped to Hive tables (hourly) for comparison with offline Hive tables.

The DWD layer reads from ODS and performs dimensional reductions according to business logic.

DM/RPT/APP layers use Flink window calculations, store results in Kafka, and then write them to HDFS and Hive (hourly) for downstream consumption.

3.3 Data Storage (Sink)

Real‑time dimension tables are stored in HBase, public‑layer data remains in Kafka, and rolling logs are persisted to HDFS.

4. Real‑Time Warehouse Challenges

4.1 Ensuring Data Accuracy

Differences between real‑time and offline ingestion are reconciled by re‑writing parsed Kafka data to HDFS, loading it into hourly Hive tables, and aligning time fields (e.g., using the nginx_ts field) to enable fair comparisons.

4.2 Controlling Latency

The main latency source is the UDTF parsing step. Benchmarking shows a processing speed of about 800 records/second (≈1.25 ms per record) with a single Flink core; higher ingest rates require additional cores.

4.3 Complexity of Real‑Time Dimension Tables

Full‑scale offline dimension tables (≈100 million rows) are too costly to rebuild in real time. A pseudo‑real‑time approach generates dimension tables at 24:00 for use the next day (T‑1 model). The process simplifies offline logic, drops rarely used fields, and uses an MD5‑based change‑flag to sync only modified rows to HBase, reducing daily write volume from hundreds of millions to a few million records.

Author: 愤怒的谜团 | Source: https://www.jianshu.com/p/18e21bd352b7

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Real-time Processing Flink kafka Data Warehouse HBase

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.