Real-Time Data Warehouse Architecture and Challenges Using Flink, Kafka, and HBase
This article examines the design of a real-time data warehouse built on Flink, Kafka, and HBase, compares it with traditional offline warehouses, and discusses key challenges such as data accuracy, latency, and the complexity of maintaining real-time dimension tables.
1. Requirement Background
With the rapid development of big‑data applications, the need for timely data has become a must‑have; offline analysis is no longer sufficient. Real‑time processing frameworks such as Storm, Spark‑Streaming, and Flink address five essential requirements: state management, exactly‑once semantics, high throughput, elastic scalability, and fault tolerance. Flink satisfies all of these and also offers a high‑level Flink‑SQL API that reduces development cost and improves maintainability.
2. Offline vs. Real‑Time Data Warehouse Comparison
Traditional offline warehouses rely on batch ingestion and periodic ETL, while real‑time warehouses stream data continuously.
Real‑time warehouse architecture diagram:
3. Detailed Real‑Time Warehouse Architecture
3.1 Data Ingestion (Source)
Log data (traffic logs and binlog) is sent to an Nginx server, collected by Flume, and forwarded to Kafka. Kafka retains only the most recent day's data because older logs are not needed for real‑time analysis.
3.2 Data Transformation (Transform)
Flink‑SQL reads raw logs from Kafka, applies a custom UDTF to parse them, and writes structured records to a second Kafka topic (the ODS layer).
To verify accuracy, ODS data is also written to HDFS and mapped to Hive tables (hourly) for comparison with offline Hive tables.
The DWD layer reads from ODS and performs dimensional reductions according to business logic.
DM/RPT/APP layers use Flink window calculations, store results in Kafka, and then write them to HDFS and Hive (hourly) for downstream consumption.
3.3 Data Storage (Sink)
Real‑time dimension tables are stored in HBase, public‑layer data remains in Kafka, and rolling logs are persisted to HDFS.
4. Real‑Time Warehouse Challenges
4.1 Ensuring Data Accuracy
Differences between real‑time and offline ingestion are reconciled by re‑writing parsed Kafka data to HDFS, loading it into hourly Hive tables, and aligning time fields (e.g., using the nginx_ts field) to enable fair comparisons.
4.2 Controlling Latency
The main latency source is the UDTF parsing step. Benchmarking shows a processing speed of about 800 records/second (≈1.25 ms per record) with a single Flink core; higher ingest rates require additional cores.
4.3 Complexity of Real‑Time Dimension Tables
Full‑scale offline dimension tables (≈100 million rows) are too costly to rebuild in real time. A pseudo‑real‑time approach generates dimension tables at 24:00 for use the next day (T‑1 model). The process simplifies offline logic, drops rarely used fields, and uses an MD5‑based change‑flag to sync only modified rows to HBase, reducing daily write volume from hundreds of millions to a few million records.
Author: 愤怒的谜团 | Source: https://www.jianshu.com/p/18e21bd352b7
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
