Can Flink Unify Real‑Time and Offline Data Warehouses? A Deep Dive
This article examines the challenges of maintaining separate offline and real‑time data warehouses, explains the three‑layer ODS‑DW‑ADS model, evaluates the traditional Lambda architecture, and explores how a unified Flink stack with Kafka, HiveCatalog and streaming sinks can simplify metadata, SQL development, data import/export, and stateful processing for both batch and streaming workloads.
Data Warehouse Architecture
A data warehouse is typically organized into three layers: ODS (Operational Data Store), DW (Data Warehouse) and ADS (Application Data Store). ODS holds raw data from logs or business databases; DW is split into DWD (detail) and DWS (service) layers; ADS serves final results directly to users.
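As a rough illustration of how data moves through these layers, the sketch below derives a detail table from raw logs and then a summary table from the detail table. All table and column names (ods_order_log, dwd_order_detail, dws_order_by_user_day) are hypothetical and chosen only for this example; the cleaning rule is likewise an assumption.

```sql
-- ODS -> DWD: keep cleaned detail rows from the raw log table (hypothetical names).
insert into dwd_order_detail
select id, user_id, amount, status, ts, ts_day
from ods_order_log
where id is not null;

-- DWD -> DWS: build a reusable per-user, per-day summary.
insert into dws_order_by_user_day
select user_id, ts_day, count(*) as order_cnt, sum(amount) as total_amount
from dwd_order_detail
group by user_id, ts_day;
```

An ADS table would then be a further query over the DWS table, shaped for one specific report or service.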
A typical enterprise runs its offline and real‑time warehouses as separate systems, which produces a complex stack that requires many components and specialized talent.
Lambda Architecture Overview
Most warehouses still use the Lambda architecture, which combines a batch layer (offline) with a speed layer (real‑time). It is flexible, but running two pipelines side by side makes it complex and costly.
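One way to picture the cost is the serving query that has to stitch the two layers together. The sketch below is an assumption for illustration only: batch_order_agg, realtime_order_agg and the cutoff date are hypothetical, and a real deployment would move the cutoff forward each time the batch layer finishes a run.

```sql
-- Hypothetical serving view in a Lambda setup.
create view order_amount_by_day as
select ts_day, total_amount
from batch_order_agg          -- recomputed periodically by the batch layer
where ts_day < '2020-06-01'   -- placeholder cutoff: last day covered by batch
union all
select ts_day, total_amount
from realtime_order_agg       -- maintained continuously by the speed layer
where ts_day >= '2020-06-01';
```

Both inputs must be produced by equivalent logic, usually written twice on two different engines, which is where much of the complexity comes from.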
Flink One‑Stack Computation
Flink aims to unify batch and streaming processing, leveraging Kafka for real‑time ingestion and Hive for metadata management.
Metadata Management
Offline warehouses use Hive Metastore. Kafka alone lacks metadata support, so two approaches are recommended:
Confluent Schema Registry – provides schema information via a service URL.
Catalog – Flink’s built‑in HiveCatalog can integrate Kafka tables into the Hive Metastore, allowing SQL access to both streaming and batch tables.
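Before the statements that follow can run, a HiveCatalog has to be registered under the name my_hive. A minimal sketch, assuming the Flink Hive connector is on the classpath; the configuration directory is a placeholder and should point at the directory containing your hive-site.xml.

```sql
-- Register a HiveCatalog backed by the Hive Metastore.
-- '/opt/hive/conf' is a placeholder path, not taken from the article.
create catalog my_hive with (
  'type' = 'hive',
  'hive-conf-dir' = '/opt/hive/conf'
);
```

Tables created while this catalog is active are persisted in the Hive Metastore, so their definitions are visible to later sessions.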
use catalog my_hive;
-- create the streaming database and tables
create database stream_db;
use stream_db;
create table order_table (
    id bigint,
    amount double,
    user_id bigint,
    status string,
    ts timestamp,
    ... -- other fields
    ts_day string,
    ts_hour string
) with (
    'connector.type' = 'kafka',
    ... -- Kafka table config
);
-- create the batch database and a Hive table with the same schema
create database batch_db;
use batch_db;
create table order_table
partitioned by (ts_day, ts_hour)
with (
    'connector.type' = 'hive',
    ... -- Hive table config
)
like stream_db.order_table (excluding options);
Data Import
Flink can import data into both the real‑time and the offline warehouse. Previously this was done with DataStream and StreamingFileSink, which lacked ORC support and could not update the Hive Metastore (HMS). With Flink's Hive streaming sink, imports can be written in SQL and become more flexible.
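On the batch side, the Hive streaming sink is configured through table properties that say when a partition is complete and what happens on commit. The sketch below assumes Flink 1.11 or later with the Hive dialect, a watermark defined on the source, and partition values formatted as yyyy-MM-dd and HH; the table name order_table_orc is hypothetical.

```sql
-- Switch to the Hive dialect. Newer SQL clients expect the quoted form:
--   set 'table.sql-dialect' = 'hive';
set table.sql-dialect=hive;

-- Hypothetical ORC table; the columns mirror order_table.
create table batch_db.order_table_orc (
  id bigint,
  amount double,
  user_id bigint,
  status string,
  ts timestamp
) partitioned by (ts_day string, ts_hour string)
stored as orc
tblproperties (
  -- how to derive a partition's time from its partition values
  'partition.time-extractor.timestamp-pattern' = '$ts_day $ts_hour:00:00',
  -- commit once the watermark passes the partition time plus the delay
  'sink.partition-commit.trigger' = 'partition-time',
  'sink.partition-commit.delay' = '1 h',
  -- on commit, add the partition to the Hive Metastore and write a _SUCCESS file
  'sink.partition-commit.policy.kind' = 'metastore,success-file'
);
```

With the metastore policy, downstream batch jobs only see a partition after it has been committed, which avoids reading half-written hours.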
insert into [stream_db.|batch_db.]order_table select ... from log_table;
Dimension Table Join
Streaming jobs require dynamic dimension tables. Flink can join against a JDBC‑backed dimension table, which is kept up to date by periodic imports from the batch warehouse.
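The join that follows uses "for system_time as of order_table.proctime", which needs a processing-time attribute on the order stream; the order_table definition earlier does not declare one. One possible way to add it, assuming Flink 1.11 or later, is to derive a table with a computed column. The name order_table_with_proctime is hypothetical.

```sql
-- Derive a table that reuses order_table's schema and connector options
-- and adds a processing-time attribute as a computed column.
create table order_table_with_proctime (
  proctime as proctime()
) like stream_db.order_table;
```

The join would then read from order_table_with_proctime and reference order_table_with_proctime.proctime. Declaring the same computed column directly in the original DDL works equally well.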
-- streaming dimension table
use stream_db;
create table user_info (
    user_id bigint,
    age int,
    address string,
    primary key (user_id) not enforced
) with (
    'connector.type' = 'jdbc',
    ...
);
-- import the batch dimension table into the streaming one
insert into user_info select * from batch_db.user_info;
-- dimension join
insert into order_with_user_age
select *
from order_table
join user_info for system_time as of order_table.proctime
    on user_info.user_id = order_table.user_id;
Stateful Computation and Data Export
Aggregations in streaming produce dynamic tables whose results change continuously. Flink uses retract streams to keep these results consistent with batch output, and can write them to updatable sinks (e.g., MySQL, HBase) or emit them as changelog streams for immutable sinks.
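For an updatable sink, the key declared on the sink table is what lets changing results overwrite earlier ones. A minimal sketch of a possible mysql_table definition, assuming the Flink 1.11+ JDBC connector option names; the URL, the physical table name and the column name avg_amount are placeholders.

```sql
-- With a primary key declared, the JDBC sink writes in upsert mode:
-- each new average for an age replaces the previous row for that age.
create table mysql_table (
  age int,
  avg_amount double,
  primary key (age) not enforced
) with (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://localhost:3306/dw',   -- placeholder URL
  'table-name' = 'avg_amount_by_age'          -- placeholder table name
);
```

Without the primary key the same sink would run in append mode, which cannot absorb the updates that a streaming group-by produces.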
-- batch: one-time output to MySQL
insert into mysql_table select age, avg(amount) from order_with_user_age group by age;
-- streaming: the same statement, producing continuous upserts to MySQL
insert into mysql_table select age, avg(amount) from order_with_user_age group by age;
Ad‑Hoc Queries and OLAP
Offline warehouses support ad‑hoc queries on detailed or aggregated data. Real‑time warehouses often lack this capability because they do not retain historical data. One solution is to provide batch‑stream unified sinks to OLAP systems such as Druid, Doris, ClickHouse, or HBase/Phoenix.
Conclusion
By unifying metadata, SQL development, data import/export, and eventually storage, Flink’s one‑stack approach aims to deliver a seamless experience for both offline and real‑time data warehousing, reducing complexity and talent costs.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.