Can Flink Unify Real‑Time and Offline Data Warehouses? A Deep Dive

This article examines the challenges of maintaining separate offline and real‑time data warehouses, explains the three‑layer ODS‑DW‑ADS model, evaluates the traditional Lambda architecture, and explores how a unified Flink stack with Kafka, HiveCatalog and streaming sinks can simplify metadata, SQL development, data import/export, and stateful processing for both batch and streaming workloads.

Alibaba Cloud Developer

Mar 19, 2020

Data Warehouse Architecture

A data warehouse typically consists of three layers: ODS (Operational Data Store), DW (Data Warehouse) and ADS (Application Data Store). ODS stores raw data from logs or business databases; DW is split into a DWD (detail) layer and a DWS (service) layer; ADS serves final results directly to users.

The typical enterprise architecture separates offline and real‑time warehouses, leading to complex stacks that require many systems and specialized talent.

Lambda Architecture Overview

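Most warehouses still use the Lambda architecture, which runs a batch layer (offline) alongside a speed layer (real‑time) and combines their results when serving queries. It is flexible, but the same logic has to be built and operated on two separate stacks, which makes it complex and costly.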
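
To make the cost of the two layers concrete, here is a minimal Python sketch of a Lambda-style pipeline. It is an illustration only, not how any particular system implements it: the event tuples, the `cutoff` parameter, and the function names `batch_view`, `speed_view`, and `serve` are all hypothetical.

```python
# Hypothetical events: (timestamp, key, amount)
EVENTS = [(1, "a", 10), (2, "b", 5), (3, "a", 7), (4, "c", 1)]


def batch_view(events, cutoff):
    """Batch layer: recompute totals from every event before the cutoff."""
    totals = {}
    for ts, key, amount in events:
        if ts < cutoff:
            totals[key] = totals.get(key, 0) + amount
    return totals


def speed_view(events, cutoff):
    """Speed layer: cover the events the last batch run has not seen yet."""
    totals = {}
    for ts, key, amount in events:
        if ts >= cutoff:
            totals[key] = totals.get(key, 0) + amount
    return totals


def serve(batch, speed):
    """Serving step: merge both views to answer a query."""
    merged = dict(batch)
    for key, value in speed.items():
        merged[key] = merged.get(key, 0) + value
    return merged


if __name__ == "__main__":
    # Events with ts < 3 belong to the batch view, the rest to the speed view.
    print(serve(batch_view(EVENTS, 3), speed_view(EVENTS, 3)))
    # prints {'a': 17, 'b': 5, 'c': 1}
```

Note that the aggregation logic appears twice, once per layer. In a real deployment those two copies usually live in different engines, which is the duplication a unified stack tries to remove.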

Flink One‑Stack Computation

Flink aims to unify batch and streaming processing, leveraging Kafka for real‑time ingestion and Hive for metadata management.

Metadata Management

Offline warehouses use Hive Metastore. Kafka alone lacks metadata support, so two approaches are recommended:

Confluent Schema Registry – provides schema information via a service URL.

Catalog – Flink’s built‑in HiveCatalog can integrate Kafka tables into the Hive Metastore, allowing SQL access to both streaming and batch tables.

use catalog my_hive;
-- create the streaming database and its Kafka-backed table
create database stream_db;
use stream_db;
create table order_table (
    id bigint,
    amount double,
    user_id bigint,
    status string,
    ts timestamp,
    ... -- other fields
    ts_day string,
    ts_hour string
) with (
    'connector.type' = 'kafka',
    ... -- Kafka table config
);

create database batch_db;
use batch_db;
create table order_table like stream_db.order_table (excluding options)
partitioned by (ts_day, ts_hour)
with (
    'connector.type' = 'hive',
    ... -- Hive table config
);

Data Import

Flink can import data into both the real‑time and the offline warehouse. Previously this was done with the DataStream API and StreamingFileSink, which did not support ORC and could not update the Hive Metastore (HMS). With Flink’s Hive streaming sink, imports can be written in SQL and become more flexible.

insert into [stream_db.|batch_db.]order_table select ... from log_table;
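
The batch table above is partitioned by `ts_day` and `ts_hour`. As a rough illustration of what a partitioned import does with each record, the Python sketch below groups rows into Hive-style `key=value` partition paths. The date format, the row layout, and the helper names `partition_of` and `write_partitioned` are assumptions made for this example; they do not come from Flink.

```python
from collections import defaultdict
from datetime import datetime


def partition_of(ts):
    """Derive (ts_day, ts_hour) from an event timestamp (format is an assumption)."""
    return ts.strftime("%Y-%m-%d"), ts.strftime("%H")


def write_partitioned(rows):
    """Group row ids by partition path, the way a partitioned file sink lays out data."""
    partitions = defaultdict(list)
    for row in rows:
        ts_day, ts_hour = partition_of(row["ts"])
        partitions[f"ts_day={ts_day}/ts_hour={ts_hour}"].append(row["id"])
    return dict(partitions)


if __name__ == "__main__":
    rows = [
        {"id": 1, "ts": datetime(2020, 3, 19, 10, 5)},
        {"id": 2, "ts": datetime(2020, 3, 19, 10, 59)},
        {"id": 3, "ts": datetime(2020, 3, 19, 11, 0)},
    ]
    for path, ids in write_partitioned(rows).items():
        print(path, ids)
```

A real streaming sink additionally has to decide when a partition is complete and register it in the metastore; that step is outside this sketch.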

Dimension Table Join

Streaming jobs require dynamic dimension tables. Flink can join a JDBC‑backed dimension table, keeping it up‑to‑date via periodic imports from the batch warehouse.

-- stream dimension table
use stream_db;
create table user_info (
    user_id bigint,
    age int,
    address string,
    primary key (user_id)
) with (
    'connector.type' = 'jdbc',
    ...
);
-- import batch dimension into streaming
insert into user_info select * from batch_db.user_info;
-- dimension join: look up user_info as of each order's processing time
insert into order_with_user_age
select *
from order_table
join user_info for system_time as of order_table.proctime
    on user_info.user_id = order_table.user_id;
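
The semantics of a processing-time dimension join can be sketched in a few lines of Python. This is a simplified model, not Flink's implementation: the event list, the `kind` labels, and the function name `lookup_join` are hypothetical, and a plain dict stands in for the JDBC-backed `user_info` table.

```python
def lookup_join(events):
    """Enrich each order with the user_info row visible when the order is processed."""
    user_info = {}  # stands in for the JDBC-backed dimension table
    enriched = []
    for kind, payload in events:
        if kind == "dim":
            # A periodic import from the batch warehouse updates the dimension.
            user_info[payload["user_id"]] = payload
        elif kind == "order":
            user = user_info.get(payload["user_id"])
            if user is not None:  # inner join: orders without a match are dropped
                enriched.append({**payload, "age": user["age"]})
    return enriched


if __name__ == "__main__":
    events = [
        ("dim", {"user_id": 1, "age": 30}),
        ("order", {"id": 100, "user_id": 1, "amount": 20.0}),
        ("dim", {"user_id": 1, "age": 31}),
        ("order", {"id": 101, "user_id": 1, "amount": 5.0}),
        ("order", {"id": 102, "user_id": 2, "amount": 9.0}),
    ]
    for row in lookup_join(events):
        print(row)
```

Order 100 keeps age 30 even after the dimension row changes to 31: a lookup at processing time never revises results it has already emitted, which is one way a streaming result can differ from a later batch recomputation.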

Stateful Computation and Data Export

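Aggregations in streaming produce dynamic tables whose results continuously change. Flink supports retract streams, which withdraw a previously emitted result before emitting its replacement, so that streaming results stay consistent with batch. The output can be applied to updatable sinks (e.g., MySQL, HBase), or written as a changelog stream to sinks that cannot update rows in place.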

-- batch: one‑time output to MySQL
insert into mysql_table select age, avg(amount) from order_with_user_age group by age;

-- streaming: continuous upserts to MySQL
insert into mysql_table select age, avg(amount) from order_with_user_age group by age;
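
The two statements are textually identical; what differs is how results reach the sink. The Python sketch below models that difference under simplifying assumptions: `avg_by_age_changelog` emits retract (`-`) and add (`+`) messages per input row, `apply_to_upsert_sink` applies them to a dict that stands in for the MySQL table, and `batch_avg` computes the one-time batch result. All three names and the message format are hypothetical, not Flink APIs.

```python
def avg_by_age_changelog(rows):
    """Streaming: yield ('-', age, old_avg) then ('+', age, new_avg) per input row."""
    state = {}  # age -> (running sum, count)
    for age, amount in rows:
        total, count = state.get(age, (0.0, 0))
        if count:
            yield ("-", age, total / count)  # retract the previous result
        total, count = total + amount, count + 1
        state[age] = (total, count)
        yield ("+", age, total / count)      # emit the updated result


def apply_to_upsert_sink(changelog):
    """Apply the changelog to a keyed table, as an updatable sink would."""
    table = {}
    for op, age, value in changelog:
        if op == "+":
            table[age] = value
        else:
            table.pop(age, None)
    return table


def batch_avg(rows):
    """Batch: compute the final averages in one pass over the complete input."""
    sums = {}
    for age, amount in rows:
        total, count = sums.get(age, (0.0, 0))
        sums[age] = (total + amount, count + 1)
    return {age: total / count for age, (total, count) in sums.items()}


if __name__ == "__main__":
    rows = [(30, 10.0), (25, 4.0), (30, 20.0)]
    print(list(avg_by_age_changelog(rows)))
    # prints [('+', 30, 10.0), ('+', 25, 4.0), ('-', 30, 10.0), ('+', 30, 15.0)]
    print(apply_to_upsert_sink(avg_by_age_changelog(rows)) == batch_avg(rows))
    # prints True
```

Once the stream has consumed the same rows as the batch job, the sink holds the same table, which is the consistency property the retract mechanism is meant to provide.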

Ad‑Hoc Queries and OLAP

Offline warehouses support ad‑hoc queries on detailed or aggregated data. Real‑time warehouses often lack this capability because they do not retain historical data. One solution is to provide batch‑stream unified sinks to OLAP systems such as Druid, Doris, ClickHouse, or HBase/Phoenix.

Conclusion

By unifying metadata, SQL development, data import/export, and eventually storage, Flink’s one‑stack approach aims to deliver a seamless experience for both offline and real‑time data warehousing, reducing complexity and talent costs.
