Can a Streaming Data Warehouse Balance Freshness, Latency, and Cost?
This article examines the three core trade-offs of a data warehouse (freshness, query latency, and cost), compares offline and real-time architectures, introduces the concept of a streaming data warehouse, and describes how Apache Flink Table Store aims to provide a unified, low-cost solution.
Data Warehouse Computation
In computing, a data warehouse (DW or DWH) is a system for reporting and data analysis, and is considered a core component of business intelligence. It stores current and historical data in one place so that analytical reports can be produced for the whole enterprise.
Typical ETL-based warehouses are organized into layers, commonly ODS (operational data store), DWD (data warehouse detail), and DWS (data warehouse summary). Each layer takes on part of the processing, and analysts can query any layer for business insights.
Key Metrics
Data Freshness: the time between when data is generated and when it becomes queryable. It is largely determined by how long ETL processing takes.
Query Latency: the time between a user submitting a query and receiving the results. It directly affects user experience.
Cost: the resources required to run ETL jobs and queries.
These three metrics form a trade‑off triangle: enterprises aim to improve freshness and latency while controlling cost, but improvements in one area often affect the others.
Industry Mainstream Architectures
Offline Warehouse
Offline warehouses run batch ETL jobs that rewrite whole partitions with INSERT OVERWRITE. This keeps cost under control, but freshness is poor (typically T+1, meaning data becomes queryable the day after it is generated) and support for changelog streams is limited.
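As a rough illustration of partition-level overwrite, the sketch below models a warehouse table as a Python dictionary keyed by partition. The names insert_overwrite, daily_batch_job, and dws_table are hypothetical and are not part of any real engine; the point is only that a late-arriving event forces the whole partition to be recomputed and replaced.

```python
from collections import defaultdict

def insert_overwrite(table, partition, rows):
    """Replace the entire partition with freshly computed rows."""
    table[partition] = list(rows)

def daily_batch_job(raw_events, day):
    """Recompute per-user totals for one day from all raw events of that day."""
    totals = defaultdict(int)
    for event in raw_events:
        if event["day"] == day:
            totals[event["user"]] += event["amount"]
    return [{"user": u, "total": t} for u, t in sorted(totals.items())]

raw_events = [
    {"day": "2022-05-01", "user": "a", "amount": 10},
    {"day": "2022-05-01", "user": "b", "amount": 5},
    {"day": "2022-05-01", "user": "a", "amount": 7},  # arrives late
]

dws_table = {}
# First run only sees the two events that had arrived in time.
insert_overwrite(dws_table, "2022-05-01", daily_batch_job(raw_events[:2], "2022-05-01"))
# The late event cannot be applied incrementally: the job reruns and
# the whole partition is overwritten.
insert_overwrite(dws_table, "2022-05-01", daily_batch_job(raw_events, "2022-05-01"))
```

Freshness in this model is bounded by how often the batch job is scheduled, which is why such pipelines usually end up at T+1.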
Real‑Time Warehouse
Real-time warehouses built on Flink + Kafka achieve end-to-end latency on the order of seconds and excellent freshness, yet they face two major problems: the intermediate Kafka layer is not queryable, and the real-time pipeline incurs high storage and maintenance costs.
Desired Unified Architecture
A unified architecture should provide a queryable table abstraction for both streaming and batch data, allowing users to subscribe to change logs and run OLAP queries on the same table.
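A minimal sketch of what such a table abstraction could look like, assuming a single in-memory table keyed by primary key. The class UnifiedTable and its methods are hypothetical and do not reflect the Flink Table Store API; the change kinds +I, -U, +U, and -D follow the short names Flink uses for its row kinds.

```python
class UnifiedTable:
    """One table that can be consumed as a changelog or queried as a snapshot."""

    def __init__(self):
        self._changelog = []  # ordered (kind, key, value) records for streaming readers
        self._snapshot = {}   # key -> latest value, for batch / OLAP readers

    def upsert(self, key, value):
        if key in self._snapshot:
            # An update is logged as retract-old followed by emit-new.
            self._changelog.append(("-U", key, self._snapshot[key]))
            self._changelog.append(("+U", key, value))
        else:
            self._changelog.append(("+I", key, value))
        self._snapshot[key] = value

    def delete(self, key):
        if key in self._snapshot:
            self._changelog.append(("-D", key, self._snapshot.pop(key)))

    def subscribe(self, from_offset=0):
        """Streaming read: every change from the given offset onward."""
        return list(self._changelog[from_offset:])

    def query(self, predicate=lambda key, value: True):
        """Batch read: the current state, optionally filtered."""
        return {k: v for k, v in self._snapshot.items() if predicate(k, v)}


orders = UnifiedTable()
orders.upsert("o1", 10)
orders.upsert("o1", 15)
orders.upsert("o2", 3)
orders.delete("o2")
```

Here orders.subscribe() yields the full change history for a downstream streaming job, while orders.query() returns only the current rows, which is the view an OLAP query would want.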
Streaming Data Warehouse Concept
The goal is a unified system that balances freshness, latency, and cost, supporting real‑time, near‑real‑time, and offline workloads while keeping costs low.
Flink Table Store
Flink Table Store is a unified stream and batch storage layer designed for streaming data warehouses. It extends Flink's capabilities from computation to storage, offering a unified table format, native Kafka log integration, and low-cost lake storage.
Architecture Overview
Coordinator: manages executors, handles client discovery, and oversees lifecycle.
Data Manager: manages table versions, interacts with the metastore, checkpoints versions, and handles caching and indexing based on write patterns.
Resource Manager: distributes table buckets across executors and dynamically allocates them.
Executor: receives updates, writes to local cache and disk, flushes to underlying lake storage, and serves real-time OLAP queries and queue consumption.
Metastore: an abstract node that can connect to Hive Metastore or a filesystem-based store, holding basic table metadata while detailed data resides in lake storage.
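To make the Resource Manager's role more concrete, the sketch below hashes record keys into a fixed number of buckets and spreads the buckets over executors. The article does not describe the actual assignment algorithm, so the round-robin placement and the function names bucket_of and assign_buckets are assumptions made purely for illustration.

```python
import zlib

def bucket_of(key: str, num_buckets: int) -> int:
    """Map a record key to a bucket with a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def assign_buckets(num_buckets: int, executors: list) -> dict:
    """Spread buckets over executors round-robin (an assumed policy)."""
    assignment = {executor: [] for executor in executors}
    for bucket in range(num_buckets):
        assignment[executors[bucket % len(executors)]].append(bucket)
    return assignment

def executor_for(key: str, num_buckets: int, assignment: dict) -> str:
    """Find which executor currently owns the bucket a key falls into."""
    bucket = bucket_of(key, num_buckets)
    for executor, buckets in assignment.items():
        if bucket in buckets:
            return executor
    raise KeyError(f"bucket {bucket} is not assigned to any executor")

assignment = assign_buckets(4, ["exec-0", "exec-1"])
owner = executor_for("user-42", 4, assignment)
```

Because the key-to-bucket mapping does not depend on the number of executors, adding an executor only requires moving whole buckets between nodes rather than rehashing individual records.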
Lake Storage
Lake storage is built on columnar file formats (Apache ORC) and LSM structures, stored on DFS/Object Store, providing low‑cost, update‑friendly storage with efficient query acceleration via data skipping.
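The combination of an LSM structure and data skipping can be sketched as follows. This is a toy model under stated assumptions, not the Table Store file format: each SortedRun stands in for one immutable file, its min_key and max_key play the role of file-level statistics, and TinyLsm, memtable_limit, and files_read are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class SortedRun:
    rows: dict  # key -> value; stands in for one immutable, sorted file

    @property
    def min_key(self):
        return min(self.rows)

    @property
    def max_key(self):
        return max(self.rows)

class TinyLsm:
    def __init__(self, memtable_limit=2):
        self.memtable = {}   # recent writes, updated in place
        self.runs = []       # flushed files, oldest first
        self.memtable_limit = memtable_limit
        self.files_read = 0  # counts files actually opened by get()

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if self.memtable:
            self.runs.append(SortedRun(dict(sorted(self.memtable.items()))))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):  # newest file wins
            if key < run.min_key or key > run.max_key:
                continue  # data skipping: statistics rule this file out
            self.files_read += 1
            if key in run.rows:
                return run.rows[key]
        return None

store = TinyLsm(memtable_limit=2)
store.put(1, "a")
store.put(2, "b")    # flushes a file covering keys 1..2
store.put(10, "c")
store.put(11, "d")   # flushes a file covering keys 10..11
value = store.get(1)  # skips the 10..11 file, opens only the 1..2 file
```

Updates are cheap because they only touch the in-memory table and append new files, while reads stay efficient because files whose key range cannot match are never opened.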
Cold‑Hot Separation
Streaming pipeline & online OLAP query: data flows through the coordinator and executors.
Batch pipeline & offline query: data is read/written via the metastore to lake storage.
Data held by the service is the freshest. Lake storage lags it slightly, by roughly one checkpoint interval (on the order of minutes), and the two remain consistent with each other.
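The relationship between the hot and cold paths can be sketched like this, assuming that a checkpoint is the moment buffered service data is published to lake storage. HotColdTable and its methods are hypothetical names for illustration only.

```python
class HotColdTable:
    def __init__(self):
        self.service_buffer = {}  # hot: recent writes held by the service
        self.lake_snapshot = {}   # cold: data already published to lake storage

    def write(self, key, value):
        """Streaming writes land in the service first."""
        self.service_buffer[key] = value

    def checkpoint(self):
        """Publish buffered writes to the lake and empty the hot buffer."""
        self.lake_snapshot.update(self.service_buffer)
        self.service_buffer.clear()

    def online_read(self):
        """Online OLAP view: lake data overlaid with not-yet-published writes."""
        return {**self.lake_snapshot, **self.service_buffer}

    def batch_read(self):
        """Offline view: only what has reached the lake."""
        return dict(self.lake_snapshot)

table = HotColdTable()
table.write("k1", 1)
table.checkpoint()
table.write("k1", 2)  # newer value, visible only to online readers for now
table.write("k2", 9)
```

At this point table.online_read() already reflects the latest writes, while table.batch_read() still returns the state as of the last checkpoint; after the next checkpoint the two views agree.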
Short‑Term Goals
The immediate roadmap focuses on delivering a unified table abstraction, accelerating offline warehouses, and making the intermediate layers of real-time pipelines queryable without compromising Kafka stability.
Planned features for version 0.2.0 include Hive reader support, bucket scaling, append‑only mode, full schema evolution, and broader engine compatibility (Presto, Trino, Spark).
Future Plans
Mid‑term objectives aim to introduce a Flink Table Store Service for millisecond‑level streaming pipelines and strong OLAP capabilities, further bridging the gap between streaming and batch processing.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
