
Understanding Flink Table Store: Design, Usage, and Roadmap

Flink Table Store, an Apache Flink subproject, provides a unified stream‑batch storage layer with SQL‑based table APIs to serve both real‑time and offline data needs. This article covers its design goals, usage patterns, architectural layers, implementation choices, and upcoming roadmap.

DataFunTalk

Introduction

Flink Table Store is a subproject of Apache Flink that serves as a crucial storage component in the evolution of unified stream‑batch processing. This presentation introduces the design motivations, core requirements, and future planning of Flink Table Store.

Business Requirements

Real‑time data warehouses need a dynamic‑table concept similar to Hive's offline tables, where data changes continuously. Traditional Kafka middle layers are not queryable and have limited retention, leading to high analysis costs. A queryable intermediate storage layer is required to bridge real‑time and offline pipelines.

Core Requirements for Flink Table Store

1. Provide unified stream‑batch storage with OLAP‑style columnar queries and millisecond‑level streaming reads, supporting Insert Overwrite.
2. Serve as the most complete Flink connector, supporting all Flink SQL concepts, output from any Flink job, and all data types.
3. Offer the best user experience of any Flink connector, delivering database‑level interaction and large‑scale updates.

Using Flink Table Store

1. Table Creation: Creating a table materializes physical storage; users do not need to specify connection details, which are configured at the session level. Primary‑key indexes enable fast point queries.
2. Unified Read: Tables support both batch and streaming reads. Switching between modes is done by setting the execution.runtime-mode parameter to batch (snapshot query) or streaming (continuous change capture).
3. Unified Write: A real‑time job can keep writing to a table while an offline job rewrites erroneous partitions. This is achieved by toggling execution.runtime-mode and applying filter conditions in SQL to launch an offline rewrite task.
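The workflow above can be sketched in Flink SQL. This is an illustrative sketch only: the table names, columns, and partition values are hypothetical, and exact DDL options may vary across Flink and Table Store versions.

```sql
-- Create a table with a primary key; physical storage is materialized
-- automatically, with no connector details (those are session-level config).
CREATE TABLE word_count (
    word STRING PRIMARY KEY NOT ENFORCED,
    cnt  BIGINT
);

-- Batch read: a snapshot query over the table's current state.
SET 'execution.runtime-mode' = 'batch';
SELECT * FROM word_count;

-- Streaming read: continuously consume changes from the same table.
SET 'execution.runtime-mode' = 'streaming';
SELECT * FROM word_count;

-- Offline correction: overwrite an erroneous partition in batch mode
-- while streaming writes to other partitions continue.
-- (my_table and my_table_corrected are hypothetical partitioned tables.)
SET 'execution.runtime-mode' = 'batch';
INSERT OVERWRITE my_table PARTITION (dt = '2022-01-01')
SELECT word, cnt FROM my_table_corrected WHERE dt = '2022-01-01';
```

The same table serves batch and streaming consumers; only the runtime mode and, for corrections, the partition filter change.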

Understanding Flink Table Store

1. Layered Design: Data is classified into gold (high‑realtime), silver (moderate‑realtime), and bronze (low‑realtime) tiers. The storage stack consists of a DFS layer for raw data, a File Store (near‑real‑time lake), the Table Store (real‑time pipeline for silver data), and a Table Store Service with a cache for gold data.
2. Existing Implementations: Two mature approaches are Copy‑on‑Write (merge updates into new files at write time) and Merge‑on‑Read (defer merging until read time). Indexing techniques include Key Min/Max, Bloom Filter, and Key‑Value Index. File‑merge methods include Sort‑Merge Join and Hash Join.
3. Flink Table Store Architecture: An LSM‑Tree serves as the underlying storage, providing efficient point and range queries. Data is written to a File Store (LSM buckets) and a Log Store (write‑ahead log). Batch reads pull from the File Store; streaming reads combine File Store snapshots with Log Store deltas.
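The Merge‑on‑Read idea behind the LSM design can be illustrated with a minimal sketch. This is not Flink Table Store's implementation, only a toy model under simplified assumptions: each sorted run is a key‑ordered list of (key, sequence, value) records, a read sort‑merges all runs, and the highest sequence number wins per key (a `None` value models a delete tombstone).

```python
import heapq

def merge_on_read(runs):
    """Sort-merge several key-ordered runs, keeping the newest record per key.

    Each run is a list of (key, seq, value) tuples sorted by key; higher seq
    means a more recent write. value=None represents a delete tombstone.
    """
    # heapq.merge performs the k-way sort-merge across runs in (key, seq) order.
    merged = heapq.merge(*runs, key=lambda r: (r[0], r[1]))
    result = {}
    for key, seq, value in merged:
        # Records arrive in ascending seq per key, so later writes overwrite
        # earlier ones for the same key.
        result[key] = value
    # Drop tombstones so deleted keys disappear from the read view.
    return {k: v for k, v in result.items() if v is not None}

# Two runs: an older flushed file, and a newer one updating "a" and deleting "c".
older = [("a", 1, "x"), ("b", 2, "y"), ("c", 3, "z")]
newer = [("a", 4, "x2"), ("c", 5, None)]

print(merge_on_read([older, newer]))  # → {'a': 'x2', 'b': 'y'}
```

Writes stay cheap (append a new run) at the cost of merging at read time; real LSM stores bound that read cost with background compaction and the indexes mentioned above.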

Project Roadmap

- V0.1: Basic log storage without service capabilities.
- V0.2: Hive integration, schema evolution, and a full LSM implementation; production‑ready but still non‑service.
- V0.3: Service‑level capabilities enabling true stream‑batch unified storage.
- V0.4: Mature version supporting dimension‑table joins in streaming computations.

Q&A

Q1: Is Flink Table Store the future direction for Flink tables?
A1: Yes; storage will be a focus in order to provide unified stream‑batch guarantees.

Q2: Does Flink Table Store support joins?
A2: It does not support dimension‑table joins due to point‑lookup limitations, but snapshot‑based OLAP joins are supported.

Q3: How does Flink Table Store mitigate LSM write/read amplification?
A3: Configurable parameters let users prioritize read performance while maintaining sufficient write throughput.

Thank you for attending.
