
Key Features of Apache Flink 1.20: Materialized Tables, DISTRIBUTED BY, and State/Checkpoint Optimizations

The article reviews Apache Flink 1.20, highlighting the new Materialized Table concept, the DISTRIBUTED BY support for load‑balanced storage and join performance, and state/checkpoint file merging improvements, while providing code examples and practical insights for users.


On August 2, 2024, Apache Flink 1.20 was released. This article is a concise walk-through of the official release notes, focusing on the features most relevant to users.

Materialized Table: Flink 1.20 introduces the concept of a Materialized Table. By defining a query and a freshness interval, the engine automatically derives the table schema and creates the corresponding data-processing pipeline to ensure the query results meet the required freshness.

Example code for creating, suspending, resuming, and refreshing a materialized table:

-- 1. Create materialized table and define freshness
CREATE MATERIALIZED TABLE dwd_orders (
  PRIMARY KEY(ds, id) NOT ENFORCED
) PARTITIONED BY (ds)
FRESHNESS = INTERVAL '3' MINUTE
AS SELECT
  o.ds,
  o.id,
  o.order_number,
  o.user_id,
  ...
FROM orders AS o
LEFT JOIN products FOR SYSTEM_TIME AS OF proctime() AS prod ON o.product_id = prod.id
LEFT JOIN order_pay AS pay ON o.id = pay.order_id AND o.ds = pay.ds;

-- 2. Suspend data refresh
ALTER MATERIALIZED TABLE dwd_orders SUSPEND;

-- 3. Resume data refresh
ALTER MATERIALIZED TABLE dwd_orders RESUME
WITH (
  'sink.parallelism' = '10'
);

-- 4. Manually refresh historical data
ALTER MATERIALIZED TABLE dwd_orders REFRESH PARTITION(ds='20231023');

In effect, this defines a sink table named dwd_orders whose downstream pipeline refreshes the data every three minutes; the sink storage can be Kafka, Paimon, Hudi, and so on.

DISTRIBUTED BY: Flink 1.20 adds support for bucketed (distributed) table storage, a concept also found in Hive, Doris, and other engines. It splits data into disjoint subsets to achieve load balancing in external storage systems.
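
A minimal DDL sketch of the new clause (table and column names are illustrative; the target connector must support bucketing, so the connector option is left as a placeholder):

-- Sketch: hash rows into 4 buckets by user_id (illustrative names)
CREATE TABLE user_events (
  user_id BIGINT,
  event_type STRING,
  event_time TIMESTAMP(3)
) DISTRIBUTED BY (user_id) INTO 4 BUCKETS
WITH (
  'connector' = '...'   -- fill in a connector that supports bucketing
);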

First, it helps achieve data load balancing on the sink side.

Flink sink connectors (e.g., the Kafka sink) already support a sink.partitioner configuration, which works together with Flink's partition operator to balance data on the sink side.
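
For reference, a sketch of a Kafka sink table using this built-in option (topic, server address, and field names are illustrative):

-- Sketch: Kafka sink with the sink.partitioner option (illustrative names)
CREATE TABLE kafka_orders_sink (
  order_id BIGINT,
  user_id BIGINT,
  amount DECIMAL(10, 2)
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders_out',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json',
  -- 'fixed' pins each sink subtask to one Kafka partition;
  -- 'round-robin' or a custom partitioner class can be used instead
  'sink.partitioner' = 'fixed'
);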

Many companies have further extended the Kafka sink connector with additional parameters to improve usability.

Second, as a compute engine, Flink must also consider join performance optimization.

By simplifying table creation while allowing the engine to perceive the physical distribution of external data, Flink prepares for future optimizations such as bucket joins.

State and Checkpoint Optimizations: Flink 1.20 introduces a unified checkpoint file merging mechanism that consolidates many small checkpoint files into fewer large ones, reducing file-creation/deletion overhead and easing filesystem metadata pressure. It also merges, via the RocksDB API, the small files generated by the RocksDB state backend in the background.
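
As a rough sketch, the mechanism is switched on through configuration; the keys below assume the reorganized execution.checkpointing.* naming introduced in 1.20 and should be verified against the official configuration reference:

-- Sketch: enabling unified checkpoint file merging from the SQL client
-- (the same keys can also be set in the cluster configuration file)
SET 'execution.checkpointing.file-merging.enabled' = 'true';
-- Assumed optional key: also merge files across checkpoint boundaries
SET 'execution.checkpointing.file-merging.across-checkpoint-boundary' = 'true';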

For additional improvements, refer to the official documentation.

Official Announcement – Apache Flink 1.20 Release

Tags: big data, stream processing, Apache Flink, Materialized Table, Checkpoint Optimization, Distributed By
Written by Big Data Technology & Architecture (Wang Zhiwu), a big data expert dedicated to sharing big data technology.