Big Data 6 min read

Flink 1.19 New Features: SQL Optimizations, Runtime Enhancements, and Checkpointing Improvements

The article reviews Flink 1.19’s new features, highlighting SQL capability enhancements such as custom source parallelism, TTL hints, and MiniBatch support for regular joins, as well as runtime dynamic parallelism for batch jobs and flexible checkpointing intervals for different data sources.

Big Data Technology & Architecture

Mar 20, 2024

Flink 1.19 New Features: SQL Optimizations, Runtime Enhancements, and Checkpointing Improvements

SQL Capability Optimization

SQL capability optimizations focus on three features: source table custom parallelism, SQL hint TTL configuration, and Regular Join MiniBatch optimization.

Source Table Custom Parallelism

Flink 1.19 introduces the ability to set scan.parallelism to configure parallelism, currently supported by the DataGen connector. Adjusting source parallelism improves consumption speed and join efficiency for sources such as RocketMQ, MySQL, and Redis. Kafka is a special case; its parallelism should match the number of partitions, otherwise it defaults to the job’s maximum parallelism.

SQL Hint TTL Configuration

The new version simplifies state TTL configuration via SQL hints. Example statements show how to set TTL for joins and aggregations, making state size reduction and resource consumption more manageable in production environments.

-- set state ttl for join
SELECT /*+ STATE_TTL('Orders'='1d', 'Customers'='20d') */ *
FROM Orders LEFT OUTER JOIN Customers ON Orders.o_custkey = Customers.c_custkey;

-- set state ttl for aggregation
SELECT /*+ STATE_TTL('o'='1d') */ o_orderkey, SUM(o_totalprice) AS revenue
FROM Orders AS o
GROUP BY o_orderkey;

Regular Join MiniBatch Optimization

Regular Join suffers from performance bottlenecks due to frequent state access. The MiniBatch optimization introduces batch deduplication to alleviate this issue, improving throughput for large or frequently accessed state.

Runtime Optimization

Flink 1.19 adds dynamic parallelism inference for source tables in batch jobs, allowing connectors to determine parallelism based on actual data volume. Currently, this feature is supported by the FileSource connector.

Checkpoint

The new release enables setting different checkpointing intervals for different data sources, such as reading a Hive table followed by consuming Kafka data, by configuring separate interval parameters.

execution.checkpointing.interval: 30sec
execution.checkpointing.interval-during-backlog: 30min

These are the key capabilities introduced in Flink 1.19 that users should monitor and experiment with in production environments.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Flink stream processing SQL parallelism checkpointing

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.