Understanding Flink SQL Architecture, Optimizations, and Internal Mechanisms
This article explains the evolution of Apache Flink's SQL support, detailing the Blink Planner architecture, the end‑to‑end Flink SQL workflow, logical and physical planning, code generation, stream‑specific optimizations such as retraction and mini‑batch, and future development directions.
Four years ago the Apache Flink community added SQL support to simplify and unify batch and stream processing. Today Flink runs critical batch and streaming SQL queries at companies such as Alibaba, Huawei, Lyft, Uber, and Yelp. This article aims to help readers understand the Flink SQL engine, covering its architecture, execution workflow, and optimizations.
1. Flink SQL Architecture & Blink Planner (1.9+)
Old Planner limitations: Before version 1.9, the Table API & SQL presented a unified interface to users, but the translation layer was split: queries were translated to the DataStream API for streaming and to the DataSet API for batch, requiring two separate sets of plans and optimizer rules and limiting code reuse.
Unified Blink Planner: The Blink Planner treats batch SQL as a special case of stream SQL, so common processing and optimization logic can be shared. Both stream and batch plans are translated to the same underlying stream Transformation API, replacing the old planner's two translation paths. The planner layer is pluggable, and Blink is planned to become the default in version 1.11.
2. Flink SQL Workflow
The workflow from a SQL statement or Table API program to an executable JobGraph consists of three stages:
Convert SQL/Table API code to a logical execution plan (Logical Plan).
Optimize the Logical Plan into a physical execution plan (Physical Plan).
Generate Transformations via code generation and compile them into a JobGraph for execution.
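These three stages can be sketched, very loosely, as a function pipeline. All names and data structures below are invented for illustration; they are not Flink's actual API:

```python
# Illustrative sketch of the three-stage Flink SQL compilation pipeline.
# All function names and plan representations here are hypothetical.

def parse_and_validate(sql: str) -> dict:
    """Stage 1: SQL text -> logical plan (here: a toy dict tree)."""
    return {"op": "LogicalProject", "input": {"op": "LogicalTableScan", "sql": sql}}

def optimize(logical_plan: dict) -> dict:
    """Stage 2: logical plan -> physical plan (choose operator implementations)."""
    physical = dict(logical_plan)
    physical["op"] = logical_plan["op"].replace("Logical", "Physical")
    return physical

def translate(physical_plan: dict) -> list:
    """Stage 3: physical plan -> Transformations, later compiled into a JobGraph."""
    return [physical_plan["op"]]

job_graph = translate(optimize(parse_and_validate("SELECT * FROM t1")))
print(job_graph)  # a trivial "job graph" with one physical operator
```

The real pipeline carries far more state (catalog, statistics, optimizer rules), but the shape is the same: each stage consumes the previous stage's plan representation.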
3. Logical Planning
Flink SQL uses Apache Calcite to parse SQL into an abstract syntax tree, validates it against the catalog, and translates it into relational algebra (a RelNode tree). The optimizer then creates the initial Logical Plan. Table API code follows a similar path using the Table API validator.
Example SQL:
SELECT t1.id, 1 + 2 + t1.value AS v
FROM t1, t2
WHERE t1.id = t2.id AND t2.id < 1000

The resulting logical tree looks like:

LogicalProject(id=[$0], v=[+(+(1, 2), $1)])
+- LogicalFilter(condition=[AND(=($0, $3), <($3, 1000))])
   +- LogicalJoin(condition=[true], joinType=[inner])
      :- LogicalTableScan(table=[[default_catalog, default, t1]])
      +- LogicalTableScan(table=[[default_catalog, default, t2]])

4. Physical Planning on Batch
The optimizer estimates costs for alternative physical operators (e.g., SortMergeJoin, HashJoin, BroadcastHashJoin) and selects the cheapest plan. In the example, the filter t2.id < 1000 leaves only a small number of rows from t2 reaching the join, so broadcasting that small side (a BroadcastHashJoin) is chosen.
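A toy cost model illustrates why a small build side favors a broadcast join. The cost formulas, row counts, and parallelism below are invented for illustration; Flink's real optimizer uses Calcite's cost framework with statistics from the catalog:

```python
import math

# Toy cost comparison between join strategies. All formulas and numbers
# here are illustrative assumptions, not Flink's actual cost model.

def broadcast_join_cost(left_rows, right_rows, parallelism):
    # Ship the small side to every parallel task, then probe locally.
    return right_rows * parallelism + left_rows

def sort_merge_join_cost(left_rows, right_rows):
    # Shuffle both sides, sort each (n log n), then merge.
    return (left_rows * math.log2(max(left_rows, 2))
            + right_rows * math.log2(max(right_rows, 2)))

left, right, p = 10_000_000, 1_000, 16   # t2 filtered down by id < 1000
costs = {
    "BroadcastHashJoin": broadcast_join_cost(left, right, p),
    "SortMergeJoin": sort_merge_join_cost(left, right),
}
best = min(costs, key=costs.get)
print(best)  # broadcasting the 1,000-row side is far cheaper here
```

With these (made-up) numbers, broadcasting costs roughly 10 million units while sort-merge costs over 200 million, which mirrors the optimizer's choice in the example.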
5. Translation & Code Generation
During translation, Flink generates Java code for each physical operator and compiles it into Transformations. For the filter t2.id < 1000, the generated code emits only rows satisfying the predicate, avoiding per-record interpretation overhead.
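A minimal sketch of the idea behind predicate code generation: compile the filter expression into an executable function at plan time. Flink generates and compiles Java source; the Python mimicry below (including the generate_filter helper) is purely illustrative:

```python
# Sketch of operator code generation for the predicate t2.id < 1000.
# Flink generates Java source per operator and compiles it; here we mimic
# the idea by generating Python source and compiling it at plan time.

def generate_filter(field: str, op: str, literal: int):
    # Build source text for the predicate, then compile it once.
    source = f"def _filter(row):\n    return row['{field}'] {op} {literal}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["_filter"]

keep_row = generate_filter("id", "<", 1000)
rows = [{"id": 5}, {"id": 1000}, {"id": 42}]
kept = [r for r in rows if keep_row(r)]
print(kept)  # only rows satisfying id < 1000 pass through
```

The payoff is that the hot loop runs compiled predicate code instead of repeatedly interpreting an expression tree per record.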
6. Physical Planning on Stream
Stream mode introduces the Retraction (Changelog) mechanism, which emits update_before and update_after messages to correct earlier results, essential for aggregations that would otherwise produce inconsistent outputs.
A word-frequency example demonstrates why retraction is needed: when a word's count changes, the previously emitted count must be retracted downstream, or intermediate results become inconsistent.
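The word-frequency case can be sketched as a "count of counts" query: the inner aggregate counts each word, and the outer aggregate counts how many words have each count. The changelog encoding below (tagged tuples) is a simplification of Flink's actual row kinds:

```python
# Sketch of retraction for a count-of-counts query, roughly:
#   SELECT cnt, COUNT(*) FROM
#     (SELECT word, COUNT(*) AS cnt FROM words GROUP BY word)
#   GROUP BY cnt
# When a word's count moves from 1 to 2, the inner aggregate must emit an
# update_before ("-U") retracting the old count and an update_after ("+U")
# with the new count, or the outer aggregate over-counts.

from collections import defaultdict

word_count = defaultdict(int)     # inner aggregate state: word -> count
freq_of_count = defaultdict(int)  # outer aggregate state: count -> #words

def process(word):
    old = word_count[word]
    word_count[word] += 1
    changelog = []
    if old > 0:
        changelog.append(("-U", old))   # update_before: retract old result
    changelog.append(("+U", old + 1))   # update_after: emit new result
    for tag, cnt in changelog:          # outer aggregate consumes the changelog
        freq_of_count[cnt] += 1 if tag == "+U" else -1

for w in ["hello", "world", "hello"]:
    process(w)

# "hello" appears twice, "world" once -> one word with cnt=1, one with cnt=2
result = {c: n for c, n in freq_of_count.items() if n}
print(result)  # {1: 1, 2: 1}
```

Dropping the "-U" message in the sketch leaves a stale entry for cnt=1, which is exactly the inconsistency retraction prevents.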
7. Update_before Decision
The optimizer determines for each node whether update_before messages are required. It annotates each node with the changelog row kinds it produces ([I]nsert, [U]pdate, [D]elete) and propagates requirements both top-down and bottom-up, so that update_before messages are generated only where a downstream operator needs them.
8. Internal Optimizations
BinaryRow: Replaces the generic Row (an Object[] of boxed fields) with a compact binary format, reducing (de)serialization overhead and enabling field-level access via fixed offsets.
Mini-batch Processing: Buffers a small batch of input records before triggering aggregation, lowering per-record state access and serialization costs, improving throughput and reducing retraction-induced jitter.
Skew Processing: Mitigates data skew with local-global (two-stage) aggregation and, for distinct aggregates, by adding a bucket key derived from the distinct key so the data is hash-partitioned before the two-stage aggregation.
Top-N Rewrite: Rewrites ROW_NUMBER()-based top-N queries into a specialized Rank operator that maintains only the top N records in a heap, avoiding a full sort.
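The core data-structure idea behind the Top-N rewrite can be sketched with a fixed-size min-heap for a hypothetical "top 3 by score" query. Flink's actual Rank operator additionally manages per-key state and retractions; this sketch shows only the heap maintenance:

```python
# Sketch of the Top-N optimization: keep only N rows in a min-heap instead
# of fully sorting the input. The smallest retained score sits at the heap
# top, so each new row needs at most one O(log N) heap operation.

import heapq

class TopN:
    def __init__(self, n):
        self.n = n
        self.heap = []  # min-heap of (score, row_id)

    def insert(self, score, row_id):
        if len(self.heap) < self.n:
            heapq.heappush(self.heap, (score, row_id))
        elif score > self.heap[0][0]:
            heapq.heapreplace(self.heap, (score, row_id))  # evict smallest
        # otherwise the row cannot be in the top N: drop it

    def result(self):
        return sorted(self.heap, reverse=True)  # descending, at most N rows

top3 = TopN(3)
for score, rid in [(10, "a"), (50, "b"), (20, "c"), (40, "d"), (5, "e")]:
    top3.insert(score, rid)
print(top3.result())  # [(50, 'b'), (40, 'd'), (20, 'c')]
```

For a stream of M rows this costs O(M log N) with O(N) state, versus O(M log M) time and O(M) state for a full sort.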
9. Summary & Futures
The article recaps the introduction of the Blink Planner in Flink 1.9+, the internal workings of the Flink SQL engine, and numerous optimizer rules that improve performance while keeping the user experience transparent.
Future work includes making the Blink Planner the default in Flink 1.11, plus the FLIP-95/105/115/123 initiatives covering the TableSource/TableSink redesign, CDC support, a generalized filesystem connector, and Hive DDL/DML compatibility.
Thank you for reading.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.