Understanding JOIN Operators: From Traditional Databases to Apache Flink Streaming
This article explains the purpose and types of SQL JOIN operators, demonstrates their syntax and semantics with examples, compares traditional database joins to Apache Flink's streaming two‑stream join implementation, and discusses optimization techniques such as state management, shuffle handling, and join reordering.
In the first part the article introduces the concept of JOIN, explaining why relational databases need join operators to combine data from multiple tables and how normalization (1NF‑BCNF) prevents storing all data in a single large table.
The different types of joins are listed: CROSS JOIN, INNER JOIN, LEFT/RIGHT/FULL OUTER JOIN, SELF JOIN, and non‑equijoin. Their semantics are illustrated with a concrete example involving a student, course, and score table.
SQL syntax differences between the older SQL‑89 style (comma‑separated tables with conditions in WHERE) and the clearer SQL‑92 style (join condition in ON) are shown:
SELECT a.colA, b.colA<br/>FROM tab1 AS a , tab2 AS b<br/>WHERE a.id = b.id AND a.other > b.otherand
SELECT a.colA, b.colA<br/>FROM tab1 AS a JOIN tab2 AS b ON a.id = b.id<br/>WHERE a.other > b.otherSeveral query examples demonstrate how each join type works, including the use of CASE WHEN s.score IS NULL THEN 0 ELSE s.score END to fill missing scores, and how filter push‑down can improve performance for inner joins.
The article then shifts to Apache Flink, describing how Flink implements joins on unbounded streams. It highlights three core differences from traditional database joins: infinite input streams, continuously updating results, and the need to buffer both sides of the stream in state.
For stream joins Flink uses key‑based partitioning (shuffle) to ensure matching keys are processed on the same task, and stores incoming events in two separate states (LState and RState). When a left‑hand event arrives it is stored and joined with all buffered right‑hand events, and vice‑versa.
Implementation details for INNER JOIN and LEFT OUTER JOIN are explained, including how Flink emits retract ("‑") messages to withdraw previously emitted rows when a matching right‑hand event later arrives. The state data structure is a nested map: Map<JoinKey, Map<RowData, Count>>, which efficiently tracks duplicate rows, positive and retract messages, and the first matching event for each key.
Finally, the article discusses practical optimizations: constructing primary‑key sources to reduce data volume, avoiding NULL‑induced hotspots in multi‑stage left joins, and reordering join operations to minimize state pressure and shuffle traffic. The conclusion summarizes the reasons for join operators, their semantics, optimization techniques, and Flink’s streaming join architecture.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
