Databases 22 min read

Understanding JOIN Operators: From Traditional Databases to Apache Flink Streaming

This article explains the purpose and types of SQL JOIN operators, demonstrates their syntax and semantics with examples, compares traditional database joins to Apache Flink's streaming two‑stream join implementation, and discusses optimization techniques such as state management, shuffle handling, and join reordering.

Big Data Technology & Architecture
Big Data Technology & Architecture
Big Data Technology & Architecture
Understanding JOIN Operators: From Traditional Databases to Apache Flink Streaming

In the first part the article introduces the concept of JOIN, explaining why relational databases need join operators to combine data from multiple tables and how normalization (1NF‑BCNF) prevents storing all data in a single large table.

The different types of joins are listed: CROSS JOIN, INNER JOIN, LEFT/RIGHT/FULL OUTER JOIN, SELF JOIN, and non‑equijoin. Their semantics are illustrated with a concrete example involving a student, course, and score table.

SQL syntax differences between the older SQL‑89 style (comma‑separated tables with conditions in WHERE) and the clearer SQL‑92 style (join condition in ON) are shown:

SELECT a.colA, b.colA<br/>FROM tab1 AS a , tab2 AS b<br/>WHERE a.id = b.id AND a.other > b.other

and

SELECT a.colA, b.colA<br/>FROM tab1 AS a JOIN tab2 AS b ON a.id = b.id<br/>WHERE a.other > b.other

Several query examples demonstrate how each join type works, including the use of CASE WHEN s.score IS NULL THEN 0 ELSE s.score END to fill missing scores, and how filter push‑down can improve performance for inner joins.

The article then shifts to Apache Flink, describing how Flink implements joins on unbounded streams. It highlights three core differences from traditional database joins: infinite input streams, continuously updating results, and the need to buffer both sides of the stream in state.

For stream joins Flink uses key‑based partitioning (shuffle) to ensure matching keys are processed on the same task, and stores incoming events in two separate states (LState and RState). When a left‑hand event arrives it is stored and joined with all buffered right‑hand events, and vice‑versa.

Implementation details for INNER JOIN and LEFT OUTER JOIN are explained, including how Flink emits retract ("‑") messages to withdraw previously emitted rows when a matching right‑hand event later arrives. The state data structure is a nested map: Map<JoinKey, Map<RowData, Count>>, which efficiently tracks duplicate rows, positive and retract messages, and the first matching event for each key.

Finally, the article discusses practical optimizations: constructing primary‑key sources to reduce data volume, avoiding NULL‑induced hotspots in multi‑stage left joins, and reordering join operations to minimize state pressure and shuffle traffic. The conclusion summarizes the reasons for join operators, their semantics, optimization techniques, and Flink’s streaming join architecture.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

SQLState ManagementdatabaseApache FlinkStreaming
Big Data Technology & Architecture
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.