Understanding Join Execution in Spark SQL
This article explains how Spark SQL processes joins—including inner, outer, semi, and anti joins—by describing the overall query planning flow, the three physical join strategies (sort‑merge, broadcast, and hash), and the specific implementation details for each join type.
SparkSQL Overall Process
SparkSQL queries can be written as SQL statements or built using Dataset/DataFrame APIs, both of which are parsed into logical plans, optimized, and then translated into physical plans that are executed as RDDs.
Join Fundamentals
A join consists of three elements: the join type, join condition, and optional filter condition. Spark supports inner, left/right outer, full outer, left semi, left anti, left/right semi, and other join types.
Physical Join Implementations
Spark provides three main physical join strategies:
Sort‑Merge Join : Both sides are shuffled and sorted; during the merge phase, each record from the larger (stream) side is matched against the sorted smaller (build) side using sequential search.
Broadcast Join : When the build side is small (default threshold 10 MB), Spark broadcasts it to all executors and builds a hash table, avoiding a shuffle.
Hash Join : After shuffle, the build side is placed into a hash table without sorting; it is used only when the build side fits in memory and specific configuration flags are set.
Sort‑Merge Join Details
Both tables are shuffled based on the join key, sorted, and then merged; this allows efficient sequential lookup in the build side.
Broadcast Join Details
If the estimated size of the build side is below spark.sql.autoBroadcastJoinThreshold , Spark automatically chooses a broadcast join, effectively performing a map‑side join.
Hash Join Details
Hash join is disabled by default and requires four conditions: the build side exceeds the broadcast threshold, the flag spark.sql.join.preferSortMergeJoin is set to false, each partition’s build side fits in memory, and the stream side is at least three times larger.
Join Type Implementations
Inner Join
Spark automatically assigns the larger table as the stream side and the smaller as the build side, joining only matching records.
Left Outer Join
All rows from the left (stream) side are retained; non‑matching rows from the right side produce null‑filled records.
Right Outer Join
Symmetric to left outer join, with the right side as the stream side.
Full Outer Join
Implemented using sort‑merge join; it combines left and right outer join logic without duplicating inner‑join results.
Left Semi Join
Returns only rows from the left side that have a match in the right side.
Left Anti Join
Returns only rows from the left side that have no match in the right side.
Conclusion
Understanding SparkSQL’s join mechanisms—including the overall query flow, physical strategies, and specific join‑type implementations—helps developers grasp performance characteristics and choose the most appropriate join approach for their big‑data workloads.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.