Big Data 10 min read

Understanding Join Execution in Spark SQL

This article explains how Spark SQL processes joins—including inner, outer, semi, and anti joins—by describing the overall query planning flow, the three physical join strategies (sort‑merge, broadcast, and hash), and the specific implementation details for each join type.

Big Data Technology Architecture
Big Data Technology Architecture
Big Data Technology Architecture
Understanding Join Execution in Spark SQL

SparkSQL Overall Process

SparkSQL queries can be written as SQL statements or built using Dataset/DataFrame APIs, both of which are parsed into logical plans, optimized, and then translated into physical plans that are executed as RDDs.

Join Fundamentals

A join consists of three elements: the join type, join condition, and optional filter condition. Spark supports inner, left/right outer, full outer, left semi, left anti, left/right semi, and other join types.

Physical Join Implementations

Spark provides three main physical join strategies:

Sort‑Merge Join : Both sides are shuffled and sorted; during the merge phase, each record from the larger (stream) side is matched against the sorted smaller (build) side using sequential search.

Broadcast Join : When the build side is small (default threshold 10 MB), Spark broadcasts it to all executors and builds a hash table, avoiding a shuffle.

Hash Join : After shuffle, the build side is placed into a hash table without sorting; it is used only when the build side fits in memory and specific configuration flags are set.

Sort‑Merge Join Details

Both tables are shuffled based on the join key, sorted, and then merged; this allows efficient sequential lookup in the build side.

Broadcast Join Details

If the estimated size of the build side is below spark.sql.autoBroadcastJoinThreshold , Spark automatically chooses a broadcast join, effectively performing a map‑side join.

Hash Join Details

Hash join is disabled by default and requires four conditions: the build side exceeds the broadcast threshold, the flag spark.sql.join.preferSortMergeJoin is set to false, each partition’s build side fits in memory, and the stream side is at least three times larger.

Join Type Implementations

Inner Join

Spark automatically assigns the larger table as the stream side and the smaller as the build side, joining only matching records.

Left Outer Join

All rows from the left (stream) side are retained; non‑matching rows from the right side produce null‑filled records.

Right Outer Join

Symmetric to left outer join, with the right side as the stream side.

Full Outer Join

Implemented using sort‑merge join; it combines left and right outer join logic without duplicating inner‑join results.

Left Semi Join

Returns only rows from the left side that have a match in the right side.

Left Anti Join

Returns only rows from the left side that have no match in the right side.

Conclusion

Understanding SparkSQL’s join mechanisms—including the overall query flow, physical strategies, and specific join‑type implementations—helps developers grasp performance characteristics and choose the most appropriate join approach for their big‑data workloads.

Big DatajoinSQL Optimizationdistributed computingspark sqlDataFrames
Big Data Technology Architecture
Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.