Understanding SparkSQL Join Algorithms: Shuffle Hash Join, Broadcast Hash Join, and Sort Merge Join
This article explains SparkSQL's three join strategies—Shuffle Hash Join, Broadcast Hash Join, and Sort Merge Join—detailing their mechanisms, when to use each based on table size, and their relative performance costs in distributed big‑data environments.
SparkSQL Join Algorithms Overview
SparkSQL supports three join algorithms: Shuffle Hash Join, Broadcast Hash Join, and Sort Merge Join. The first two are variations of the classic Hash Join, adapted with shuffle or broadcast steps for distributed processing, while Sort Merge Join is a distinct approach for large tables.
Hash Join
A typical hash join involves three steps: (1) determining the Build Table (usually the smaller table) and the Probe Table (the larger one); (2) constructing a hash table from the Build Table's join keys, stored in memory or spilled to disk; and (3) probing the hash table with rows from the Probe Table to produce the joined result.
Broadcast Hash Join
When one side of the join is small enough (default spark.sql.autoBroadcastJoinThreshold is 10 MB), Spark can broadcast that table to all executors. The broadcast phase distributes the small table, and the hash join phase builds a hash table on each executor and matches it against partitions of the larger table. Broadcast joins are only allowed when the small table is on the right side of a left outer join.
Sort Merge Join
For joins between two large tables, SparkSQL uses Sort Merge Join. Both tables are first shuffled based on the join key, then each partition is locally sorted. Finally, the sorted partitions are merged by scanning the two ordered streams and emitting matches, which avoids loading entire tables into memory.
The cost hierarchy is: Broadcast Hash Join < Shuffle Hash Join < Sort Merge Join. To improve performance, avoid large‑large table joins and consider increasing spark.sql.autoBroadcastJoinThreshold so more joins can be executed as broadcast joins.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
