Big Data 5 min read

Understanding SparkSQL Join Algorithms: Shuffle Hash Join, Broadcast Hash Join, and Sort Merge Join

This article explains SparkSQL's three join strategies—Shuffle Hash Join, Broadcast Hash Join, and Sort Merge Join—detailing their mechanisms, when to use each based on table size, and their relative performance costs in distributed big‑data environments.

Big Data Technology & Architecture

Nov 16, 2019

Understanding SparkSQL Join Algorithms: Shuffle Hash Join, Broadcast Hash Join, and Sort Merge Join

SparkSQL Join Algorithms Overview

SparkSQL supports three join algorithms: Shuffle Hash Join, Broadcast Hash Join, and Sort Merge Join. The first two are variations of the classic Hash Join, adapted with shuffle or broadcast steps for distributed processing, while Sort Merge Join is a distinct approach for large tables.

Hash Join

A typical hash join involves three steps: (1) determining the Build Table (usually the smaller table) and the Probe Table (the larger one); (2) constructing a hash table from the Build Table's join keys, stored in memory or spilled to disk; and (3) probing the hash table with rows from the Probe Table to produce the joined result.

Broadcast Hash Join

When one side of the join is small enough (default spark.sql.autoBroadcastJoinThreshold is 10 MB), Spark can broadcast that table to all executors. The broadcast phase distributes the small table, and the hash join phase builds a hash table on each executor and matches it against partitions of the larger table. Broadcast joins are only allowed when the small table is on the right side of a left outer join.

Sort Merge Join

For joins between two large tables, SparkSQL uses Sort Merge Join. Both tables are first shuffled based on the join key, then each partition is locally sorted. Finally, the sorted partitions are merged by scanning the two ordered streams and emitting matches, which avoids loading entire tables into memory.

The cost hierarchy is: Broadcast Hash Join < Shuffle Hash Join < Sort Merge Join. To improve performance, avoid large‑large table joins and consider increasing spark.sql.autoBroadcastJoinThreshold so more joins can be executed as broadcast joins.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data SparkSQL Hash Join Join Algorithms Broadcast Join Sort Merge Join

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.