Understanding Broadcast, Shuffle, and Sort‑Merge Joins in Spark SQL

This article explains the principles, use cases, and performance considerations of Spark SQL's three join implementations—Broadcast Hash Join, Shuffle Hash Join, and Sort‑Merge Join—illustrating how table size and distribution affect the choice of algorithm for efficient large‑scale data processing.

Big Data Technology & Architecture

Introduction

Join operations are fundamental in SQL, and Spark provides three main join strategies—Broadcast Hash Join, Shuffle Hash Join, and Sort‑Merge Join—each suited to different data size scenarios.

Hash Join

A basic hash join builds a hash table from the smaller (Build) table and probes it with the larger (Probe) table. Each table is scanned once, so the cost is O(a+b), where a and b are the row counts of the two tables.

select * from `order`, item where item.id = `order`.i_id

The process involves determining Build and Probe tables, constructing the hash table in memory (or spilling to disk), and probing the hash table to produce joined rows.
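The build and probe phases can be sketched in plain Python. This is a minimal in-memory illustration, not Spark's implementation: the function name hash_join and the sample rows are made up for the example, and spilling to disk is not modeled.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Inner equi-join of two lists of dicts using an in-memory hash table."""
    # Build phase: index the smaller table by its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    # Probe phase: scan the larger table once and look up each key.
    joined = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined

# Hypothetical sample data mirroring the item/order query above.
items = [{"id": 1, "name": "pen"}, {"id": 2, "name": "ink"}]
orders = [
    {"o_id": 10, "i_id": 1},
    {"o_id": 11, "i_id": 2},
    {"o_id": 12, "i_id": 1},
    {"o_id": 13, "i_id": 3},  # no matching item, so the inner join drops it
]
result = hash_join(items, orders, build_key="id", probe_key="i_id")
```

Choosing the smaller table as the build side keeps the hash table small, while the larger table is only streamed through once.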

Broadcast Hash Join

When one table is small enough (below spark.sql.autoBroadcastJoinThreshold, which defaults to 10 MB), Spark broadcasts it to all executor nodes. Each node then performs a local hash join between the broadcast copy and its own partitions of the larger table, which avoids a costly shuffle of the larger table.

Conditions: the broadcast table must be smaller than the threshold, and it cannot be the side whose rows are all preserved by an outer join. In a left outer join, for example, only the right table can be broadcast.
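As a rough illustration of why no shuffle is needed, the sketch below simulates the idea in plain Python. It is a simplification under stated assumptions: each inner list stands in for one partition of the large table, the function name broadcast_hash_join and the sample rows are hypothetical, and real Spark ships a prebuilt hash relation to executors rather than rebuilding it per task.

```python
def broadcast_hash_join(small_rows, large_partitions, small_key, large_key):
    """Simulate a broadcast hash join over an already partitioned large table."""
    output = []
    for partition in large_partitions:  # one loop iteration per simulated task
        # Each task works with a full copy of the small table...
        lookup = {}
        for row in list(small_rows):
            lookup.setdefault(row[small_key], []).append(row)
        # ...and joins it with its local slice of the large table only,
        # so rows of the large table never move between partitions.
        output.append([
            {**match, **row}
            for row in partition
            for match in lookup.get(row[large_key], [])
        ])
    return output

# Hypothetical sample data: a small item table and two order partitions.
items = [{"id": 1, "name": "pen"}, {"id": 2, "name": "ink"}]
order_partitions = [
    [{"o_id": 10, "i_id": 1}, {"o_id": 11, "i_id": 3}],
    [{"o_id": 12, "i_id": 2}, {"o_id": 13, "i_id": 1}],
]
joined_partitions = broadcast_hash_join(items, order_partitions, "id", "i_id")
```

The output keeps the partitioning of the large table, which is the point of the strategy: only the small table is copied around.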

Shuffle Hash Join

If the small table is too large to broadcast, Spark shuffles both tables on the join key so that matching keys land in the same partition; each partition then performs a local hash join.
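The shuffle step followed by per-partition hash joins can be sketched as follows. This is a single-process simulation, not Spark code: the helper names, the partition count of 4, and the generated rows are assumptions for the example, and integer keys are used so that Python's hash() gives stable partition assignments.

```python
def shuffle(rows, key, num_partitions):
    """Route each row to a partition chosen by hashing its join key."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts

def local_hash_join(build_rows, probe_rows, build_key, probe_key):
    """Hash join within one partition: build on the small side, probe with the large."""
    lookup = {}
    for row in build_rows:
        lookup.setdefault(row[build_key], []).append(row)
    return [
        {**match, **row}
        for row in probe_rows
        for match in lookup.get(row[probe_key], [])
    ]

def shuffle_hash_join(small_rows, large_rows, small_key, large_key, num_partitions=4):
    # Both tables use the same hash function and partition count,
    # so equal keys always land in the partition with the same index.
    small_parts = shuffle(small_rows, small_key, num_partitions)
    large_parts = shuffle(large_rows, large_key, num_partitions)
    joined = []
    for small_part, large_part in zip(small_parts, large_parts):
        joined.extend(local_hash_join(small_part, large_part, small_key, large_key))
    return joined

# Hypothetical sample data: items 1..4, orders referencing item ids 0..5.
items = [{"id": i, "name": f"item-{i}"} for i in range(1, 5)]
orders = [{"o_id": 100 + n, "i_id": n % 6} for n in range(12)]
result = sorted(shuffle_hash_join(items, orders, "id", "i_id"), key=lambda r: r["o_id"])
```

Each partition only needs a hash table for its own share of the small table, which is why this strategy tolerates a build side that is too large to broadcast whole.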

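Conditions: the average partition size of the smaller (Build) table, that is, its total size divided by the number of shuffle partitions, must be below the broadcast threshold so that each partition's hash table fits in memory, and that table must be significantly smaller (approximately three times smaller) than the other. Spark also prefers Sort‑Merge Join by default (spark.sql.join.preferSortMergeJoin = true), so Shuffle Hash Join is normally considered only when that setting is false.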
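The two size conditions can be written down as a small check. This is a sketch of the rule as described here, not Spark's planner code: the function name, its parameters, and the reading of "average partition size" as the smaller table's size divided by the number of shuffle partitions are assumptions for illustration.

```python
MB = 1024 * 1024

def can_use_shuffle_hash_join(small_bytes, large_bytes, num_partitions,
                              broadcast_threshold=10 * MB, size_ratio=3):
    """Return True if both size conditions for a shuffle hash join hold."""
    # Condition 1: one partition of the build side should be small enough
    # to hold as an in-memory hash table.
    avg_partition_bytes = small_bytes / num_partitions
    fits_in_memory = avg_partition_bytes < broadcast_threshold
    # Condition 2: the build side is at least size_ratio times smaller.
    much_smaller = small_bytes * size_ratio <= large_bytes
    return fits_in_memory and much_smaller

# Hypothetical sizes with 200 shuffle partitions.
ok = can_use_shuffle_hash_join(500 * MB, 5 * 1024 * MB, 200)           # 2.5 MB per partition
too_big = can_use_shuffle_hash_join(4 * 1024 * MB, 20 * 1024 * MB, 200)  # about 20 MB per partition
too_close = can_use_shuffle_hash_join(500 * MB, 1024 * MB, 200)        # only about 2x smaller
```

With these inputs only the first case passes both checks; the second fails the partition-size condition and the third fails the size-ratio condition.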

Sort‑Merge Join

For joins between two large tables, Spark uses Sort‑Merge Join, which first shuffles both tables on the join key, then sorts each partition, and finally merges the sorted streams, processing rows in a streaming fashion without loading entire tables into memory.

This approach reduces memory pressure and improves stability for massive datasets.
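The sort and merge phases for a single partition can be sketched in plain Python. This is an illustrative inner join only, with hypothetical names and sample rows: the shuffle step is omitted, and the inputs are held in lists for simplicity, whereas a streaming implementation would only buffer the current run of equal keys.

```python
def sort_merge_join(left_rows, right_rows, left_key, right_key):
    """Inner equi-join of two lists of dicts by sorting, then merging."""
    # Sort phase: order both inputs by the join key.
    left = sorted(left_rows, key=lambda r: r[left_key])
    right = sorted(right_rows, key=lambda r: r[right_key])
    joined = []
    i = j = 0
    # Merge phase: advance whichever side currently has the smaller key.
    while i < len(left) and j < len(right):
        lk, rk = left[i][left_key], right[j][right_key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][right_key] == lk:
                j_end += 1
            # Pair every left row with that key against the whole run.
            while i < len(left) and left[i][left_key] == lk:
                for r in right[j:j_end]:
                    joined.append({**left[i], **r})
                i += 1
            j = j_end
    return joined

# Hypothetical sample data with a duplicate key on the order side.
orders = [
    {"o_id": 10, "i_id": 2},
    {"o_id": 11, "i_id": 1},
    {"o_id": 12, "i_id": 1},
    {"o_id": 13, "i_id": 3},
]
items = [{"id": 1, "name": "pen"}, {"id": 2, "name": "ink"}, {"id": 4, "name": "cap"}]
result = sort_merge_join(orders, items, "i_id", "id")
```

Because both inputs are sorted, neither side needs a hash table over a whole partition, which is where the lower memory pressure comes from.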

Conclusion

Choosing the appropriate join algorithm depends on table sizes. Use Broadcast Hash Join when one table is small enough to broadcast, Shuffle Hash Join when the smaller table is too large to broadcast but each of its partitions still fits in memory, and Sort‑Merge Join for two large tables. The spark.sql.autoBroadcastJoinThreshold setting can be raised to favor broadcast joins when executors have enough memory to hold the broadcast table.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
