Big Data 12 min read

Understanding Join Implementations in Spark SQL

This article explains the various join types supported by Spark SQL, describes the overall Spark SQL execution flow, and details the physical implementation processes of inner, outer, semi, anti, broadcast, sort‑merge, and hash joins, helping developers grasp how joins are executed in a distributed environment.

Big Data Technology & Architecture
Big Data Technology & Architecture
Big Data Technology & Architecture
Understanding Join Implementations in Spark SQL

Join is a crucial SQL syntax feature, and almost every moderately complex data analysis scenario relies on it. Spark SQL (Dataset/DataFrame) has become the mainstream way to develop Spark applications, so developers need to understand how joins are organized and executed in Spark.

Overview of SparkSQL Process

Before discussing join implementation, we briefly introduce the overall SparkSQL workflow. There are two ways to use SparkSQL: writing raw SQL statements (which require a metastore such as Hive) or using the Dataset/DataFrame APIs to build query plans. The query plan consists of a logical plan and a physical plan. After parsing, analysis, and a series of optimizations, the logical plan is transformed into a physical plan, which is then mapped to RDD execution.

The article focuses on the physical execution of joins.

Basic Elements of a Join

A join generally involves three elements: join type, join condition, and filter condition (the filter can also be expressed as an AND clause within the join condition).

Spark supports all join types, including:

inner join

left outer join

right outer join

full outer join

left semi join

left anti join

Basic Join Execution Flow

Generally, Spark abstracts the two tables participating in a join as a streaming table (streamIter) and a build table (buildIter). The streaming table is usually the larger one, while the build table is the smaller one; Spark automatically decides which table plays each role based on the join statement.

During execution, Spark iterates over streamIter; for each rowA it computes keyA, looks up matching rowsB in buildIter (where keyB == keyA), joins each matching rowB with rowA, and finally applies any filter conditions to produce the result set.

Because each row from streamIter requires a lookup in buildIter, the build side must be a data structure with good lookup performance. Spark provides three join implementations: sort‑merge join, broadcast join, and hash join.

Sort‑Merge Join

To join records, they must reside in the same partition, which typically requires a shuffle. In the shuffle write phase, records are partitioned by the join key. In the shuffle read phase, both streamIter and buildIter are merge‑sorted. Since both tables are sorted, Spark can perform sequential scans, starting each new lookup from the previous position, resulting in efficient lookup performance.

Broadcast Join

If the build side is very small, Spark can skip the shuffle and broadcast the entire build table to every executor, storing it in a hash table (also known as a map join). Spark automatically chooses broadcast join when the estimated size of the build side is below the configuration parameter spark.sql.autoBroadcastJoinThreshold (default 10 MB).

Hash Join

In a hash join, Spark does not sort records during the shuffle read phase. Instead, it builds a hash table from the build side within each partition, allowing constant‑time lookups. However, the build side must be small enough to fit into memory per partition. Hash join is disabled by default and is used only when the following conditions are met:

The build side size exceeds the broadcast threshold, so broadcast join is not applicable.

The configuration spark.sql.join.preferSortMergeJoin is set to false.

The average size of each partition is below the broadcast threshold, ensuring the hash table fits in memory.

The stream side is at least three times larger than the build side.

Specific Join Types

Inner Join

Both sides must have matching rows. Spark’s optimizer automatically treats the larger table as the streaming side and the smaller table as the build side to improve lookup efficiency.

Left Outer Join

The left table is the primary side; if no matching row is found on the right, Spark outputs a row with nulls for the right‑side columns.

Right Outer Join

Symmetric to left outer join; the right table is primary, and unmatched left rows are filled with nulls.

Full Outer Join

Combines the behavior of left and right outer joins. Spark implements it using sort‑merge join, treating each side alternately as streamIter and buildIter.

Left Semi Join

Returns only rows from the left table that have at least one matching row on the right; the right‑side columns are not returned.

Left Anti Join

Returns rows from the left table that have no matching rows on the right.

Summary

Join is a fundamental operation in database queries; SparkSQL, as a distributed data‑warehouse system, provides comprehensive join support and performs many hidden optimizations. Understanding the implementation details of different join strategies helps developers better grasp the execution path of their Spark applications.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

JOINdataframedistributed computingspark-sql
Big Data Technology & Architecture
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.