Understanding Spark SQL Join Strategies, Catalyst Optimizer, and Tungsten for Big Data Processing
This article explains Spark SQL join classifications, the mechanics of Nested Loop Join, Sort‑Merge Join, and Hash Join, and describes how the Catalyst optimizer and Tungsten project improve query execution and memory efficiency in large‑scale data environments.
This article is part of the "Path to Big Data Mastery" PDF series, which can be downloaded by replying with "PDF" to the public account.
Spark SQL Joins
Spark SQL supports several join types such as Nested Loop Join (NLJ), Sort‑Merge Join (SMJ), and Hash Join (HJ), as well as Shuffle Join and Broadcast Join.
From the perspective of data distribution, joins are divided into two major categories: Shuffle Join and Broadcast Join.
From the implementation mechanism, joins can be further classified into NLJ, SMJ, and HJ.
The Cartesian product of the two distribution modes and three implementation mechanisms results in five join strategies supported by Spark (the white BroadCast SMJ is not supported).
The selection rules are:
Equi‑join: Broadcast HJ > Shuffle SMJ > Shuffle HJ
Non‑equi join: Broadcast NLJ > Shuffle NLJ
Nested Loop Join
In NLJ, the left table (driver table) and the right table (base table) are scanned with two nested loops. If the driver table has M rows and the base table has N rows, the computational complexity is O(M × N), which is simple but inefficient.
Sort‑Merge Join
When both tables are large, Spark SQL uses Sort‑Merge Join, which first shuffles data by join keys, partitions them, sorts each partition, and then merges matching keys. This avoids loading an entire side into memory and has a complexity of O(M + N).
Hash Join
Hash Join aims to reduce the build‑side scan cost to O(1) by constructing a hash table on the base table (Build phase) and probing it with the driver table (Probe phase). Matching keys are combined to produce the join result.
Spark SQL Optimizer
Catalyst Optimizer
The Catalyst optimizer creates and refines execution plans through syntax tree generation, logical optimization, and physical optimization.
Key optimization steps include:
Parse SQL and generate an abstract syntax tree (AST).
Attach metadata (column names and types) to the AST.
Apply optimization rules to the enriched AST.
Generate a physical plan from the logical plan, which is then executed as RDDs.
Important optimizations are Predicate Pushdown (filtering early) and Column Pruning (removing unused columns) to reduce data processed.
Tungsten
Tungsten is a major Spark improvement focused on memory and CPU efficiency, covering three areas:
Memory Management and Binary Processing: off‑heap memory to reduce object overhead and GC pauses.
Cache‑aware Computation: optimized storage for better CPU cache hit rates.
Code Generation: faster execution of Spark SQL code.
Tungsten introduces the Unsafe Row binary format, a compact byte‑array representation of DataFrame rows that dramatically reduces storage overhead and speeds up data access compared to standard Java objects.
How to Get the Full PDF Series
The "Path to Big Data Mastery" collection is fully available in PDF form. Reply with "PDF" to the public account to receive the Alibaba Cloud Drive download link.
All articles have been organized by topic for easy lookup in the public account.
Hi, I am Wang Zhiwu, an original author in the big data field. Follow me for the latest industry news.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
