Understanding Spark Catalyst and Tungsten Optimizations in Spark SQL
This article explains how Spark SQL's Catalyst optimizer performs logical and physical planning, details the Tungsten engine's data‑structure and whole‑stage code generation improvements, compares them with the Volcano iterator model, and provides code examples and PDF resources for deeper study.
The article introduces Spark SQL as the officially recommended engine in Spark 3.0, noting that Spark SQL work accounts for nearly 50% of the changes in that release, while other components such as PySpark, MLlib, and Streaming receive far less attention.
Catalyst Optimization
Catalyst consists of two main stages: logical optimization and physical optimization.
Catalyst Logical Optimization
During logical optimization, the plan progresses from an Unresolved Logical Plan to an Analyzed Logical Plan, and finally to an Optimized Logical Plan using heuristic‑based rules.
Convert the Unresolved Logical Plan into an Analyzed Logical Plan by resolving table and column references against the catalog.
Apply heuristic rules to transform the Analyzed Logical Plan into the Optimized Logical Plan.
The most common rule categories are Predicate Pushdown, Column Pruning, and Constant Folding.
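To make the idea of a heuristic rewrite rule concrete, here is a minimal sketch of Constant Folding on a toy expression tree. The types `Expr`, `Lit`, `Col`, `Add`, and `Mul` and the object `ConstantFolding` are hypothetical names invented for illustration; they are not Catalyst's classes, and Catalyst's real rules operate on its own tree nodes.

```scala
// Toy expression tree; hypothetical types, not Catalyst's own classes.
sealed trait Expr
final case class Lit(value: Int) extends Expr
final case class Col(name: String) extends Expr
final case class Add(left: Expr, right: Expr) extends Expr
final case class Mul(left: Expr, right: Expr) extends Expr

object ConstantFolding {
  // Rewrite bottom-up: fold the children first, then collapse nodes
  // whose operands are both literals into a single literal.
  def fold(e: Expr): Expr = e match {
    case Add(l, r) =>
      (fold(l), fold(r)) match {
        case (Lit(a), Lit(b)) => Lit(a + b)
        case (fl, fr)         => Add(fl, fr)
      }
    case Mul(l, r) =>
      (fold(l), fold(r)) match {
        case (Lit(a), Lit(b)) => Lit(a * b)
        case (fl, fr)         => Mul(fl, fr)
      }
    case other => other // literals and column references stay unchanged
  }
}
```

With these definitions, `ConstantFolding.fold(Add(Col("x"), Mul(Lit(2), Lit(3))))` returns `Add(Col("x"), Lit(6))`: the literal-only subtree is evaluated once at planning time instead of once per row.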
Catalyst Physical Optimization
Physical optimization starts from the Optimized Logical Plan, generates a Spark Plan, and then produces a Physical Plan. Two phases are involved:
Mapping logical operators to physical operators using predefined strategies to create the Spark Plan.
Applying preparation rules (e.g., EnsureRequirements) to refine the Spark Plan into an executable Physical Plan.
<code style="padding:16px;color:#ddd;font-family:Operator Mono,Consolas,Monaco,Menlo,monospace;font-size:12px">protected[sql] val planner = new SparkPlanner

// Holds the strategies that map a logical plan to physical operators
protected[sql] class SparkPlanner extends SparkStrategies {
  val sparkContext: SparkContext = self.sparkContext
  val sqlContext: SQLContext = self

  def codegenEnabled: Boolean = self.conf.codegenEnabled
  def unsafeEnabled: Boolean = self.conf.unsafeEnabled
  def numPartitions: Int = self.conf.numShufflePartitions

  // Convert LogicalPlan to actual operations; implementations are in org.apache.spark.sql.execution
  def strategies: Seq[Strategy] =
    experimental.extraStrategies ++ (
      DataSourceStrategy ::
      DDLStrategy ::
      TakeOrdered ::
      HashAggregation ::
      LeftSemiJoin ::
      HashJoin ::
      InMemoryScans ::
      ParquetOperations ::
      BasicOperators ::
      CartesianProduct ::
      BroadcastNestedLoopJoin :: Nil)
  ...
}
</code>
This excerpt appears to come from the Spark 1.x SQLContext, so the strategy names differ from those in Spark 3.0, but the mechanism is the same: the planner tries its strategies in order. The EnsureRequirements rule guarantees correct partitioning and ordering by inserting shuffle or sort operations when needed.
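The effect of such a preparation rule can be sketched on a toy plan tree. Everything below (`Partitioning`, `Plan`, `Scan`, `Exchange`, `HashAggregate`, `EnsureRequirementsSketch`) is a hypothetical simplification for illustration. It assumes that an aggregate requires its input to be hash-partitioned on the grouping keys, and it ignores the ordering requirements that the real rule also handles.

```scala
// Hypothetical toy model; Spark's real rule works on SparkPlan nodes.
sealed trait Partitioning
case object UnknownPartitioning extends Partitioning
final case class HashPartitioning(keys: Seq[String]) extends Partitioning

sealed trait Plan { def output: Partitioning }
final case class Scan(table: String) extends Plan {
  val output: Partitioning = UnknownPartitioning
}
// Exchange stands in for a shuffle that repartitions rows by key.
final case class Exchange(keys: Seq[String], child: Plan) extends Plan {
  val output: Partitioning = HashPartitioning(keys)
}
final case class HashAggregate(keys: Seq[String], child: Plan) extends Plan {
  val output: Partitioning = child.output
}

object EnsureRequirementsSketch {
  // Insert an Exchange below an aggregate only when the child's output
  // partitioning does not already match the grouping keys.
  def apply(plan: Plan): Plan = plan match {
    case HashAggregate(keys, child) =>
      val fixedChild = apply(child)
      val satisfied = fixedChild.output == HashPartitioning(keys)
      HashAggregate(keys, if (satisfied) fixedChild else Exchange(keys, fixedChild))
    case Exchange(keys, child) => Exchange(keys, apply(child))
    case scan: Scan            => scan
  }
}
```

Applying the rule to `HashAggregate(Seq("k"), Scan("t"))` yields `HashAggregate(Seq("k"), Exchange(Seq("k"), Scan("t")))`. Applying it again leaves the plan unchanged, because the requirement is already met.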
Tungsten Optimization
Tungsten improves Spark's execution engine in two ways: data‑structure design and Whole‑Stage Code Generation (WSCG).
Data Structure Design
Tungsten stores each row as an UnsafeRow, a compact binary format backed by a byte array, which avoids the per-object overhead of Java objects. It also uses memory pages to manage on‑heap and off‑heap memory uniformly, improving GC behavior and cache locality.
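A rough feel for the binary row idea can be had from the sketch below, which packs a null bitset and fixed-width 8-byte slots into one buffer. `CompactRow` is a hypothetical class written for this illustration. It assumes at most 64 fields, supports only `Long` values, and omits the variable-length region that a real UnsafeRow has.

```scala
import java.nio.ByteBuffer

// Hypothetical simplified layout: [null bitset: 8 bytes][one 8-byte slot per field].
// Only Long fields are supported; variable-length data is left out.
final class CompactRow(numFields: Int) {
  require(numFields > 0 && numFields <= 64, "sketch supports 1 to 64 fields")
  private val buf = ByteBuffer.allocate(8 + 8 * numFields)

  // Byte offset of field i, just past the 8-byte null bitset.
  private def slot(i: Int): Int = 8 + 8 * i

  def setLong(i: Int, v: Long): Unit = {
    buf.putLong(0, buf.getLong(0) & ~(1L << i)) // clear the null bit
    buf.putLong(slot(i), v)
  }

  def setNull(i: Int): Unit = {
    buf.putLong(0, buf.getLong(0) | (1L << i)) // set the null bit
    buf.putLong(slot(i), 0L)
  }

  def isNull(i: Int): Boolean = (buf.getLong(0) & (1L << i)) != 0L
  def getLong(i: Int): Long = buf.getLong(slot(i))
  def sizeInBytes: Int = buf.capacity()
}
```

A three-field row occupies 32 bytes in one contiguous buffer. There are no boxed `java.lang.Long` objects for the garbage collector to track, and field access is a fixed-offset read.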
Whole‑Stage Code Generation (WSCG)
WSCG dynamically generates Java code that fuses multiple operators into a single function, eliminating virtual function dispatch and reducing memory traffic, which yields significant performance gains over the traditional Volcano iterator model.
Volcano Iterator Model
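The original Spark SQL engine used the Volcano iterator model, where each operator implements a next() method that returns one row at a time. While flexible and easy to compose, this approach pays for at least one virtual call per operator per row and passes intermediate rows through memory rather than CPU registers.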
<code style="padding:16px;color:#ddd;font-family:Operator Mono,Consolas,Monaco,Menlo,monospace;font-size:12px">// Operator and Row are assumed to be defined elsewhere; next() returns null at end of input.
class Filter(child: Operator, predicate: (Row => Boolean)) extends Operator {
  def next(): Row = {
    var current = child.next()
    // Skip rows that fail the predicate; stop on a match or at end of input.
    while (current != null && !predicate(current)) {
      current = child.next()
    }
    current
  }
}
</code>
Hand‑written loops avoid these overheads by keeping data in registers and allowing the compiler to apply loop unrolling and SIMD optimizations.
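The contrast can be sketched with a toy query that counts the rows equal to a key. The names `Operator`, `ArrayScan`, `FilterOp`, and `CountExample` are hypothetical and exist only for this illustration. Rows are plain `Long` values, and `Option` replaces `null` as the end-of-input signal. The fused version is hand-written here to show the shape of code that whole-stage code generation aims to produce; it is not Spark's generated output.

```scala
// Hypothetical toy operators over Long values.
trait Operator { def next(): Option[Long] }

final class ArrayScan(data: Array[Long]) extends Operator {
  private var i = 0
  def next(): Option[Long] =
    if (i < data.length) { val v = data(i); i += 1; Some(v) } else None
}

final class FilterOp(child: Operator, predicate: Long => Boolean) extends Operator {
  def next(): Option[Long] = {
    var current = child.next()
    // Keep pulling while there is a row and it fails the predicate.
    while (current.exists(v => !predicate(v))) current = child.next()
    current
  }
}

object CountExample {
  // Volcano style: each row costs one next() call per operator in the chain.
  def volcanoCount(data: Array[Long], key: Long): Long = {
    val root = new FilterOp(new ArrayScan(data), _ == key)
    var count = 0L
    while (root.next().isDefined) count += 1
    count
  }

  // Fused style: scan, filter, and count collapsed into a single loop.
  def fusedCount(data: Array[Long], key: Long): Long = {
    var count = 0L
    var i = 0
    while (i < data.length) {
      if (data(i) == key) count += 1
      i += 1
    }
    count
  }
}
```

Both functions return the same count for the same input. The fused loop has no operator boundaries, so the counter and index can stay in registers and the loop body is a simple candidate for JIT optimization.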
Overall, the article provides a deep dive into Spark's Catalyst and Tungsten optimizations, compares them with the older Volcano model, and includes code snippets and references for further exploration.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
