Understanding Spark Catalyst and Tungsten Optimizations in Spark SQL

This article explains how Spark SQL's Catalyst optimizer performs logical and physical planning, details the Tungsten engine's data‑structure and whole‑stage code generation improvements, compares them with the Volcano iterator model, and provides code examples and PDF resources for deeper study.

Big Data Technology & Architecture

The article introduces Spark SQL as the officially recommended engine in Spark 3.0, noting that Spark SQL optimizations make up nearly 50% of the changes in that release, while other components such as PySpark, MLlib, and Streaming receive far less attention.

Catalyst Optimization

Catalyst consists of two main stages: logical optimization and physical optimization.

Catalyst Logical Optimization

During logical optimization, the plan progresses from an Unresolved Logical Plan to an Analyzed Logical Plan, and finally to an Optimized Logical Plan using heuristic‑based rules.

Convert the Unresolved Logical Plan to an Analyzed Logical Plan by resolving table and column references against the catalog.

Apply heuristic rules to transform Analyzed Logical Plan into Optimized Logical Plan.

The most common rule categories are Predicate Pushdown, Column Pruning, and Constant Folding.
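To make the idea of a heuristic rule concrete, here is a minimal sketch of constant folding over a toy expression tree. The `Expr` classes, `transformUp`, and `foldConstants` are illustrative names assumed for this example; they mimic the general shape of a Catalyst rule but are not Spark's own classes.

```scala
// Toy model of a rule-based rewrite; these types are hypothetical, not Spark's.
object ConstantFoldingSketch {
  sealed trait Expr
  case class Literal(value: Int) extends Expr
  case class Column(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr
  case class Multiply(left: Expr, right: Expr) extends Expr

  // Apply `rule` bottom-up: rewrite the children first, then the node itself.
  def transformUp(expr: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
    val withNewChildren = expr match {
      case Add(l, r)      => Add(transformUp(l)(rule), transformUp(r)(rule))
      case Multiply(l, r) => Multiply(transformUp(l)(rule), transformUp(r)(rule))
      case leaf           => leaf
    }
    if (rule.isDefinedAt(withNewChildren)) rule(withNewChildren) else withNewChildren
  }

  // Constant folding: an operator whose inputs are all literals is evaluated
  // once at planning time instead of once per row at run time.
  val foldConstants: PartialFunction[Expr, Expr] = {
    case Add(Literal(a), Literal(b))      => Literal(a + b)
    case Multiply(Literal(a), Literal(b)) => Literal(a * b)
  }

  def optimize(expr: Expr): Expr = transformUp(expr)(foldConstants)
}
```

With these definitions, `optimize(Add(Column("price"), Multiply(Literal(2), Literal(3))))` rewrites the constant subtree and leaves the column reference alone. Predicate pushdown and column pruning follow the same pattern: a pattern match over the tree that returns an equivalent but cheaper tree.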

Catalyst Physical Optimization

Physical optimization starts from the Optimized Logical Plan, generates a Spark Plan, and then produces a Physical Plan. Two phases are involved:

Mapping logical operators to physical operators using predefined strategies to create the Spark Plan.

Applying preparation rules (e.g., EnsureRequirements) to refine the Spark Plan into an executable Physical Plan.

<code style="padding:16px;color:#ddd;font-family:Operator Mono,Consolas,Monaco,Menlo,monospace;font-size:12px">protected[sql] val planner = new SparkPlanner
// contains strategies for optimizing the physical execution plan
protected[sql] class SparkPlanner extends SparkStrategies {
  val sparkContext: SparkContext = self.sparkContext
  val sqlContext: SQLContext = self
  def codegenEnabled: Boolean = self.conf.codegenEnabled
  def unsafeEnabled: Boolean = self.conf.unsafeEnabled
  def numPartitions: Int = self.conf.numShufflePartitions
  // Convert LogicalPlan to actual operations; implementations are in org.apache.spark.sql.execution
  def strategies: Seq[Strategy] =
    experimental.extraStrategies ++ (
      DataSourceStrategy ::
      DDLStrategy ::
      TakeOrdered ::
      HashAggregation ::
      LeftSemiJoin ::
      HashJoin ::
      InMemoryScans ::
      ParquetOperations ::
      BasicOperators ::
      CartesianProduct ::
      BroadcastNestedLoopJoin :: Nil)
  ...
}
</code>

The EnsureRequirements rule guarantees correct partitioning and ordering by inserting shuffle or sort operations when needed.
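The partitioning half of that idea can be sketched in a few lines. This is a toy model under stated assumptions: `PhysicalPlan`, `Scan`, `ShuffleExchange`, and `HashAggregate` are simplified stand-ins invented for the example, and only one requirement is checked (an aggregate needs its input hash-partitioned on the grouping keys). Required orderings would be handled analogously by inserting a sort.

```scala
// Toy model of inserting a shuffle where partitioning is not satisfied.
// All types here are hypothetical simplifications, not Spark's classes.
object EnsureRequirementsSketch {
  sealed trait Partitioning
  case object UnknownPartitioning extends Partitioning
  case class HashPartitioning(keys: Seq[String]) extends Partitioning

  sealed trait PhysicalPlan { def outputPartitioning: Partitioning }

  case class Scan(table: String) extends PhysicalPlan {
    def outputPartitioning: Partitioning = UnknownPartitioning
  }
  case class ShuffleExchange(keys: Seq[String], child: PhysicalPlan) extends PhysicalPlan {
    def outputPartitioning: Partitioning = HashPartitioning(keys)
  }
  case class HashAggregate(groupingKeys: Seq[String], child: PhysicalPlan) extends PhysicalPlan {
    def outputPartitioning: Partitioning = child.outputPartitioning
  }

  // Walk the plan; add a shuffle below an aggregate only when its child is
  // not already hash-partitioned on the grouping keys.
  def ensureRequirements(plan: PhysicalPlan): PhysicalPlan = plan match {
    case HashAggregate(keys, child) =>
      val fixedChild = ensureRequirements(child)
      if (fixedChild.outputPartitioning == HashPartitioning(keys)) HashAggregate(keys, fixedChild)
      else HashAggregate(keys, ShuffleExchange(keys, fixedChild))
    case ShuffleExchange(keys, child) =>
      ShuffleExchange(keys, ensureRequirements(child))
    case scan: Scan =>
      scan
  }
}
```

Applying the function to a plan that already satisfies the requirement returns it unchanged, which is why a rule of this kind can run safely over any Spark Plan.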

Tungsten Optimization

Tungsten improves Spark's execution engine in two ways: data‑structure design and Whole‑Stage Code Generation (WSCG).

Data Structure Design

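Tungsten introduces UnsafeRow, a binary row format that packs an entire row into a single byte array instead of a graph of Java objects, which reduces storage overhead. It also manages on‑heap and off‑heap memory uniformly through memory pages, which reduces GC pressure and improves cache locality.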
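The following sketch shows the basic layout idea: a null bitset followed by one 8‑byte slot per field, all inside one byte array. It is a simplified illustration, not Spark's implementation. The class name `BinaryRow` is hypothetical, only `Long` fields are supported, and the variable‑length region that UnsafeRow uses for strings and other variable‑width values is omitted.

```scala
import java.nio.ByteBuffer

// Toy fixed-width binary row: [null bitset][8-byte slot per field].
// Hypothetical class for illustration; not Spark's UnsafeRow.
object BinaryRowSketch {
  final class BinaryRow(val numFields: Int) {
    // One bit per field, rounded up to whole 8-byte words.
    private val bitsetBytes: Int = ((numFields + 63) / 64) * 8
    private val buffer: ByteBuffer = ByteBuffer.allocate(bitsetBytes + numFields * 8)

    def sizeInBytes: Int = buffer.capacity()

    private def wordOffset(ordinal: Int): Int = (ordinal / 64) * 8
    private def bitMask(ordinal: Int): Long = 1L << (ordinal % 64)
    private def slotOffset(ordinal: Int): Int = bitsetBytes + ordinal * 8

    def isNullAt(ordinal: Int): Boolean =
      (buffer.getLong(wordOffset(ordinal)) & bitMask(ordinal)) != 0L

    // Mark the field as null and zero its slot.
    def setNullAt(ordinal: Int): Unit = {
      buffer.putLong(wordOffset(ordinal), buffer.getLong(wordOffset(ordinal)) | bitMask(ordinal))
      buffer.putLong(slotOffset(ordinal), 0L)
    }

    // Clear the null bit and write the value at a fixed, computable offset.
    def setLong(ordinal: Int, value: Long): Unit = {
      buffer.putLong(wordOffset(ordinal), buffer.getLong(wordOffset(ordinal)) & ~bitMask(ordinal))
      buffer.putLong(slotOffset(ordinal), value)
    }

    def getLong(ordinal: Int): Long = buffer.getLong(slotOffset(ordinal))
  }
}
```

Because every field sits at an offset that can be computed from its ordinal, reading a value needs no object dereference, and the garbage collector sees one array per row rather than one object per field.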

Whole‑Stage Code Generation (WSCG)

WSCG dynamically generates Java code that fuses multiple operators into a single function, eliminating virtual function dispatch and reducing memory traffic, which yields significant performance gains over the traditional Volcano iterator model.

Volcano Iterator Model

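The original Spark SQL engine used the Volcano iterator model, in which each operator implements a next() method and pulls rows one at a time from its child. The model is flexible and easy to compose, but every row costs at least one virtual call per operator, and intermediate rows are passed between operators through memory rather than staying in CPU registers.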

<code style="padding:16px;color:#ddd;font-family:Operator Mono,Consolas,Monaco,Menlo,monospace;font-size:12px">class Filter(child: Operator, predicate: (Row => Boolean)) extends Operator {
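  def next(): Row = {
    var current = child.next()
    // Skip rows that fail the predicate; stop at the first match
    // or when the child is exhausted (signalled by null).
    while (current != null && !predicate(current)) {
      current = child.next()
    }
    current
  }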
}
</code>

Hand‑written loops avoid these overheads by keeping data in registers and allowing the compiler to apply loop unrolling and SIMD optimizations.
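The contrast can be sketched with a filter-and-count query written both ways. This is an illustration under stated assumptions: `Row`, `Scan`, `Count`, and the two `count…` functions are toy definitions for this example, and the fused version is hand-written Scala that only shows the shape of code WSCG aims for, whereas Spark itself generates Java source at run time.

```scala
// Same query two ways: count rows whose itemId equals `wanted`.
// All names are hypothetical and chosen for illustration.
object VolcanoVsFusedSketch {
  final case class Row(itemId: Long)

  // Volcano style: next() returns the next row, or null when exhausted.
  trait Operator { def next(): Row }

  class Scan(rows: Array[Row]) extends Operator {
    private var position = 0
    def next(): Row =
      if (position < rows.length) { val row = rows(position); position += 1; row }
      else null
  }

  class Filter(child: Operator, predicate: Row => Boolean) extends Operator {
    def next(): Row = {
      var current = child.next()
      while (current != null && !predicate(current)) current = child.next()
      current
    }
  }

  class Count(child: Operator) {
    def result(): Long = {
      var count = 0L
      while (child.next() != null) count += 1
      count
    }
  }

  // Every row crosses two virtual next() calls and one function-object call.
  def countVolcano(rows: Array[Row], wanted: Long): Long =
    new Count(new Filter(new Scan(rows), _.itemId == wanted)).result()

  // Fused style: scan, filter, and count collapsed into a single loop,
  // with the counter and index held in local variables.
  def countFused(rows: Array[Row], wanted: Long): Long = {
    var count = 0L
    var i = 0
    while (i < rows.length) {
      if (rows(i).itemId == wanted) count += 1
      i += 1
    }
    count
  }
}
```

Both functions return the same result for any input; the difference lies only in how much dispatch and memory traffic each row incurs along the way.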

Overall, the article provides a deep dive into Spark's Catalyst and Tungsten optimizations, compares them with the older Volcano model, and includes code snippets and references for further exploration.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
