How Spark’s Catalyst Optimizer Transforms SQL Queries: Trees, Rules, and Code Generation
This article explains Spark SQL’s Catalyst optimizer, describing its extensible design, tree‑based representation, rule‑driven transformations, batch execution to a fixed point, and how Scala’s pattern matching and quasiquotes enable efficient analysis, logical optimization, physical planning, and code generation.
This article is based on the fourth section of the paper "Spark SQL: Relational Data Processing in Spark" which introduces the Catalyst optimizer.
Catalyst Optimizer
To implement Spark SQL we designed a new extensible optimizer called Catalyst, built on Scala’s functional programming constructs. Its design has two goals:
Make it easy to add new optimization techniques and features, especially for big‑data challenges such as semi‑structured data and advanced analytics.
Allow developers to extend the optimizer, for example by adding data‑source‑specific rules that push filters or aggregates down to external storage or by supporting new data types.
Catalyst supports both rule‑based and cost‑based optimization. Unlike earlier extensible optimizers that required complex domain‑specific languages and compiler pipelines, Catalyst leverages standard Scala features like pattern matching, giving developers the full power of a programming language while keeping rule specification simple.
The core of Catalyst includes a generic library for representing tree structures and applying rules to manipulate those trees. On top of this framework we built libraries for relational query processing (expressions, logical plans) and multiple rule sets that handle analysis, logical optimization, physical planning, and code generation. In the code‑generation stage, quasiquotes are used to build Scala ASTs that are compiled to Java bytecode at runtime.
4.1 Trees
The main data type in Catalyst is a tree of node objects. Each node has a type and zero or more children, defined as subclasses of TreeNode. Nodes are immutable and can be transformed functionally.
Example node classes:
Literal(value: Int) – a constant value.
Attribute(name: String) – an input column, e.g., "x".
Add(left: TreeNode, right: TreeNode) – the sum of two expressions.
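As a minimal sketch (illustrative only; Catalyst's real TreeNode classes carry more machinery), these nodes could be modeled as plain Scala case classes:
// Simplified stand-ins for Catalyst's TreeNode subclasses, for illustration only.
trait TreeNode
case class Literal(value: Int) extends TreeNode                    // a constant value
case class Attribute(name: String) extends TreeNode                // an input column, e.g. "x"
case class Add(left: TreeNode, right: TreeNode) extends TreeNode   // the sum of two expressions
Because they are case classes, the nodes are immutable and can be compared and rebuilt structurally, which is what makes functional transformations convenient.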
The expression x + (1 + 2) can be represented as:
Add(Attribute("x"), Add(Literal(1), Literal(2)))
4.2 Rules
Rules map one tree to another. In Catalyst a rule is typically expressed as a set of pattern‑matching functions that find and replace sub‑trees with a specific structure.
Catalyst provides a transform method that recursively applies a partial function to every node in the tree.
Example rule that merges two constant literals:
tree.transform {
  case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)
}
Applying this rule to the tree for x + (1 + 2) yields a new tree, x + 3.
The case keyword is standard Scala pattern matching, allowing extraction of values (e.g., c1 and c2). The partial function only needs to match a subset of possible inputs.
Catalyst automatically skips parts of the tree that do not match a rule and continues traversing, so rules need only handle the structures they care about.
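To make the traversal concrete, here is a simplified sketch of transform over the toy node classes above, written as a standalone function rather than a method on TreeNode; Catalyst's real implementation is more general, so treat this purely as an illustration:
// Illustrative sketch: rebuild the children first, then apply the rule to this node
// if it matches, otherwise leave the node unchanged.
def transform(node: TreeNode)(rule: PartialFunction[TreeNode, TreeNode]): TreeNode = {
  val withNewChildren = node match {
    case Add(left, right) => Add(transform(left)(rule), transform(right)(rule))
    case leaf             => leaf   // Literal and Attribute have no children
  }
  rule.applyOrElse(withNewChildren, (n: TreeNode) => n)
}
Applied with the constant‑folding rule above, this rewrites the Add(Literal(1), Literal(2)) sub‑tree to Literal(3) while leaving the Attribute("x") branch untouched.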
Rules are grouped into batches and repeatedly applied until a fixed point is reached, meaning the tree no longer changes. This batch execution enables simple, independent rules to have a global effect.
Repeated application also enables constant folding on larger trees, such as transforming (x + 0) + (3 + 3) into x + 6. Different batches can perform type inference first and then use those types for further optimizations.
After each batch, developers can run consistency checks (e.g., ensuring all attributes have assigned types) using additional pattern matches.
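Putting these pieces together, a minimal sketch of fixed‑point batch execution (again over the toy classes, not Catalyst's actual RuleExecutor) might look like:
// Illustrative only: apply a batch of rules repeatedly until the tree stops changing,
// with an iteration cap as a safety valve.
def runBatch(tree: TreeNode,
             batch: Seq[PartialFunction[TreeNode, TreeNode]],
             maxIterations: Int = 100): TreeNode = {
  var current = tree
  var iterations = 0
  var changed = true
  while (changed && iterations < maxIterations) {
    val next = batch.foldLeft(current)((t, rule) => transform(t)(rule))
    changed = next != current   // case classes give structural equality
    current = next
    iterations += 1
  }
  current
}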
Because rule bodies can contain arbitrary Scala code, Catalyst is more powerful than optimizers that rely on dedicated DSLs, yet remains concise for simple cases.
Functional transformations on immutable trees make the optimizer easier to understand, debug, and parallelize.
4.3 Using Catalyst in Spark SQL
Spark SQL’s optimization pipeline consists of four stages, illustrated in Figure 3:
(1) Analysis – resolve references, (2) Logical Optimization – rule‑based rewrites, (3) Physical Planning – generate one or more physical plans and choose the cheapest, (4) Code Generation – compile parts of the query to Java bytecode.
During physical planning, Catalyst may generate multiple plans and compare them using a cost model; other stages are purely rule‑driven.
4.3.1 Analysis
Spark SQL starts from a relation that may come from the SQL parser’s abstract syntax tree or from a DataFrame constructed via the API. The relation can contain unresolved attribute references or table names.
For example, in the query SELECT col FROM sales, the type and validity of col can only be determined after looking up the table sales. Unresolved attributes are marked as such.
Analysis uses Catalyst rules together with a Catalog object to resolve attributes. It first builds an “unresolved logical plan” tree and then applies the following steps:
Lookup relations by name in the catalog.
Map named attributes to inputs provided by operator children.
Identify attributes that refer to the same value and assign unique identifiers.
Propagate and enforce types throughout the expression tree.
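As a rough illustration of the attribute‑resolution step (hypothetical, simplified classes, not Spark's actual Analyzer), a rule might replace unresolved references once the output schema of the child operator is known:
// Hypothetical sketch of attribute resolution, reusing the transform sketch from Section 4.2.
case class UnresolvedAttribute(name: String) extends TreeNode
case class ResolvedAttribute(name: String, dataType: String, exprId: Long) extends TreeNode

def resolveAttributes(expr: TreeNode,
                      childOutput: Map[String, ResolvedAttribute]): TreeNode =
  transform(expr) {
    // Replace each named reference with the matching attribute exposed by the child operator.
    case UnresolvedAttribute(name) if childOutput.contains(name) => childOutput(name)
  }
References that cannot be resolved against the catalog or the child's output stay marked as unresolved and later surface as analysis errors.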
4.3.2 Logical Optimization
This stage applies standard rule‑based optimizations to the logical plan, such as constant folding, predicate push‑down, projection pruning, null propagation, and Boolean expression simplification.
Adding new rules is straightforward. For instance, when introducing a fixed‑precision DECIMAL type, a rule can rewrite SUM and AVG expressions on low‑precision DECIMALs to use unscaled 64‑bit LONGs for aggregation, then convert the result back to DECIMAL.
object DecimalAggregates extends Rule[LogicalPlan] {
  /** Maximum number of decimal digits that fit in a Long without overflow. */
  val MAX_LONG_DIGITS = 18

  def apply(plan: LogicalPlan): LogicalPlan = {
    plan transformAllExpressions {
      // Sum low-precision decimals on their unscaled Long values, then convert back.
      case Sum(e @ DecimalType.Expression(prec, scale)) if prec + 10 <= MAX_LONG_DIGITS =>
        MakeDecimal(Sum(UnscaledValue(e)), prec + 10, scale)
    }
  }
}
Similarly, a 12‑line rule can rewrite simple LIKE expressions into calls to String.startsWith or String.contains, demonstrating the ease of expressing optimizations with arbitrary Scala code.
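As a rough sketch of what such a rule could look like (simplified and illustrative, not the actual rule in Spark's source, and glossing over details such as how Catalyst represents string literals internally):
// Simplified, illustrative sketch of LIKE simplification; names and guards are approximate.
object SimplifyLike extends Rule[LogicalPlan] {
  // A pattern fragment with no LIKE wildcards can be compared directly.
  private def isPlain(s: String): Boolean = !s.contains("%") && !s.contains("_")

  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    // input LIKE 'abc%' becomes a starts-with check
    case Like(input, Literal(pattern: String, StringType))
        if pattern.endsWith("%") && isPlain(pattern.dropRight(1)) =>
      StartsWith(input, Literal(pattern.dropRight(1)))

    // input LIKE '%abc%' becomes a contains check
    case Like(input, Literal(pattern: String, StringType))
        if pattern.length > 2 && pattern.startsWith("%") && pattern.endsWith("%") &&
           isPlain(pattern.substring(1, pattern.length - 1)) =>
      Contains(input, Literal(pattern.substring(1, pattern.length - 1)))
  }
}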
4.3.3 Physical Planning
In this stage Spark SQL receives the logical plan and generates one or more physical plans using operators that match the Spark execution engine. A cost model selects the best plan.
Currently, cost‑based optimization is used mainly for choosing join algorithms (e.g., broadcast join for small relations). The framework is designed to support richer cost‑based optimizations in the future.
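As a rough sketch of the kind of decision involved (hypothetical names and threshold, not Spark's actual planner code), broadcast‑join selection can be thought of as a size comparison:
// Hypothetical sketch of size-based join selection; names and threshold are illustrative.
sealed trait PhysicalJoin
case class BroadcastHashJoin(buildSide: String) extends PhysicalJoin
case object ShuffledJoin extends PhysicalJoin

// Broadcast the smaller relation to every executor when its estimated size is below a
// threshold; otherwise fall back to a shuffle-based join.
def chooseJoin(leftSizeBytes: Long, rightSizeBytes: Long,
               broadcastThreshold: Long = 10L * 1024 * 1024): PhysicalJoin =
  if (rightSizeBytes <= broadcastThreshold) BroadcastHashJoin("right")
  else if (leftSizeBytes <= broadcastThreshold) BroadcastHashJoin("left")
  else ShuffledJoin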
Rule‑based physical optimizations also occur, such as collapsing projection or filter operations into a single Spark map, and pushing down operations to data sources that support predicate or projection push‑down.
4.3.4 Code Generation
The final stage generates Java bytecode that runs on each executor. Because Spark SQL works on in‑memory datasets, CPU efficiency is critical, and code generation dramatically improves performance.
Catalyst uses Scala’s quasiquotes to construct Scala ASTs programmatically, which are then compiled to bytecode. For example, the tree nodes from Section 4.1 can be compiled as follows:
def compile(node: Node): AST = node match {
  case Literal(value) => q"$value"
  case Attribute(name) => q"row.get($name)"
  case Add(left, right) => q"${compile(left)} + ${compile(right)}"
}
Quasiquotes are prefixed with q and allow variable insertion using the $ symbol. They are type‑checked at compile time, making them safer and more composable than string concatenation.
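For the tree representing x + (1 + 2), this sketch would splice the pieces together into an AST roughly equivalent to row.get("x") + (1 + 2), so the expression runs as compiled code rather than being interpreted node by node at runtime.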
Generated code is further optimized by the Scala compiler, and the same infrastructure can generate direct field accesses for native Java objects, avoiding the overhead of converting them to Spark SQL rows.
In summary, Catalyst’s four core phases are:
Analysis: resolve column names and types and bind references to tables (fully rule‑based).
Logical Optimization: apply rule‑driven transformations such as constant folding and predicate push‑down.
Physical Planning: generate executable Spark operators, using a simple cost model together with rule‑based physical optimizations.
Code Generation: compile expressions to Java bytecode using quasiquotes, achieving performance close to hand‑written code.