
SparkSQL Logical Plan, Analyzer, and Optimizer: An In‑Depth Overview

This article provides a comprehensive overview of SparkSQL's logical plan architecture, detailing the stages of logical plan creation, analysis, rule‑based optimization, and the underlying catalog and rule systems that transform SQL queries into efficient execution plans.


SparkSQL Logical Plan Overview

The logical plan stage is represented by the LogicalPlan class and consists of three main phases: (1) the SparkSqlParser converts syntax‑tree nodes into unresolved LogicalPlan nodes, (2) the Analyzer applies a series of rules to produce a resolved logical plan, and (3) the Optimizer applies optimization rules to improve efficiency while preserving correctness.
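All three stages can be inspected from a DataFrame's QueryExecution. A minimal sketch (the `student` table and the query are illustrative, not from the original text):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("logical-plan-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Register an illustrative table; any relation known to the catalog works the same way.
Seq(("Alice", 20L), ("Bob", 15L)).toDF("name", "age").createOrReplaceTempView("student")

val df = spark.sql("SELECT name FROM student WHERE age > 18")
println(df.queryExecution.logical)        // unresolved plan produced by the parser
println(df.queryExecution.analyzed)       // resolved plan produced by the Analyzer
println(df.queryExecution.optimizedPlan)  // plan after the Optimizer's rule batches
```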

LogicalPlan Introduction

Overview

The parent class QueryPlan is divided into six modules covering input/output, basic attributes, string representation, normalization, expression handling, and constraints.

Basic Operations and Classification

LeafNode : Represents tables and command‑related logic, including RunnableCommand for DDL/DML operations.

UnaryNode : Handles logical transformations such as filters, repartitioning, script‑based transformations, object consumption, and basic relational operators (Project, Filter, Sort, etc.).

BinaryNode : Used for data combination operations like Join, Set operations, and CoGroup.

Other Types : Includes Union, ObjectProducer, and EventTimeWatermark for streaming watermarks.
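To make the classification concrete, the tree for a query such as SELECT name FROM student WHERE age > 18 can be built by hand from Catalyst's node classes (a sketch; in practice AstBuilder constructs it, as described next):

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.analysis.{UnresolvedAttribute, UnresolvedRelation}
import org.apache.spark.sql.catalyst.expressions.{GreaterThan, Literal}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, Project}

// LeafNode: an unresolved reference to the `student` table
val relation = UnresolvedRelation(TableIdentifier("student"))
// UnaryNode: Filter wrapping the relation with the predicate age > 18
val filtered = Filter(GreaterThan(UnresolvedAttribute("age"), Literal(18)), relation)
// UnaryNode: Project pruning the output down to the `name` column
val unresolvedByHand = Project(Seq(UnresolvedAttribute("name")), filtered)
```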

AstBuilder Mechanism: Generating Unresolved LogicalPlan

The entry point visitSingleStatement recursively traverses the parse tree; whenever a node can be mapped to a LogicalPlan, the result is passed upward. At QuerySpecificationContext, the FromClauseContext is visited first to create an UnresolvedRelation, after which withQuerySpecification extends the plan with filter and projection nodes. The steps below trace this for a simple select, with a sketch after the list.

Generate the table LogicalPlan by creating an UnresolvedRelation when a TableNameContext is encountered.

Add filter logic: expressions from BooleanDefaultContext become filter conditions, producing a Filter node combined with the relation.

Add column‑pruning: Named expressions from NamedExpressionSeqContext are transformed into a Project node that sits above the filtered plan.
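The parser performs these three steps end to end; parsePlan returns the unresolved tree directly (output abbreviated; exact formatting varies by Spark version):

```scala
// SparkSqlParser invokes AstBuilder under the hood.
val unresolved = spark.sessionState.sqlParser.parsePlan(
  "SELECT name FROM student WHERE age > 18")
println(unresolved)
// 'Project ['name]
// +- 'Filter ('age > 18)
//    +- 'UnresolvedRelation `student`
```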

Analyzed LogicalPlan Generation

The AST yields an unresolved tree composed mainly of UnresolvedRelation and UnresolvedAttribute. The Analyzer resolves these objects against the catalog, turning them into typed expressions.

Catalog System

The catalog acts as a hierarchical namespace for functions and metadata. Key components include GlobalTempViewManager (thread‑safe global view handling), FunctionResourceLoader (loads user‑defined and Hive functions), FunctionRegistry (function registration), and ExternalCatalog (persistent metadata management).
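These components sit behind the public Catalog API, which offers a convenient way to observe them (a sketch; method availability varies slightly across 2.x versions):

```scala
// Backed by the SessionCatalog and FunctionRegistry described above.
spark.catalog.listTables().show()              // temp views and persistent tables
spark.catalog.listFunctions().show(5)          // registered built-in and user functions
println(spark.catalog.tableExists("student"))  // name lookup in the current namespace
```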

Rule Framework

Logical‑plan transformations are rule‑based. RuleExecutor holds a sequence of Batch objects, each bundling a set of rules; a rule rewrites the plan through its apply(plan: TreeType) method, and RuleExecutor.execute runs the batches in order, iterating each one until a fixed point or its iteration limit is reached.
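A rule is just a function from plan to plan. A minimal, hypothetical example that drops Filter nodes whose condition is the literal true, mirroring the shape of Spark's built-in rules:

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.BooleanType

// Hypothetical rule: a Filter whose condition is literally `true` is a no-op,
// so replace it with its child.
object RemoveTrivialFilter extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(Literal(true, BooleanType), child) => child
  }
}
```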

Analyzed LogicalPlan Detailed Steps

Apply ResolveRelations to look up tables in SessionCatalog and insert alias nodes.

Resolve the expressions in filter predicates; attribute references start out unresolved.

Perform implicit type casting (e.g., converting literal 18 to bigint).

Resolve attribute references in the Project node, completing the analyzed plan.
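The same result can be reproduced by running the Analyzer over the parsed plan from earlier (a sketch using internal APIs; output abbreviated and version-dependent):

```scala
val analyzed = spark.sessionState.analyzer.execute(unresolved)
println(analyzed)
// Project [name#1]
// +- Filter (age#2L > cast(18 as bigint))   <- implicit cast inserted by the Analyzer
//    +- SubqueryAlias student
//       +- ...
```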

Optimizer Overview & Rule System

The optimizer also relies on batches of rules executed by a RuleExecutor. Spark’s optimizer defines 16 batches (Spark 2.1), each targeting specific transformations such as subquery elimination, expression replacement, constant folding, and operator push‑down.

Batch Finish Analysis : Ensures correct results (e.g., EliminateSubqueryAliases, ReplaceExpressions, ComputeCurrentTime).

Batch Union : Merges adjacent Union nodes.

Batch Subquery : Optimizes subqueries via OptimizeSubqueries.

Batch ReplaceOperator : Replaces Intersect with left‑semi join, Except with left‑anti join, and rewrites Distinct as aggregates.

Batch Aggregate : Removes literals and duplicate expressions from GroupBy.

Batch Operator Optimizations : Includes operator push‑down, operator fusion, and constant folding.

Batch CheckCartesianProducts : Detects Cartesian products and fails the query unless cross joins are explicitly enabled (spark.sql.crossJoin.enabled).

Batch OptimizeCodegen : Refines generated code, especially case‑when constructs.

Batch OptimizeMetadataOnlyQuery : Optimizes queries that only need partition metadata.

Batch PruneFileSourceTablePartitions : Pushes partition pruning down to the data source.

Batch UserProvidedOptimizers : Runs custom user‑defined optimization rules (see the sketch after this list).
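Custom rules reach the UserProvidedOptimizers batch through ExperimentalMethods. Reusing the sketch rule from the Rule Framework section:

```scala
// extraOptimizations is picked up by the UserProvidedOptimizers batch.
spark.experimental.extraOptimizations ++= Seq(RemoveTrivialFilter)
```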

Optimized LogicalPlan Generation

Remove unnecessary SubqueryAlias nodes; filters are applied directly to relations.

Add non‑null constraints derived from filter predicates.

Fold constant expressions and perform static type conversions.
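These effects can be observed by running the Optimizer over the analyzed plan (a sketch using internal APIs; output abbreviated):

```scala
val optimized = spark.sessionState.optimizer.execute(analyzed)
println(optimized)
// Project [name#1]
// +- Filter (isnotnull(age#2L) && (age#2L > 18))  <- alias gone, constraint added, cast folded
//    +- ... relation `student`
```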

The resulting optimized logical plan serves as the input for physical plan generation.

Written by Wang Zhiwu (Big Data Technology & Architecture), a big data expert dedicated to sharing big data technology.