SparkSQL Logical Plan, Analyzer, and Optimizer: An In‑Depth Overview
This article provides a comprehensive overview of SparkSQL's logical plan architecture, detailing the stages of logical plan creation, analysis, rule‑based optimization, and the underlying catalog and rule systems that transform SQL queries into efficient execution plans.
SparkSQL Logical Plan Overview
The logical plan stage is represented by the LogicalPlan class and consists of three main phases: (1) the SparkSqlParser converts syntax‑tree nodes into unresolved LogicalPlan nodes, (2) the Analyzer applies a series of rules to produce a resolved logical plan, and (3) the Optimizer applies optimization rules to improve efficiency while preserving correctness.
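The three phases can be pictured as a simple pipeline. The sketch below is a heavily simplified, hypothetical model — the real entry points are `SparkSqlParser.parsePlan`, `Analyzer.execute`, and `Optimizer.execute` — intended only to show how each phase's output feeds the next:

```scala
// Hypothetical three-phase pipeline: parse -> analyze -> optimize.
// These stand-in case classes are NOT Spark's actual plan classes.
sealed trait Plan
case class UnresolvedPlan(sql: String) extends Plan // output of the parser
case class AnalyzedPlan(sql: String) extends Plan   // output of the Analyzer
case class OptimizedPlan(sql: String) extends Plan  // output of the Optimizer

def parse(sql: String): Plan = UnresolvedPlan(sql)

def analyze(plan: Plan): Plan = plan match {
  case UnresolvedPlan(s) => AnalyzedPlan(s) // resolve relations/attributes via the catalog
  case other             => other
}

def optimize(plan: Plan): Plan = plan match {
  case AnalyzedPlan(s) => OptimizedPlan(s)  // apply rule batches
  case other           => other
}

val plan = optimize(analyze(parse("SELECT name FROM people WHERE age > 18")))
```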
LogicalPlan Introduction
Overview
LogicalPlan's parent class, QueryPlan, is organized into six functional modules covering input/output, basic attributes, string representation, normalization, expression handling, and constraints.
Basic Operations and Classification
LeafNode: Represents tables and command-related logic, including RunnableCommand for DDL/DML operations.
UnaryNode: Handles logical transformations such as filters, repartitioning, script-based transformations, object consumption, and basic relational operators (Project, Filter, Sort, etc.).
BinaryNode: Used for data combination operations like Join, set operations, and CoGroup.
Other Types: Includes Union, ObjectProducer, and EventTimeWatermark for streaming watermarks.
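The arity-based classification can be sketched with simplified stand-ins (these are not the real Catalyst classes, whose nodes also carry expressions, schemas, and more):

```scala
// Simplified stand-ins for Catalyst's node-arity hierarchy.
sealed trait LogicalPlan { def children: Seq[LogicalPlan] }
trait LeafNode extends LogicalPlan { final def children: Seq[LogicalPlan] = Nil }
trait UnaryNode extends LogicalPlan {
  def child: LogicalPlan
  final def children: Seq[LogicalPlan] = Seq(child)
}
trait BinaryNode extends LogicalPlan {
  def left: LogicalPlan
  def right: LogicalPlan
  final def children: Seq[LogicalPlan] = Seq(left, right)
}

case class Relation(name: String) extends LeafNode                        // e.g. a table
case class Filter(cond: String, child: LogicalPlan) extends UnaryNode     // row filtering
case class Join(left: LogicalPlan, right: LogicalPlan) extends BinaryNode // data combination
```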
AstBuilder Mechanism: Generating Unresolved LogicalPlan
The entry point visitSingleStatement recursively traverses the parse tree. When a node can be mapped to a LogicalPlan, it is passed upward. At QuerySpecificationContext, the FromClauseContext is visited first to create an UnresolvedRelation, after which withQuerySpecification extends the plan with filter and projection nodes.
Generate the table LogicalPlan by creating an UnresolvedRelation when a TableNameContext is encountered.
Add filter logic: expressions from BooleanDefaultContext become filter conditions, producing a Filter node combined with the relation.
Add column‑pruning: Named expressions from NamedExpressionSeqContext are transformed into a Project node that sits above the filtered plan.
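For a query like `SELECT name FROM people WHERE age > 18`, the three steps above yield a Project node over a Filter node over an UnresolvedRelation. A toy sketch of that tree shape (hypothetical case classes, not Catalyst's):

```scala
// Toy model of the unresolved tree AstBuilder produces.
sealed trait Plan
case class UnresolvedRelation(table: String) extends Plan          // step 1: FROM clause
case class Filter(condition: String, child: Plan) extends Plan     // step 2: WHERE clause
case class Project(columns: Seq[String], child: Plan) extends Plan // step 3: SELECT list

val plan = Project(Seq("name"), Filter("age > 18", UnresolvedRelation("people")))
```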
Analyzed LogicalPlan Generation
The AST yields an unresolved tree composed mainly of UnresolvedRelation and UnresolvedAttribute. The Analyzer resolves these objects against the catalog, turning them into typed expressions.
Catalog System
The catalog acts as a hierarchical namespace for functions and metadata. Key components include GlobalTempViewManager (thread‑safe global view handling), FunctionResourceLoader (loads user‑defined and Hive functions), FunctionRegistry (function registration), and ExternalCatalog (persistent metadata management).
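The FunctionRegistry's role can be illustrated with a toy in-memory registry. This is a hypothetical sketch of the lookup behavior only — the real FunctionRegistry registers expression builders, not plain functions:

```scala
import scala.collection.mutable

// Toy function registry: case-insensitive registration and lookup
// (hypothetical API, simplified from Spark's FunctionRegistry).
class FunctionRegistry {
  private val fns = mutable.Map.empty[String, Seq[Int] => Int]
  def register(name: String, fn: Seq[Int] => Int): Unit = fns(name.toLowerCase) = fn
  def lookup(name: String): Option[Seq[Int] => Int] = fns.get(name.toLowerCase)
}

val registry = new FunctionRegistry
registry.register("ADD", args => args.sum)
val result = registry.lookup("add").map(f => f(Seq(1, 2, 3)))
```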
Rule Framework
Logical-plan transformations are rule-based. RuleExecutor holds a sequence of Batch objects; each batch contains a set of rules applied in order, with `RuleExecutor.execute(plan: TreeType)` driving the batches and each rule transforming the plan through its `apply(plan: TreeType)` method.
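A minimal sketch of the Rule/Batch/RuleExecutor pattern, with each batch iterated to a fixed point (simplified from Catalyst's design, which also supports a run-once strategy):

```scala
// Rules transform a plan; batches group rules; the executor runs each
// batch repeatedly until the plan stops changing or maxIterations is hit.
trait Rule[T] { def apply(plan: T): T }
case class Batch[T](name: String, maxIterations: Int, rules: Seq[Rule[T]])

class RuleExecutor[T](batches: Seq[Batch[T]]) {
  def execute(plan: T): T = batches.foldLeft(plan) { (start, batch) =>
    var current = start
    var iteration = 0
    var changed = true
    while (changed && iteration < batch.maxIterations) {
      val next = batch.rules.foldLeft(current)((p, rule) => rule(p))
      changed = next != current
      current = next
      iteration += 1
    }
    current
  }
}

// Toy rule over Int "plans": halve even numbers until the value is odd.
val halve = new Rule[Int] { def apply(n: Int): Int = if (n % 2 == 0) n / 2 else n }
val executor = new RuleExecutor[Int](Seq(Batch("demo", 100, Seq(halve))))
```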
Analyzed LogicalPlan Detailed Steps
Apply ResolveRelations to look up tables in SessionCatalog and insert alias nodes.
Analyze filter predicates; attribute references in the condition (e.g., the age column) are initially unresolved, while literals such as 18 are already concrete.
Perform implicit type casting (e.g., converting literal 18 to bigint).
Resolve attribute references in the Project node, completing the analyzed plan.
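The implicit cast in step 3 can be sketched as a small widening function over toy types (hypothetical; Spark's actual coercion rules cover far more cases):

```scala
// Toy implicit widening: an Int literal compared against a bigint
// column is cast to Long, mirroring step 3 above.
sealed trait DataType
case object IntType extends DataType
case object LongType extends DataType

case class Literal(value: Any, dataType: DataType)

def widenTo(lit: Literal, target: DataType): Literal = (lit.dataType, target) match {
  case (IntType, LongType) => Literal(lit.value.asInstanceOf[Int].toLong, LongType)
  case _                   => lit // already compatible in this toy model
}

val cast = widenTo(Literal(18, IntType), LongType)
```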
Optimizer Overview & Rule System
The optimizer also relies on batches of rules executed by a RuleExecutor. Spark’s optimizer defines 16 batches (Spark 2.1), each targeting specific transformations such as subquery elimination, expression replacement, constant folding, and operator push‑down.
Batch Finish Analysis: Ensures correct results (e.g., EliminateSubqueryAliases, ReplaceExpressions, ComputeCurrentTime).
Batch Union: Merges adjacent Union nodes.
Batch Subquery: Optimizes subqueries via OptimizeSubqueries.
Batch Replace Operators: Replaces Intersect with a left-semi join and Except with a left-anti join, and rewrites Distinct as an Aggregate.
Batch Aggregate: Removes literals and duplicate expressions from GroupBy.
Batch Operator Optimizations: Includes operator push-down, operator fusion, and constant folding.
Batch Check Cartesian Products: Detects unintended Cartesian products.
Batch OptimizeCodegen: Refines generated code, especially case-when constructs.
Batch OptimizeMetadataOnlyQuery: Optimizes queries that only need partition metadata.
Batch PruneFileSourceTablePartitions: Pushes partition pruning down to the file data source.
Batch User Provided Optimizers: Allows custom user-defined optimization rules.
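As a concrete example, the constant folding performed under the operator-optimization batch can be sketched over a toy expression tree (simplified; Spark's ConstantFolding works on Catalyst expressions):

```scala
// Toy constant folding: collapse Add(Lit, Lit) bottom-up; column
// references block folding and are left in place.
sealed trait Expr
case class Lit(value: Int) extends Expr
case class Col(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

def fold(expr: Expr): Expr = expr match {
  case Add(l, r) =>
    (fold(l), fold(r)) match {
      case (Lit(a), Lit(b)) => Lit(a + b) // both sides constant: evaluate now
      case (fl, fr)         => Add(fl, fr)
    }
  case other => other
}

val folded = fold(Add(Col("x"), Add(Lit(1), Lit(2))))
```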
Optimized LogicalPlan Generation
Remove unnecessary SubqueryAlias nodes; filters are applied directly to relations.
Add non‑null constraints derived from filter predicates.
Fold constant expressions and perform static type conversions.
The resulting optimized logical plan serves as the input for physical plan generation.
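The alias removal in the first step can be sketched as a recursive rewrite over toy nodes, in the spirit of EliminateSubqueryAliases:

```scala
// Toy alias elimination: strip SubqueryAlias wrappers so filters sit
// directly on relations.
sealed trait Plan
case class Relation(name: String) extends Plan
case class SubqueryAlias(alias: String, child: Plan) extends Plan
case class Filter(condition: String, child: Plan) extends Plan

def eliminateAliases(plan: Plan): Plan = plan match {
  case SubqueryAlias(_, child) => eliminateAliases(child) // drop the wrapper
  case Filter(cond, child)     => Filter(cond, eliminateAliases(child))
  case leaf                    => leaf
}

val optimized = eliminateAliases(Filter("age > 18", SubqueryAlias("p", Relation("people"))))
```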
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.