Understanding Spark SQL Physical Execution Plans and Optimization Techniques
This article explains Spark SQL's physical execution plan, detailing each operator, how to interpret the plan, and practical optimization tips for data warehouse developers to improve SQL performance and resource utilization.
As a data warehouse developer, mastering modeling, optimization, and SQL development is essential because SQL quality directly impacts resource utilization and cluster performance.
Like other SQL engines, Spark SQL provides an execution plan, which shows how a query will actually run and gives us a basis for optimization.
However, for those unfamiliar with execution plans, especially without deep Spark Core experience, the plan can be hard to understand.
This article introduces each element of the execution plan in detail.
Environment
The following execution plan is based on this SQL:
The SQL counts each user's orders in 2022 and joins another table for detailed information (note: using * is discouraged in production; list all fields explicitly).
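The original query is not reproduced in this copy of the article; the following is a minimal sketch of a query matching that description, with hypothetical table and column names (dw.user_order, dw.user_info, pdate, and so on are illustrative, not from the original):

```sql
-- Sketch: count each user's 2022 orders, then join a dimension table.
-- All table and column names are assumptions for illustration.
SELECT u.user_id,
       u.user_name,
       o.order_cnt
FROM (
    SELECT user_id,
           COUNT(1) AS order_cnt
    FROM dw.user_order              -- assumed fact table
    WHERE pdate >= '2022-01-01'
      AND pdate <= '2022-12-31'     -- assumed partition column
    GROUP BY user_id
) o
JOIN dw.user_info u                 -- assumed dimension table
  ON o.user_id = u.user_id;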
Execution Plan
We can obtain the physical execution plan by calling explain with a mode argument (or the EXPLAIN statement in SQL).
The formatted mode returns a more readable, structured result; the cost mode includes statistics such as the data size at each step, when available; the codegen mode prints the generated Java code.
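For reference, the three modes can be invoked directly from SQL (query bodies abbreviated; the elided parts stand for any query):

```sql
EXPLAIN FORMATTED SELECT user_id, COUNT(1) FROM dw.user_order GROUP BY user_id;
EXPLAIN COST    SELECT ...;   -- plan annotated with statistics, if available
EXPLAIN CODEGEN SELECT ...;   -- the generated Java code
```

The same modes are available on DataFrames, e.g. df.explain("formatted") in Spark 3.x.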
We can also view the physical plan graphically in the Spark UI:
Below we explain the meaning of each term.
CollapseCodegenStages
Each blue box in the diagram represents a whole-stage codegen unit; in text plans these are marked with an asterisk and an id, e.g. *(1). Their boundaries usually align with Spark stage boundaries, since shuffles break both, though operators that do not support codegen can also split a stage into several codegen units.
CollapseCodegenStages merges compatible operators to reduce function call overhead, improving resource utilization. Operators that cannot be merged, such as Exchange (shuffle), are excluded.
Now we examine the meaning of different operators.
Scan parquet
This operator reads data from Parquet, selecting only required columns. PartitionFilters indicate partition pruning, and PushedFilters show predicates pushed down to Parquet, reducing data transferred.
If columns are sorted, pushdown can greatly improve performance because Parquet can skip whole row groups using min/max statistics.
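A sketch of where the two filter kinds show up (table and column names are illustrative): the partition predicate becomes a PartitionFilter and prunes directories before any file is opened, while the data-column predicate becomes a PushedFilter that Parquet can evaluate against row-group min/max statistics.

```sql
-- Sketch: partition pruning vs. predicate pushdown (assumed names).
SELECT event_id
FROM dw.user_order
WHERE pdate = '2022-04-01'    -- partition column -> PartitionFilters (pruning)
  AND event_id > 1000;        -- data column      -> PushedFilters (row-group skipping)
```

If the data is written sorted or clustered by event_id, the min/max ranges of row groups barely overlap, and pushdown can skip most of the file.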
Filter
The Filter operator applies predicates, but Catalyst may push down predicates, combine filters, infer additional filters, or prune redundant ones.
Project
Project selects columns; optimizations include ColumnPruning, CollapseProject, and PushProjectionThroughUnion.
Exchange
Exchange represents shuffle and includes the chosen Partitioner (e.g., HashPartitioner, RoundRobinPartitioning, SinglePartition, RangePartitioning).
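If you want to see (or control) a specific Exchange, Spark's partitioning hints introduce one explicitly. A sketch, assuming the same hypothetical table; REPARTITION is available from Spark 2.4 and REPARTITION_BY_RANGE in Spark 3.x:

```sql
-- Produces Exchange hashpartitioning(user_id, 200):
SELECT /*+ REPARTITION(200, user_id) */ *
FROM dw.user_order;

-- Produces Exchange rangepartitioning(pdate, 100):
SELECT /*+ REPARTITION_BY_RANGE(100, pdate) */ *
FROM dw.user_order;
```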
HashAggregate
HashAggregate performs aggregation, typically in two phases: a local (partial) aggregate before the shuffle and a global aggregate after it, shown in plans as partial_* functions followed by the final function.
BroadcastHashJoin & BroadcastExchange
Broadcast join is used when one side is small; it always appears together with BroadcastExchange. Other join types include SortMergeJoin and ShuffleHashJoin.
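Spark chooses a broadcast join automatically when one side's estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default); it can also be requested explicitly with a hint. A sketch against the hypothetical tables above:

```sql
-- Force a broadcast of the (assumed small) dimension table:
SELECT /*+ BROADCAST(u) */ o.user_id, u.user_name
FROM dw.user_order o
JOIN dw.user_info u
  ON o.user_id = u.user_id;
```

The resulting plan contains a BroadcastExchange feeding a BroadcastHashJoin instead of two shuffles feeding a SortMergeJoin.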
Thoughts
Consider the following simple SQL that filters by partition and counts records. Its plan shows partition pruning, Parquet scan, a local partial count, a single‑partition exchange, and a final global count.
*(2) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
      +- *(1) Project
         +- *(1) FileScan parquet wrk.test_funnel_log_parquet_properties_20220401_20220410[event_id#56,pdate#57] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[hdfs://ddmcNS/user/hive/warehouse/wrk.db/test_funnel_log_parquet_properti..., PartitionCount: 44, PartitionFilters: [isnotnull(pdate#57), (pdate#57 >= 2022-04-01), (pdate#57 <= 2022-04-04), event_id#56 INSET (... )], PushedFilters: [], ReadSchema: struct<>

The plan confirms that Spark first filters by partition, reads the Parquet files, performs a local count per task, shuffles to a single partition, and then aggregates globally.
Parquet footers already store per-row-group row counts, yet in this plan Spark still scans the data to compute the count, which suggests an optimization opportunity.
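Spark 3.3 added exactly this optimization: simple COUNT/MIN/MAX aggregates can be answered from Parquet statistics when aggregate pushdown is enabled (it applies only to the DataSource V2 Parquet reader, and only when no data-column filters remain). A sketch, reusing the hypothetical table from earlier:

```sql
SET spark.sql.parquet.aggregatePushDown=true;

-- With pushdown, this count can be served from row-group metadata:
SELECT COUNT(1)
FROM dw.user_order
WHERE pdate = '2022-04-01';   -- partition filter only, so pushdown still applies
```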
Summary
In this article we introduced Spark SQL physical execution plans, the meaning of each operator, a practical example, and some personal reflections, in the hope of helping readers optimize their SQL efficiently.
About the ZCY Technology Team
ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.