Understanding Spark 3.0 Adaptive Query Execution (AQE) and Dynamic Partition Pruning (DPP)
This article explains the two most important Spark 3.0 features—Adaptive Query Execution and Dynamic Partition Pruning—detailing how AQE dynamically optimizes join strategies, partition coalescing, and skew handling, while DPP reduces I/O by pruning irrelevant fact‑table partitions at runtime.
Introduction
Spark 3.0 has been released for a long time, bringing many exciting new features such as Dynamic Partition Pruning (DPP), Adaptive Query Execution (AQE), accelerator‑aware scheduling, catalog‑aware data source API, vectorization in SparkR, and support for Hadoop 3, JDK 11, Scala 2.12, etc.
The two most important features are AQE and DPP, which are described below.
AQE (Adaptive Query Execution)
AQE is a dynamic optimization mechanism for Spark SQL that can adjust the logical and physical plan at runtime based on shuffle‑map statistics.
It is enabled by setting spark.sql.adaptive.enabled to true (default false in Spark 3.0).
Before AQE, Spark relied on rule‑based optimization (RBO) and cost‑based optimization (CBO). RBO uses heuristic rules, while CBO uses statistics to choose the cheapest plan.
AQE provides three main capabilities:
Join strategy adjustment Automatic partition coalescing Automatic skew handlingJoin strategy adjustment
Broadcast Hash Join (BHJ) is preferred when the small side fits in memory. AQE can correct mis‑estimated broadcast sizes and switch to BHJ when appropriate.
Example images illustrate how AQE can turn a regular shuffle join into a broadcast join after runtime statistics are collected.
When Spark estimates the small side below the broadcast threshold, it chooses BHJ; however, estimation errors can occur, and AQE fixes them using accurate runtime statistics.
Automatic partition coalescing
Shuffle is often the performance bottleneck. AQE can merge small partitions based on parameters such as spark.sql.adaptive.advisoryPartitionSizeInBytes and spark.sql.adaptive.coalescePartitions.minPartitionNum, reducing the number of reduce tasks.
Illustrative example shows a query with an initial shuffle partition count of 5 being reduced to 3 tasks after AQE.
After AQE, only three reduce tasks are generated.
Automatic skew handling
Data skew in join keys can cause long‑running tasks. AQE detects skewed partitions using parameters like spark.sql.adaptive.skewJoin.skewedPartitionFactor, spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes, and splits them into smaller sub‑partitions.
Images demonstrate how a heavily skewed partition is split and processed in parallel, improving overall query latency.
DPP (Dynamic Partition Pruning)
DPP prunes unnecessary partitions of a fact table at runtime based on join predicates with a dimension table, reducing I/O and improving performance.
It works only when the fact table is partitioned, the join is an equality join, and the filtered dimension side is smaller than the broadcast threshold ( spark.sql.autoBroadcastJoinThreshold).
Typical usage example:
SELECT * FROM dim
JOIN fact
ON (dim.col = fact.col)
WHERE dim.col = 'dummy'These two features—AQE and DPP—are the most important enhancements in Spark 3.0.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
