Big Data 10 min read

Understanding Spark 3.0 Adaptive Query Execution (AQE) and Dynamic Partition Pruning (DPP)

This article explains the two most important Spark 3.0 features—Adaptive Query Execution and Dynamic Partition Pruning—detailing how AQE dynamically optimizes join strategies, partition coalescing, and skew handling, while DPP reduces I/O by pruning irrelevant fact‑table partitions at runtime.

Big Data Technology & Architecture

Dec 21, 2021

Understanding Spark 3.0 Adaptive Query Execution (AQE) and Dynamic Partition Pruning (DPP)

Introduction

Spark 3.0 has been released for a long time, bringing many exciting new features such as Dynamic Partition Pruning (DPP), Adaptive Query Execution (AQE), accelerator‑aware scheduling, catalog‑aware data source API, vectorization in SparkR, and support for Hadoop 3, JDK 11, Scala 2.12, etc.

The two most important features are AQE and DPP, which are described below.

AQE (Adaptive Query Execution)

AQE is a dynamic optimization mechanism for Spark SQL that can adjust the logical and physical plan at runtime based on shuffle‑map statistics.

It is enabled by setting spark.sql.adaptive.enabled to true (default false in Spark 3.0).

Before AQE, Spark relied on rule‑based optimization (RBO) and cost‑based optimization (CBO). RBO uses heuristic rules, while CBO uses statistics to choose the cheapest plan.

AQE provides three main capabilities:

Join strategy adjustment

Automatic partition coalescing

Automatic skew handling

Join strategy adjustment

Broadcast Hash Join (BHJ) is preferred when the small side fits in memory. AQE can correct mis‑estimated broadcast sizes and switch to BHJ when appropriate.

Example images illustrate how AQE can turn a regular shuffle join into a broadcast join after runtime statistics are collected.

When Spark estimates the small side below the broadcast threshold, it chooses BHJ; however, estimation errors can occur, and AQE fixes them using accurate runtime statistics.

Automatic partition coalescing

Shuffle is often the performance bottleneck. AQE can merge small partitions based on parameters such as spark.sql.adaptive.advisoryPartitionSizeInBytes and spark.sql.adaptive.coalescePartitions.minPartitionNum, reducing the number of reduce tasks.

Illustrative example shows a query with an initial shuffle partition count of 5 being reduced to 3 tasks after AQE.

After AQE, only three reduce tasks are generated.

Automatic skew handling

Data skew in join keys can cause long‑running tasks. AQE detects skewed partitions using parameters like spark.sql.adaptive.skewJoin.skewedPartitionFactor, spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes, and splits them into smaller sub‑partitions.

Images demonstrate how a heavily skewed partition is split and processed in parallel, improving overall query latency.

DPP (Dynamic Partition Pruning)

DPP prunes unnecessary partitions of a fact table at runtime based on join predicates with a dimension table, reducing I/O and improving performance.

It works only when the fact table is partitioned, the join is an equality join, and the filtered dimension side is smaller than the broadcast threshold ( spark.sql.autoBroadcastJoinThreshold).

Typical usage example:

SELECT * FROM dim 
JOIN fact 
ON (dim.col = fact.col) 
WHERE dim.col = 'dummy'

These two features—AQE and DPP—are the most important enhancements in Spark 3.0.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data sql-optimization Spark Adaptive Query Execution Dynamic Partition Pruning

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.