
What’s New in Apache Spark 3.0? Explore Dynamic Partition Pruning, AQE, and More

Apache Spark 3.0, released after a 21‑month development cycle, introduces dynamic partition pruning, adaptive query execution, accelerator‑aware scheduling, DataSource V2, enhanced pandas UDFs, new join hints, richer monitoring, ANSI‑SQL compatibility, SparkR vectorization, Kafka header support, and numerous platform upgrades, all backed by over 3,400 resolved issues.

dbaplus Community

Jun 20, 2020

Release Overview

Apache Spark 3.0.0 was officially released just before the Spark + AI Summit, concluding a 21‑month development effort that began on 2 Oct 2018. The release followed two preview versions and three voting rounds (RC1, RC2, RC3), and it resolves more than 3,400 JIRA issues across Spark’s components.

Key New Features

Dynamic Partition Pruning

Adaptive Query Execution (AQE)

Accelerator‑aware Scheduling

DataSource API V2

Enhanced pandas UDFs with Python type hints

Comprehensive join hints

New built‑in functions for Scala

Improved monitoring UI and EXPLAIN command

Better ANSI‑SQL compatibility

SparkR vectorization using Apache Arrow

Kafka streaming support for message headers

GPU/FPGA scheduling, Kubernetes enhancements, JDK 11, Scala 2.12, Hadoop 3.x, Python 3.x, and other platform updates

Dynamic Partition Pruning

This optimization uses the result of a filter on one side of a join, typically a small dimension table, to skip partitions on the other side at runtime. For example, a query joining dim_iteblog and fact_iteblog can skip scanning large portions of the fact table, yielding up to a 33× speedup and 2‑18× gains on 60 of 102 TPC‑DS queries (see SPARK‑11150, SPARK‑28888).
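The query shape that benefits can be sketched as follows. This is a minimal local illustration, not a benchmark: the table names come from the example above, while the columns (`part_id`, `label`), the row counts, and the object name are assumptions made up for the sketch. Whether the optimizer actually applies pruning depends on table statistics and its own heuristics.

```scala
import org.apache.spark.sql.SparkSession

object DppSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dpp-sketch")
      // On by default in Spark 3.0; set here only to make the dependency explicit.
      .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical schemas: a fact table partitioned by part_id and a small dimension table.
    spark.range(0, 100000)
      .selectExpr("id", "id % 100 AS part_id")
      .write.mode("overwrite").partitionBy("part_id").saveAsTable("fact_iteblog")
    Seq((1L, "a"), (2L, "b"), (3L, "c")).toDF("part_id", "label")
      .write.mode("overwrite").saveAsTable("dim_iteblog")

    // The filter sits on the dimension side; the join key is the fact table's partition column.
    val joined = spark.sql(
      """SELECT f.id, d.label
        |FROM fact_iteblog f
        |JOIN dim_iteblog d ON f.part_id = d.part_id
        |WHERE d.label = 'a'""".stripMargin)

    // If the rule fires, the fact-table scan lists a dynamic pruning expression
    // among its partition filters in the printed plan.
    joined.explain()
    spark.stop()
  }
}
```

Inspecting the `explain()` output is the simplest way to check whether pruning was applied to a given query.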

Adaptive Query Execution (AQE)

AQE lets Spark SQL adjust execution plans at runtime using statistics collected during execution. Building on the ideas in SPARK‑9850, the feature provides three capabilities: dynamic coalescing of shuffle partitions, join‑strategy switching, and skew‑join optimization. It is disabled by default in Spark 3.0; enable it by setting spark.sql.adaptive.enabled=true. Benchmarks on a 1 TB TPC‑DS dataset show up to 8× speedup for query q77 and 2× for q5 (see SPARK‑23128).
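A minimal sketch of turning AQE on for a session is shown below. The configuration key in Spark 3.0 is `spark.sql.adaptive.enabled`; the two sub-feature keys are also part of Spark 3.0, while the sample aggregation and the object name are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

object AqeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("aqe-sketch")
      // Master switch for adaptive query execution (off by default in 3.0).
      .config("spark.sql.adaptive.enabled", "true")
      // Sub-features; both only take effect when the master switch is on.
      .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
      .config("spark.sql.adaptive.skewJoin.enabled", "true")
      .getOrCreate()

    // Any query with a shuffle is a candidate for re-planning at runtime.
    val counts = spark.range(0, 1000000)
      .selectExpr("id % 10 AS bucket")
      .groupBy("bucket")
      .count()

    counts.collect()
    // With AQE on, the physical plan is wrapped in an adaptive plan node.
    counts.explain()
    spark.stop()
  }
}
```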

Accelerator‑aware Scheduling

To better utilize GPUs, FPGAs, and TPUs, Spark 3.0 adds native support for accelerator‑aware task scheduling, aligning with efforts from Databricks, NVIDIA, Google, and Alibaba. The work is tracked in SPARK‑24615 and the SPIP document “Accelerator‑aware scheduling”.
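As a rough sketch of how an application requests accelerators: resources are declared through configuration, and each task can then look up the addresses assigned to it. The configuration keys and `TaskContext.resources()` are part of Spark 3.0; the discovery-script path, the object name, and the job itself are placeholders. The job only runs on a cluster whose executors actually expose GPUs, so it is meant to be submitted with spark-submit rather than run locally.

```scala
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.sql.SparkSession

object GpuSchedulingSketch {
  // Resource requests: one GPU per executor and one GPU per task.
  // The discovery script is a user-supplied program that reports GPU addresses.
  def gpuConf(discoveryScript: String): SparkConf = new SparkConf()
    .set("spark.executor.resource.gpu.amount", "1")
    .set("spark.executor.resource.gpu.discoveryScript", discoveryScript)
    .set("spark.task.resource.gpu.amount", "1")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gpu-scheduling-sketch")
      .config(gpuConf("/opt/spark/getGpus.sh")) // placeholder path
      .getOrCreate()

    // Each task reads the GPU addresses the scheduler assigned to it.
    val assigned = spark.sparkContext.parallelize(1 to 4, 4).map { _ =>
      TaskContext.get().resources()("gpu").addresses.mkString(",")
    }.collect()

    assigned.foreach(println)
    spark.stop()
  }
}
```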

DataSource API V2

DataSource V2 decouples data‑source APIs from higher‑level Spark components, improving extensibility and eliminating many limitations of V1 (e.g., reliance on SQLContext, limited push‑down, no transactional writes, no streaming support). The stable version ships with Spark 3.0, referenced by SPARK‑15689, SPARK‑25186, and SPARK‑22386.

Enhanced APIs and Functions

Spark 3.0 introduces new pandas UDF types (iterator‑of‑series → iterator‑of‑series, iterator‑of‑multiple‑series → iterator‑of‑series) and three pandas function APIs (grouped map, map, co‑grouped map). It also completes the set of join hints by adding SHUFFLE_MERGE, SHUFFLE_HASH, and SHUFFLE_REPLICATE_NL alongside the existing BROADCAST hint, and adds 32 new Scala built‑in functions, including map‑specific utilities such as transform_keys and map_zip_with.
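A short Scala sketch of a join hint and two of the map functions follows. `Dataset.hint`, `transform_keys`, and `map_zip_with` are Spark 3.0 APIs; the sample data, column names, and object name are invented for illustration. A hint is a request, and the planner may ignore it when the strategy cannot be applied.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit, map_zip_with, transform_keys}

object HintsAndMapsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("hints-and-maps-sketch")
      .getOrCreate()
    import spark.implicits._

    val orders = Seq((1, "apple"), (2, "pear")).toDF("user_id", "item")
    val users = Seq((1, "ann"), (2, "bob")).toDF("user_id", "name")

    // Request a shuffle hash join for the `users` side and print the chosen plan.
    orders.join(users.hint("SHUFFLE_HASH"), "user_id").explain()

    val maps = Seq(
      (Map("a" -> 1, "b" -> 2), Map("a" -> 10, "b" -> 20))
    ).toDF("m1", "m2")

    maps.select(
      // Rewrite every key of m1 by appending a suffix; values are untouched.
      transform_keys(col("m1"), (k, _) => concat(k, lit("_x"))).as("renamed"),
      // Merge m1 and m2 key by key, adding the two values for each key.
      map_zip_with(col("m1"), col("m2"), (_, v1, v2) => v1 + v2).as("summed")
    ).show(false)

    spark.stop()
  }
}
```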

Monitoring Improvements

The Structured Streaming UI has been redesigned to show aggregate job information and detailed metrics (input rate, process rate, batch duration, etc.). The EXPLAIN command now supports a formatted output mode and can dump plans to files. Observable metrics let you attach named aggregate functions to a query; Spark emits their values as an event when a batch query finishes or a streaming epoch completes.
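The formatted explain mode and observable metrics can be tried with a few lines of Scala. This is a sketch under assumptions: `explain("formatted")` and `Dataset.observe` are Spark 3.0 APIs, while the metric name `stats`, the sample data, and the object name are made up. The observed values are delivered to listeners (a `QueryExecutionListener` for batch queries, a `StreamingQueryListener` for streaming), which are omitted here for brevity.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit, max}

object ExplainObserveSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explain-observe-sketch")
      .getOrCreate()

    val df = spark.range(0, 1000).selectExpr("id", "id % 7 AS k")

    // Formatted mode prints a compact plan outline followed by per-operator details.
    df.groupBy("k").count().explain("formatted")

    // Attach two named aggregates; they are computed while the query runs
    // and reported to registered listeners once it completes.
    val observed = df.observe(
      "stats",
      count(lit(1)).as("rows"),
      max("id").as("max_id"))

    observed.collect()
    spark.stop()
  }
}
```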

ANSI‑SQL Compatibility

To narrow gaps with PostgreSQL and the ANSI‑SQL:2011 standard, Spark 3.0 addresses 231 sub‑issues (SPARK‑27764), adding missing functions, handling reserved keywords, and improving compliance.
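One visible effect of the ANSI work is the `spark.sql.ansi.enabled` switch, which is off by default in Spark 3.0. The sketch below compares integer overflow under both settings: with the default setting the addition wraps around silently, while ANSI mode raises an arithmetic error instead. The helper and object names are assumptions for illustration.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

object AnsiSketch {
  // Returns true if evaluating an INT overflow raises an error under the given setting.
  def overflowFails(spark: SparkSession, ansiEnabled: Boolean): Boolean = {
    spark.conf.set("spark.sql.ansi.enabled", ansiEnabled.toString)
    Try(spark.sql("SELECT 2147483647 + 1 AS v").collect()).isFailure
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ansi-sketch")
      .getOrCreate()

    println(s"ANSI off, overflow fails: ${overflowFails(spark, ansiEnabled = false)}")
    println(s"ANSI on, overflow fails: ${overflowFails(spark, ansiEnabled = true)}")
    spark.stop()
  }
}
```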

SparkR Vectorization

SparkR now leverages Apache Arrow for vectorized reads and writes, eliminating costly JVM‑based serialization and achieving performance gains of several thousand times for R‑to‑Spark data exchanges (see SPARK‑26759).

Kafka Streaming: includeHeaders

Kafka 0.11 introduced record headers (KIP‑82). Spark 3.0 adds the includeHeaders option to the Kafka source, enabling header consumption. Example usage:

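import spark.implicits._  // encoders for .as[...]; assumes a SparkSession named `spark`

// Subscribe to one topic and ask the Kafka source to expose record headers.
val df = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .option("includeHeaders", "true")
  .load()

// `headers` is an array of (key, value) structs, so it maps to Array[(String, Array[Byte])].
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
  .as[(String, String, Array[(String, Array[Byte])])]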

Other Notable Changes

GPU scheduling support on YARN and Kubernetes.

Removal of Scala 2.11 support; default to Scala 2.12 (SPARK‑26132).

Support for Hadoop 3.x (SPARK‑23710) and JDK 11 (SPARK‑24417).

Python 2.x dropped; Spark 3.0 runs on Python 3.x (SPARK‑27884).

Spark Graph now supports Cypher queries.

Event‑log rolling introduced (SPARK‑???).

These enhancements collectively make Spark 3.0 a more performant, flexible, and cloud‑ready engine for big‑data and AI workloads.


