Big Data 19 min read

11 Critical Pitfalls to Watch When Upgrading from Spark 3 to Spark 4

Spark 4.0 delivers 20‑50% performance gains and new features like Spark Connect, VARIANT types, and enhanced SQL, but it also introduces breaking changes such as mandatory JDK 17, dropping Scala 2.12, default ANSI mode, removal of Mesos, and altered JDBC type mappings, requiring careful planning and staged migration to avoid runtime failures.

Past Memory Big Data

Apr 13, 2026

11 Critical Pitfalls to Watch When Upgrading from Spark 3 to Spark 4

Hard prerequisites – cannot start without them

Spark 4.0 requires JDK 17 (or newer) as the minimum runtime. JDK 8/11 are unsupported, so CI/CD pipelines, Docker images, EMR/Dataproc clusters, and local development environments must be upgraded. The new JPMS module system may trigger InaccessibleObjectException; adding the JVM flag --add-opens mitigates this during early migration.

Only Scala 2.13 is supported. Required code changes:

Replace scala.collection.JavaConverters with scala.jdk.CollectionConverters.

Change collection conversion syntax from .to[List] to .to(List).

If parallel collections are used ( .par), add the scala-parallel-collections dependency.

Recommended workflow: migrate Scala code to 2.13 and validate on Spark 3.5 before upgrading Spark itself.

Python 3.9+ is required; Python 3.7/3.8 are no longer supported. Ensure all cluster nodes run the newer interpreter.

ANSI bomb – the most accident‑prone change

Spark 4.0 flips spark.sql.ansi.enabled from false to true by default. In Spark 3.x many illegal operations were silently handled; with ANSI mode on they now throw exceptions. 2147483647 + 1 – Spark 3.x returned -2147483648 (wrap‑around); Spark 4.0 throws SparkArithmeticException. CAST('abc' AS INT) – Spark 3.x returned null; Spark 4.0 throws SparkNumberFormatException. 10 / 0 – Spark 3.x returned null; Spark 4.0 throws SparkArithmeticException.

Implicit numeric truncation – silently truncated in Spark 3.x; Spark 4.0 now throws an exception.

Migration strategy (three phases):

Upgrade to Spark 4.0 with spark.sql.ansi.enabled=false so existing applications keep running.

In a test environment enable ANSI mode and fix each thrown exception.

Treat each ANSI exception as a data‑quality issue and perform root‑cause analysis.

For compatibility, use safe functions such as TRY_CAST, TRY_ADD, TRY_DIVIDE or explicit CASE WHEN handling.

More breaking changes to anticipate

CREATE TABLE default source

In Spark 3.x, CREATE TABLE without USING or STORED AS defaulted to Hive tables. Spark 4.0 defaults to the value of spark.sql.sources.default (usually Parquet). Restore old behavior with spark.sql.legacy.createHiveTableByDefault=true.

JDBC type‑mapping overhaul

Major databases see corrected mappings:

PostgreSQL TIMESTAMP WITH TIME ZONE now maps to TimestampType for both read and write.

MySQL SMALLINT maps to ShortType (was IntegerType); FLOAT maps to FloatType (was DoubleType).

SQL Server TINYINT maps to ShortType; DATETIMEOFFSET maps to TimestampType (was StringType).

Audit all JDBC source schemas before migration.

ORC/Parquet default codecs

ORC now defaults to zstd instead of snappy. Parquet’s LZ4 codec name changes from lz4raw to lz4_raw. Downstream systems that do not support zstd may fail to read data.

CTE precedence policy

The default for spark.sql.legacy.ctePrecedencePolicy changes from EXCEPTION to CORRECTED. In nested CTEs, inner definitions now override outer ones instead of raising an error.

Servlet API migration

Imports move from javax to jakarta. Custom Spark Web UI extensions or REST integrations must update import statements.

Infrastructure‑level adjustments

Mesos removal

Spark 4.0 no longer supports Apache Mesos. Choose one of Kubernetes (recommended), YARN, or Standalone for deployment.

Shuffle service backend

The external Shuffle Service storage backend switches from LevelDB to RocksDB, offering higher write throughput and space efficiency. Existing LevelDB state files are not automatically migrated.

Core dependency version jumps

Guava: 14.0.1 → 33.4.0‑jre (high risk, API incompatibility).

Jackson: 2.15.2 → 2.18.2 (medium risk).

Apache Arrow: 12.0.1 → 18.1.0 (medium risk).

Hadoop: 3.3.x → 3.4.1 (medium risk).

If your code uses Guava APIs such as Optional, Cache, or ListenableFuture, carefully audit shading rules and class‑path isolation.

Exciting new capabilities in Spark 4

Spark Connect

A lightweight client‑server architecture. The PySpark client is ~1.5 MB and does not need a full JVM driver. Multi‑language support now includes Go, Rust, and Swift. Enable with spark.api.mode="connect".

VARIANT type

A first‑class semi‑structured column type for JSON or nested maps, enabling efficient queries without flattening.

SQL enhancements

SQL UDFs for reusable functions.

PIPE syntax ( |>) for functional chaining.

SQL scripts with variables and control flow.

Collation support for language‑aware string comparison.

Session variables to help prevent SQL injection.

Python ecosystem upgrades

Native .plot() on DataFrames (Plotly‑based) without converting to Pandas.

Pure‑Python data source API for custom connectors.

Polymorphic UDTFs returning dynamic schemas.

UDF performance profiler.

Performance boost

Adaptive Query Execution improvements (more intelligent join strategy selection, reduced data skew).

Arrow 18 integration reduces JVM‑Python serialization overhead.

RocksDB state backend moves streaming state out of the JVM heap, lowering GC pressure.

Speculative execution defaults become more conservative ( multiplier=3, quantile=0.9), reducing wasted tasks.

Databricks benchmarks show 20‑50 % end‑to‑end speedup on diverse workloads.

Structured streaming maturity

Arbitrary Stateful Processing v2 ( transformWithState) with ValueState, ListState, MapState, TTL, and event‑time timers.

State data source enabling SQL‑like queries over streaming state.

Checkpoint format optimizations (SST reuse, snapshot management).

Migration checklist – actionable plan

Migrate Scala code to 2.13 and validate on Spark 3.5 (P0).

Upgrade build and runtime environments to JDK 17 (including CI/CD, Docker) (P0).

Upgrade to Spark 4.0 with spark.sql.ansi.enabled=false to keep applications running (P0).

Run full test suite and fix compile/runtime errors (P0).

Enable ANSI mode and fix data‑quality exceptions (P1).

Audit JDBC type‑mapping changes (P1).

Check CTE precedence, ORC/Parquet codec defaults, and formatting changes (P1).

Migrate from Mesos to Kubernetes/YARN/Standalone and verify Shuffle Service (P1).

Audit Fat‑Jar dependencies, especially Guava and Jackson (P2).

Update Servlet imports from javax to jakarta (P2).

Explore Spark Connect, VARIANT, and other new features (P3).

Key technical references

Apache Spark 4.0.0 Migration Guide – Official Documentation

Introducing Apache Spark 4.0 – Databricks Blog

Apache Spark 3 to Apache Spark 4 Migration – DZone

Upgrading from Spark 3.x to Spark 4.0: A Practical Guide – Sparking Scala

Apache Spark 4.0: A New Era for Scalable Machine Learning and AI – Databricks Community

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

migration Apache Spark JDK 17 VARIANT ANSI mode Scala 2.13 Spark 4 Spark Connect

Written by

Past Memory Big Data

A popular big-data architecture channel with over 100,000 developers. Publishes articles on Spark, Hadoop, Flink, Kafka and more. Visit the Past Memory Big Data blog at https://www.iteblog.com. Search "Past Memory" on Google or Baidu.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

Hard prerequisites – cannot start without them

ANSI bomb – the most accident‑prone change

More breaking changes to anticipate

CREATE TABLE default source

JDBC type‑mapping overhaul

ORC/Parquet default codecs

CTE precedence policy

Servlet API migration

Infrastructure‑level adjustments

Mesos removal

Shuffle service backend

Core dependency version jumps

Exciting new capabilities in Spark 4

Spark Connect

VARIANT type

SQL enhancements

Python ecosystem upgrades

Performance boost

Structured streaming maturity

Migration checklist – actionable plan

Key technical references

Past Memory Big Data

How this landed with the community

Was this worth your time?

0 Comments

Exciting new capabilities in Spark 4