11 Critical Pitfalls to Watch When Upgrading from Spark 3 to Spark 4
Spark 4.0 brings reported performance gains of 20-50% and new features such as Spark Connect, the VARIANT type, and enhanced SQL. It also introduces breaking changes: JDK 17 becomes mandatory, Scala 2.12 is dropped, ANSI mode is on by default, Mesos is removed, and JDBC type mappings change. Careful planning and a staged migration are needed to avoid runtime failures.
Hard prerequisites – cannot start without them
Spark 4.0 requires JDK 17 (or newer) as the minimum runtime. JDK 8 and 11 are unsupported, so CI/CD pipelines, Docker images, EMR/Dataproc clusters, and local development environments must be upgraded. Because JDK 17 strongly encapsulates JDK internals under the Java module system (JPMS), code or libraries that reflect into those internals may throw InaccessibleObjectException; adding --add-opens JVM flags for the affected packages mitigates this during early migration.
Only Scala 2.13 is supported. Required code changes:
Replace scala.collection.JavaConverters with scala.jdk.CollectionConverters.
Change collection conversion syntax from .to[List] to .to(List).
If parallel collections are used (.par), add the scala-parallel-collections dependency.
Recommended workflow: migrate Scala code to 2.13 and validate on Spark 3.5 before upgrading Spark itself.
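To size the Scala 2.13 work before touching any code, the three changes above can be turned into a rough source scan. The following is a minimal sketch, not a complete migration tool: the function name `scan_scala_source` and the regular expressions are assumptions made for illustration, and a regex scan will miss aliased imports or unusual formatting.

```python
import re

# One (pattern, hint) pair per change listed above.
SCALA_213_CHECKS = [
    (re.compile(r"scala\.collection\.JavaConverters"),
     "replace with scala.jdk.CollectionConverters"),
    (re.compile(r"\.to\[\w+\]"),
     "change .to[Collection] to .to(Collection)"),
    (re.compile(r"\.par\b"),
     "add the scala-parallel-collections dependency"),
]

def scan_scala_source(source: str) -> list:
    """Return (line_number, hint) pairs for lines that match a known 2.12 idiom."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, hint in SCALA_213_CHECKS:
            if pattern.search(line):
                findings.append((lineno, hint))
    return findings
```

Running it over each `.scala` file and collecting the hints gives a quick estimate of how much code needs attention before validating on Spark 3.5.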
Python 3.9+ is required; Python 3.7/3.8 are no longer supported. Ensure all cluster nodes run the newer interpreter.
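The JDK and Python minimums can be checked in a small preflight script that runs on each node or in CI. This is a sketch under stated assumptions: `parse_java_major` and `check_runtime` are hypothetical helper names, and the Java version string is expected to be supplied by the caller (for example, taken from the output of `java -version`).

```python
import re
import sys

MIN_JAVA_MAJOR = 17   # minimum JDK stated above
MIN_PYTHON = (3, 9)   # minimum Python stated above

def parse_java_major(version: str) -> int:
    """Extract the major version from a Java version string.

    Handles the legacy scheme ("1.8.0_292" -> 8) and the
    modern scheme ("17.0.9" -> 17).
    """
    parts = re.findall(r"\d+", version)
    if not parts:
        raise ValueError(f"unrecognized Java version: {version!r}")
    major = int(parts[0])
    if major == 1 and len(parts) > 1:
        return int(parts[1])
    return major

def check_runtime(java_version: str, python_version=sys.version_info[:2]) -> list:
    """Return a list of human-readable problems; an empty list means the node passes."""
    problems = []
    if parse_java_major(java_version) < MIN_JAVA_MAJOR:
        problems.append(f"Java {java_version} is older than JDK {MIN_JAVA_MAJOR}")
    if tuple(python_version) < MIN_PYTHON:
        problems.append(f"Python {python_version} is older than {MIN_PYTHON}")
    return problems
```

Failing the build when `check_runtime` returns anything keeps an outdated image from reaching the cluster.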
ANSI bomb – the most accident‑prone change
Spark 4.0 flips spark.sql.ansi.enabled from false to true by default. In Spark 3.x many illegal operations were silently handled; with ANSI mode on they now throw exceptions.
2147483647 + 1 – Spark 3.x returned -2147483648 (wrap-around); Spark 4.0 throws SparkArithmeticException.
CAST('abc' AS INT) – Spark 3.x returned null; Spark 4.0 throws SparkNumberFormatException.
10 / 0 – Spark 3.x returned null; Spark 4.0 throws SparkArithmeticException.
Implicit numeric truncation – silently truncated in Spark 3.x; Spark 4.0 now throws an exception.
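The difference between the two modes can be illustrated without a cluster. The sketch below is a plain-Python emulation of the behaviours listed above, not Spark code: the function names are invented for illustration, and Python's built-in exceptions stand in for SparkArithmeticException and SparkNumberFormatException.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def add_int(a: int, b: int, ansi: bool):
    """32-bit integer addition: wrap around in legacy mode, raise in ANSI mode."""
    result = a + b
    if INT_MIN <= result <= INT_MAX:
        return result
    if ansi:
        raise ArithmeticError("integer overflow")
    # Legacy behaviour: wrap to 32-bit two's complement.
    return (result - INT_MIN) % 2**32 + INT_MIN

def cast_to_int(text: str, ansi: bool):
    """String-to-int cast: None in legacy mode, raise in ANSI mode."""
    try:
        return int(text)
    except ValueError:
        if ansi:
            raise
        return None

def divide(a: float, b: float, ansi: bool):
    """Division: None on a zero divisor in legacy mode, raise in ANSI mode."""
    if b == 0:
        if ansi:
            raise ZeroDivisionError("division by zero")
        return None
    return a / b
```

With `ansi=False` the three calls quietly produce a wrapped value or `None`; with `ansi=True` each one surfaces as an error, which is why jobs that "worked" on Spark 3.x can start failing after the upgrade.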
Migration strategy (three phases):
Upgrade to Spark 4.0 with spark.sql.ansi.enabled=false so existing applications keep running.
In a test environment enable ANSI mode and fix each thrown exception.
Treat each ANSI exception as a data‑quality issue and perform root‑cause analysis.
For compatibility, use safe functions such as TRY_CAST, TRY_ADD, TRY_DIVIDE or explicit CASE WHEN handling.
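The TRY_* functions follow one pattern: run the strict operation and return NULL instead of failing. A minimal Python sketch of that pattern follows; the decorator `null_on_error` and the `strict_*` helpers are illustrative names, not Spark APIs, and `None` plays the role of SQL NULL.

```python
from functools import wraps

def null_on_error(strict_fn):
    """Wrap a strict operation so failures yield None, like the TRY_* family."""
    @wraps(strict_fn)
    def wrapper(*args):
        try:
            return strict_fn(*args)
        except (ArithmeticError, ValueError):
            return None
    return wrapper

def strict_add_int(a: int, b: int) -> int:
    result = a + b
    if not -2**31 <= result <= 2**31 - 1:
        raise ArithmeticError("integer overflow")
    return result

def strict_divide(a: float, b: float) -> float:
    return a / b  # raises ZeroDivisionError when b == 0

def strict_cast_int(text: str) -> int:
    return int(text)  # raises ValueError on non-numeric input

try_add = null_on_error(strict_add_int)
try_divide = null_on_error(strict_divide)
try_cast_int = null_on_error(strict_cast_int)
```

The wrapper hides the failure rather than fixing it, so it fits best where a null is an acceptable answer; elsewhere the root-cause analysis from the previous step still applies.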
More breaking changes to anticipate
CREATE TABLE default source
In Spark 3.x, CREATE TABLE without USING or STORED AS defaulted to Hive tables. Spark 4.0 defaults to the value of spark.sql.sources.default (usually Parquet). Restore old behavior with spark.sql.legacy.createHiveTableByDefault=true.
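Statements affected by this change can be located by scanning SQL scripts for CREATE TABLE without an explicit format clause. The following is a rough sketch: the helper name is hypothetical, and splitting on `;` is a simplification that will misbehave if string literals contain semicolons.

```python
import re

CREATE_TABLE = re.compile(
    r"\bCREATE\s+(?:OR\s+REPLACE\s+)?(?:EXTERNAL\s+)?TABLE\b", re.IGNORECASE
)
EXPLICIT_FORMAT = re.compile(r"\bUSING\b|\bSTORED\s+AS\b", re.IGNORECASE)

def implicit_format_statements(sql_script: str) -> list:
    """Return CREATE TABLE statements that name neither USING nor STORED AS."""
    flagged = []
    for statement in sql_script.split(";"):
        if CREATE_TABLE.search(statement) and not EXPLICIT_FORMAT.search(statement):
            # Collapse whitespace so multi-line statements print on one line.
            flagged.append(" ".join(statement.split()))
    return flagged
```

Each flagged statement either needs an explicit `USING` or `STORED AS` clause or depends on the legacy flag mentioned above.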
JDBC type‑mapping overhaul
Major databases see corrected mappings:
PostgreSQL TIMESTAMP WITH TIME ZONE now maps to TimestampType for both read and write.
MySQL SMALLINT maps to ShortType (was IntegerType); FLOAT maps to FloatType (was DoubleType).
SQL Server TINYINT maps to ShortType; DATETIMEOFFSET maps to TimestampType (was StringType).
Audit all JDBC source schemas before migration.
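One way to run that audit is to encode the listed changes as a lookup table and check exported column metadata against it. The sketch below covers only the mappings named above; the function name and the `(column_name, source_type)` input shape are assumptions, and `None` marks a previous mapping that the text does not state.

```python
# (database, source column type) -> (Spark 3.x type, Spark 4.0 type)
JDBC_MAPPING_CHANGES = {
    ("postgresql", "TIMESTAMP WITH TIME ZONE"): (None, "TimestampType"),
    ("mysql", "SMALLINT"): ("IntegerType", "ShortType"),
    ("mysql", "FLOAT"): ("DoubleType", "FloatType"),
    ("sqlserver", "TINYINT"): (None, "ShortType"),
    ("sqlserver", "DATETIMEOFFSET"): ("StringType", "TimestampType"),
}

def audit_columns(database: str, columns) -> list:
    """Return (name, source_type, old_spark_type, new_spark_type) for affected columns.

    columns: iterable of (column_name, source_type) pairs.
    """
    affected = []
    for name, source_type in columns:
        key = (database.lower(), source_type.upper())
        if key in JDBC_MAPPING_CHANGES:
            old, new = JDBC_MAPPING_CHANGES[key]
            affected.append((name, source_type, old, new))
    return affected
```

Columns that show up in the result are the ones whose downstream casts, schemas, and equality checks deserve a second look.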
ORC/Parquet default codecs
ORC now defaults to zstd instead of snappy. Parquet’s LZ4 codec name changes from lz4raw to lz4_raw. Downstream systems that do not support zstd may fail to read data.
CTE precedence policy
The default for spark.sql.legacy.ctePrecedencePolicy changes from EXCEPTION to CORRECTED. In nested CTEs, inner definitions now override outer ones instead of raising an error.
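The inner-wins rule is easiest to see on a query where the same CTE name is defined twice. The sketch below uses SQLite from Python's standard library purely as a stand-in engine, on the assumption that it resolves nested CTE names in the same inner-first way; it does not run Spark, and the query text is invented for illustration.

```python
import sqlite3

NESTED_CTE = """
WITH t AS (SELECT 1 AS x)
SELECT x FROM (
    WITH t AS (SELECT 2 AS x)   -- inner definition reuses the name t
    SELECT x FROM t
)
"""

def run_nested_cte() -> int:
    """Run the nested-CTE query on an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(NESTED_CTE).fetchone()[0]
    finally:
        conn.close()
```

Under the old EXCEPTION policy Spark rejected a query shaped like this; under CORRECTED it runs, and the inner `t` is the one the subquery reads, so results can change silently if the name clash was unintentional.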
Servlet API migration
Servlet API imports move from javax.servlet to jakarta.servlet. Custom Spark Web UI extensions or REST integrations must update their import statements accordingly.
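For codebases with many affected files, the import change can be applied mechanically. This is a minimal sketch assuming that only servlet imports should move; the function name is hypothetical, and other `javax.*` packages (such as `javax.sql`) are deliberately left untouched.

```python
import re

# Match only import lines whose package starts with javax.servlet.
SERVLET_IMPORT = re.compile(r"^(\s*import\s+)javax(\.servlet\b)", re.MULTILINE)

def migrate_servlet_imports(source: str):
    """Return (rewritten_source, number_of_imports_changed)."""
    return SERVLET_IMPORT.subn(r"\g<1>jakarta\g<2>", source)
```

The returned count makes it easy to report which files changed and to review the diff before committing.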
Infrastructure‑level adjustments
Mesos removal
Spark 4.0 no longer supports Apache Mesos. Choose one of Kubernetes (recommended), YARN, or Standalone for deployment.
Shuffle service backend
The external Shuffle Service storage backend (spark.shuffle.service.db.backend) switches its default from LevelDB to RocksDB, offering higher write throughput and space efficiency. Existing LevelDB state files are not automatically migrated.
Core dependency version jumps
Guava: 14.0.1 → 33.4.0‑jre (high risk, API incompatibility).
Jackson: 2.15.2 → 2.18.2 (medium risk).
Apache Arrow: 12.0.1 → 18.1.0 (medium risk).
Hadoop: 3.3.x → 3.4.1 (medium risk).
If your code uses Guava APIs such as Optional, Cache, or ListenableFuture, carefully audit shading rules and class‑path isolation.
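A quick first check is whether a fat JAR still bundles Guava under its original package name, since those classes can clash with the version Spark ships. The sketch below relies on a JAR being a ZIP archive; the helper names are hypothetical, and `build_demo_jar` exists only to make the example self-contained.

```python
import io
import zipfile

GUAVA_PREFIX = "com/google/common/"

def unshaded_guava_classes(jar) -> list:
    """List class files that sit in Guava's original package inside a JAR.

    jar: a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(jar) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.startswith(GUAVA_PREFIX) and name.endswith(".class")
        )

def build_demo_jar() -> io.BytesIO:
    """Build a tiny in-memory JAR for demonstration (contents are made up)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("com/google/common/cache/Cache.class", b"")
        archive.writestr("myshaded/com/google/common/base/Optional.class", b"")
        archive.writestr("com/example/App.class", b"")
    buffer.seek(0)
    return buffer
```

An empty result suggests Guava is either absent or relocated; a non-empty one points at classes that need shading or class-path isolation.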
Exciting new capabilities in Spark 4
Spark Connect
A lightweight client‑server architecture. The PySpark client is ~1.5 MB and does not need a full JVM driver. Multi‑language support now includes Go, Rust, and Swift. Enable with spark.api.mode="connect".
VARIANT type
A first‑class semi‑structured column type for JSON or nested maps, enabling efficient queries without flattening.
SQL enhancements
SQL UDFs for reusable functions.
PIPE syntax (|>) for functional chaining.
SQL scripts with variables and control flow.
Collation support for language‑aware string comparison.
Session variables to help prevent SQL injection.
Python ecosystem upgrades
Native .plot() on DataFrames (Plotly‑based) without converting to Pandas.
Pure‑Python data source API for custom connectors.
Polymorphic UDTFs returning dynamic schemas.
UDF performance profiler.
Performance boost
Adaptive Query Execution improvements (more intelligent join strategy selection, reduced data skew).
Arrow 18 integration reduces JVM‑Python serialization overhead.
RocksDB state backend moves streaming state out of the JVM heap, lowering GC pressure.
Speculative execution defaults become more conservative (spark.speculation.multiplier=3, up from 1.5; spark.speculation.quantile=0.9, up from 0.75), reducing wasted tasks.
Databricks benchmarks show 20-50% end-to-end speedup on diverse workloads.
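To see why the new speculation defaults waste fewer tasks, it helps to model the decision. The following is a simplified sketch of the rule, not Spark's scheduler code: a running task becomes a speculation candidate only once a `quantile` fraction of the stage has finished and the task has run longer than `multiplier` times the median successful duration. The function name is invented, and Spark's minimum-runtime floor and other details are omitted.

```python
import math
from statistics import median

def is_speculatable(finished, total_tasks, running_for,
                    multiplier=3.0, quantile=0.9) -> bool:
    """Decide whether a still-running task would be speculated.

    finished:    durations (seconds) of tasks that already succeeded
    total_tasks: number of tasks in the stage
    running_for: elapsed time (seconds) of the task in question
    """
    min_finished = max(math.floor(quantile * total_tasks), 1)
    if len(finished) < min_finished:
        return False  # too early in the stage to judge stragglers
    return running_for > multiplier * median(finished)
```

With a 10-second median, a task that has run for 20 seconds was a candidate under the older values (threshold 15 seconds) but is left alone under the new ones (threshold 30 seconds).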
Structured streaming maturity
Arbitrary Stateful Processing v2 (transformWithState) with ValueState, ListState, MapState, TTL, and event-time timers.
State data source enabling SQL‑like queries over streaming state.
Checkpoint format optimizations (SST reuse, snapshot management).
Migration checklist – actionable plan
Migrate Scala code to 2.13 and validate on Spark 3.5 (P0).
Upgrade build and runtime environments to JDK 17 (including CI/CD, Docker) (P0).
Upgrade to Spark 4.0 with spark.sql.ansi.enabled=false to keep applications running (P0).
Run full test suite and fix compile/runtime errors (P0).
Enable ANSI mode and fix data‑quality exceptions (P1).
Audit JDBC type‑mapping changes (P1).
Check CTE precedence, ORC/Parquet codec defaults, and formatting changes (P1).
Migrate from Mesos to Kubernetes/YARN/Standalone and verify Shuffle Service (P1).
Audit Fat‑Jar dependencies, especially Guava and Jackson (P2).
Update Servlet imports from javax to jakarta (P2).
Explore Spark Connect, VARIANT, and other new features (P3).
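For the first P0 rollout, the legacy switches mentioned in this article can be gathered into one place and passed to every job. The sketch below renders them as `spark-submit` arguments; the helper name is hypothetical, and `spark.sql.orc.compression.codec=snappy` is an assumption added here to pin the previous ORC codec, since the article does not name that setting itself.

```python
# Settings that keep Spark 3.x behaviour during the first phase of the migration.
LEGACY_COMPAT_CONF = {
    "spark.sql.ansi.enabled": "false",
    "spark.sql.legacy.createHiveTableByDefault": "true",
    "spark.sql.legacy.ctePrecedencePolicy": "EXCEPTION",
    "spark.sql.orc.compression.codec": "snappy",  # assumption, see lead-in
}

def spark_submit_flags(conf: dict) -> list:
    """Render a configuration dict as a flat list of --conf arguments."""
    flags = []
    for key in sorted(conf):
        flags += ["--conf", f"{key}={conf[key]}"]
    return flags
```

Removing entries from the dict one at a time, as each P1 item is completed, gives a simple record of how far the migration has progressed.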
Key technical references
Apache Spark 4.0.0 Migration Guide – Official Documentation
Introducing Apache Spark 4.0 – Databricks Blog
Apache Spark 3 to Apache Spark 4 Migration – DZone
Upgrading from Spark 3.x to Spark 4.0: A Practical Guide – Sparking Scala
Apache Spark 4.0: A New Era for Scalable Machine Learning and AI – Databricks Community
Past Memory Big Data
A popular big-data architecture channel with over 100,000 developers. Publishes articles on Spark, Hadoop, Flink, Kafka and more. Visit the Past Memory Big Data blog at https://www.iteblog.com. Search "Past Memory" on Google or Baidu.
