How to Fix Spark OOM Errors: Practical Memory & Performance Tuning
This guide analyzes common Spark Out‑Of‑Memory scenarios—such as massive data volumes, data skew, and improper resource allocation—and provides step‑by‑step configurations, memory‑management tweaks, partitioning strategies, and shuffle optimizations to prevent OOM failures in production.
Root Causes of Spark OOM
Out‑Of‑Memory (OOM) in Spark jobs typically stems from four main issues:
Excessive data volume: Large datasets (hundreds of millions of rows or terabytes) cause memory overflow during shuffle or join operations.
Data skew: Uneven data distribution leads some executors to handle far more data than others, exhausting their memory.
Improper resource allocation: Too little executor memory or CPU, or excessive caching, leaves insufficient heap space.
Excessive or inefficient caching: Overuse of cache() or persist() without timely unpersisting retains large RDDs in memory.
Solution 1 – Adjust Executor Memory and CPU
Allocate enough memory and cores to each executor.
Increase executor memory, e.g.:
--executor-memory 8G
Enable and size off‑heap memory when needed:
--conf spark.memory.offHeap.enabled=true
--conf spark.memory.offHeap.size=4G
Increase executor cores to speed up processing:
--executor-cores 4
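The same settings can also be applied when the session is built in code. A minimal sketch, assuming a Scala application and purely illustrative values (8 GB heap, 4 GB off-heap, 4 cores per executor); in most deployments these flags are passed to spark-submit instead, and builder settings only take effect if no SparkContext is already running:
import org.apache.spark.sql.SparkSession

// Illustrative sizing only; match executor memory, off-heap space and cores
// to what a single node in your cluster can actually provide.
val spark = SparkSession.builder()
  .appName("oom-tuning-demo")
  .config("spark.executor.memory", "8g")
  .config("spark.executor.cores", "4")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "4g")
  .getOrCreate()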
Solution 2 – Tune Spark Memory Management
Modify the unified memory manager parameters to balance storage and execution memory.
Adjust memory fraction and storage fraction:
--conf spark.memory.fraction=0.8
--conf spark.memory.storageFraction=0.5
Release cached data promptly:
rdd.unpersist()
Use a less memory‑intensive storage level, e.g.:
rdd.persist(StorageLevel.MEMORY_AND_DISK)
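As a concrete illustration, the sketch below caches a reused pair RDD with MEMORY_AND_DISK so that partitions which do not fit in storage memory spill to local disk, then unpersists it once both actions have run. Here sc is the SparkContext; the input path and names are hypothetical:
import org.apache.spark.storage.StorageLevel

// Hypothetical pair RDD reused by the two actions below.
val events = sc.textFile("hdfs:///logs/events.csv")
  .map(line => (line.split(",")(0), 1L))

// MEMORY_AND_DISK spills partitions that do not fit on the heap to disk
// instead of failing with OOM or recomputing them from scratch.
events.persist(StorageLevel.MEMORY_AND_DISK)

val total = events.count()
val perKey = events.reduceByKey(_ + _).collectAsMap()

// Release storage memory as soon as the cached data is no longer needed.
events.unpersist()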
Solution 3 – Data Partitioning and Operation Optimization
Shuffle-heavy operations such as join and groupBy consume large amounts of memory. Apply the following tactics:
Increase the number of partitions to avoid large partitions:
rdd.repartition(200)
// or
rdd.reduceByKey(_ + _, numPartitions = 200)
Avoid wide dependencies such as groupByKey; prefer reduceByKey or other aggregations that pre-combine values on the map side before the shuffle.
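To make the difference concrete, here is a small contrast on a hypothetical words RDD (an RDD[String]): groupByKey ships every individual value across the network and holds all values of a key in one executor, while reduceByKey pre-combines within each partition so far less data reaches the shuffle.
// One (word, 1L) record per occurrence.
val pairs = words.map(w => (w, 1L))

// Wide dependency, no pre-combining: every single 1 is shuffled, and all
// values of a hot key end up in a single executor's memory.
val countsGroup = pairs.groupByKey().mapValues(_.sum)

// Map-side combining: each partition ships only one partial sum per key.
val countsReduce = pairs.reduceByKey(_ + _, numPartitions = 200)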
Mitigate data skew:
Random key prefixing to spread data:
import scala.util.Random
rdd.map { case (k, v) => (Random.nextInt(10) + "_" + k, v) }
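Salting only helps if the salt is removed again afterwards. A minimal two-stage sketch, assuming a skewed pair RDD named skewed and a salt range of 10 (both hypothetical): aggregate on the salted key first, then strip the salt and aggregate the much smaller intermediate result.
import scala.util.Random

// Stage 1: spread each hot key over 10 salted keys and pre-aggregate.
val partial = skewed
  .map { case (k, v) => (Random.nextInt(10) + "_" + k, v) }
  .reduceByKey(_ + _)

// Stage 2: strip the salt and combine the (at most 10) partial sums per key.
val totals = partial
  .map { case (saltedKey, v) => (saltedKey.split("_", 2)(1), v) }
  .reduceByKey(_ + _)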
Broadcast small tables during joins:
val broadcastVar = sc.broadcast(smallTable)
largeTable.mapPartitions { partition =>
  // Each executor reads the broadcast copy once instead of shuffling the small table.
  val small = broadcastVar.value
  partition.map(largeRow => ...)
}
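With the DataFrame API the same idea is usually expressed as a broadcast hint, which makes Spark perform a broadcast hash join and avoids shuffling the large table entirely. A sketch with hypothetical DataFrames largeDF and smallDF joined on a column named id:
import org.apache.spark.sql.functions.broadcast

// smallDF is copied to every executor once; largeDF is never shuffled.
val joined = largeDF.join(broadcast(smallDF), Seq("id"))
Spark also broadcasts automatically when a table is below spark.sql.autoBroadcastJoinThreshold (10 MB by default); the explicit hint forces it for tables you know are small.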
Solution 4 – Adjust Parallelism and Shuffle Settings
Increase the default shuffle parallelism and enable Adaptive Query Execution (AQE) to let Spark rebalance partitions automatically.
Set a higher shuffle partition count:
--conf spark.sql.shuffle.partitions=200
Enable AQE and define the target post‑shuffle size:
--conf spark.sql.adaptive.enabled=true
--conf spark.sql.adaptive.shuffle.targetPostShuffleInputSize=64M
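The same switches can be set from code before a query runs. A short sketch, assuming Spark 3.x, where spark.sql.adaptive.advisoryPartitionSizeInBytes replaces the older targetPostShuffleInputSize setting:
// AQE coalesces small post-shuffle partitions and, with skew-join handling
// enabled, splits oversized skewed partitions at runtime.
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")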
Conclusion
To resolve Spark OOM issues, combine the following practices:
Allocate sufficient memory and CPU to executors and tune off‑heap settings.
Adjust memory management fractions and clean or downgrade cache usage.
Increase partition counts, replace wide dependencies with aggregations, and handle data skew via random keys or broadcast joins.
Raise shuffle parallelism and enable AQE for dynamic partition adjustment.
These measures, together with JVM tuning and hardware upgrades when necessary, can significantly reduce the likelihood of OOM failures in Spark jobs.