How to Fix Spark OOM Errors: Practical Memory & Performance Tuning
This guide analyzes common Spark Out‑Of‑Memory scenarios—such as massive data volumes, data skew, and improper resource allocation—and provides step‑by‑step configurations, memory‑management tweaks, partitioning strategies, and shuffle optimizations to prevent OOM failures in production.
Root Causes of Spark OOM
Out‑Of‑Memory (OOM) in Spark jobs typically stems from four main issues:
Excessive data volume: Large datasets (hundreds of millions of rows or terabytes) cause memory overflow during shuffle or join operations.
Data skew: Uneven data distribution leads some executors to handle far more data than others, exhausting their memory.
Improper resource allocation: Too little executor memory or CPU, or excessive caching, leaves insufficient heap space.
Excessive or inefficient caching: Overuse of cache() or persist() without timely unpersisting retains large RDDs in memory.
Solution 1 – Adjust Executor Memory and CPU
Allocate enough memory and cores to each executor.
Increase executor memory, e.g.:
--executor-memory 8G
Enable and size off‑heap memory when needed:
--conf spark.memory.offHeap.enabled=true
--conf spark.memory.offHeap.size=4G
Increase executor cores to speed up processing:
--executor-cores 4
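The same settings can also be applied when the session is built in code. A minimal sketch, assuming a Scala application and purely illustrative values (8 GB heap, 4 GB off-heap, 4 cores per executor); in most deployments these flags are passed to spark-submit instead, and builder settings only take effect if no SparkContext is already running:
import org.apache.spark.sql.SparkSession

// Illustrative sizing only; match executor memory, off-heap space and cores
// to what a single node in your cluster can actually provide.
val spark = SparkSession.builder()
  .appName("oom-tuning-demo")
  .config("spark.executor.memory", "8g")
  .config("spark.executor.cores", "4")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "4g")
  .getOrCreate()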
Solution 2 – Tune Spark Memory Management
Modify the unified memory manager parameters to balance storage and execution memory.
Adjust memory fraction and storage fraction:
--conf spark.memory.fraction=0.8
--conf spark.memory.storageFraction=0.5
Release cached data promptly:
rdd.unpersist()
Use a less memory‑intensive storage level, e.g.:
rdd.persist(StorageLevel.MEMORY_AND_DISK)
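As a concrete illustration, the sketch below caches a reused pair RDD with MEMORY_AND_DISK so that partitions which do not fit in storage memory spill to local disk, then unpersists it once both actions have run. Here sc is the SparkContext; the input path and names are hypothetical:
import org.apache.spark.storage.StorageLevel

// Hypothetical pair RDD reused by the two actions below.
val events = sc.textFile("hdfs:///logs/events.csv")
  .map(line => (line.split(",")(0), 1L))

// MEMORY_AND_DISK spills partitions that do not fit on the heap to disk
// instead of failing with OOM or recomputing them from scratch.
events.persist(StorageLevel.MEMORY_AND_DISK)

val total = events.count()
val perKey = events.reduceByKey(_ + _).collectAsMap()

// Release storage memory as soon as the cached data is no longer needed.
events.unpersist()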
Solution 3 – Data Partitioning and Operation Optimization
Shuffle-heavy operations such as join and groupBy consume large amounts of memory. Apply the following tactics:
Increase the number of partitions to avoid large partitions:
rdd.repartition(200)
// or
rdd.reduceByKey(_ + _, numPartitions = 200)
Avoid wide dependencies such as groupByKey; prefer reduceByKey or other aggregations that pre-combine values on the map side before the shuffle.
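To make the difference concrete, here is a small contrast on a hypothetical words RDD (an RDD[String]): groupByKey ships every individual value across the network and holds all values of a key in one executor, while reduceByKey pre-combines within each partition so far less data reaches the shuffle.
// One (word, 1L) record per occurrence.
val pairs = words.map(w => (w, 1L))

// Wide dependency, no pre-combining: every single 1 is shuffled, and all
// values of a hot key end up in a single executor's memory.
val countsGroup = pairs.groupByKey().mapValues(_.sum)

// Map-side combining: each partition ships only one partial sum per key.
val countsReduce = pairs.reduceByKey(_ + _, numPartitions = 200)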
Mitigate data skew:
Random key prefixing to spread data:
import scala.util.Random
rdd.map { case (k, v) => (Random.nextInt(10) + "_" + k, v) }
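Salting only helps if the salt is removed again afterwards. A minimal two-stage sketch, assuming a skewed pair RDD named skewed and a salt range of 10 (both hypothetical): aggregate on the salted key first, then strip the salt and aggregate the much smaller intermediate result.
import scala.util.Random

// Stage 1: spread each hot key over 10 salted keys and pre-aggregate.
val partial = skewed
  .map { case (k, v) => (Random.nextInt(10) + "_" + k, v) }
  .reduceByKey(_ + _)

// Stage 2: strip the salt and combine the (at most 10) partial sums per key.
val totals = partial
  .map { case (saltedKey, v) => (saltedKey.split("_", 2)(1), v) }
  .reduceByKey(_ + _)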
Broadcast small tables during joins:
val broadcastVar = sc.broadcast(smallTable)
largeTable.mapPartitions { partition =>
  // Each executor reads the broadcast copy once instead of shuffling the small table.
  val small = broadcastVar.value
  partition.map(largeRow => ...)
}
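With the DataFrame API the same idea is usually expressed as a broadcast hint, which makes Spark perform a broadcast hash join and avoids shuffling the large table entirely. A sketch with hypothetical DataFrames largeDF and smallDF joined on a column named id:
import org.apache.spark.sql.functions.broadcast

// smallDF is copied to every executor once; largeDF is never shuffled.
val joined = largeDF.join(broadcast(smallDF), Seq("id"))
Spark also broadcasts automatically when a table is below spark.sql.autoBroadcastJoinThreshold (10 MB by default); the explicit hint forces it for tables you know are small.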
Solution 4 – Adjust Parallelism and Shuffle Settings
Increase the default shuffle parallelism and enable Adaptive Query Execution (AQE) to let Spark rebalance partitions automatically.
Set a higher shuffle partition count:
--conf spark.sql.shuffle.partitions=200
Enable AQE and define the target post‑shuffle size:
--conf spark.sql.adaptive.enabled=true
--conf spark.sql.adaptive.shuffle.targetPostShuffleInputSize=64M
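The same switches can be set from code before a query runs. A short sketch, assuming Spark 3.x, where spark.sql.adaptive.advisoryPartitionSizeInBytes replaces the older targetPostShuffleInputSize setting:
// AQE coalesces small post-shuffle partitions and, with skew-join handling
// enabled, splits oversized skewed partitions at runtime.
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")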
Conclusion
To resolve Spark OOM issues, combine the following practices:
Allocate sufficient memory and CPU to executors and tune off‑heap settings.
Adjust memory management fractions and clean or downgrade cache usage.
Increase partition counts, replace wide dependencies with aggregations, and handle data skew via random keys or broadcast joins.
Raise shuffle parallelism and enable AQE for dynamic partition adjustment.
These measures, together with JVM tuning and hardware upgrades when necessary, can significantly reduce the likelihood of OOM failures in Spark jobs.