Understanding and Resolving Spark OOM Issues: Memory Model, Causes, and Optimization Techniques

This article explains Spark's memory architecture, identifies common out‑of‑memory scenarios such as map‑stage and shuffle‑stage overloads, and provides practical solutions and performance‑tuning tips—including repartitioning, off‑heap storage, code optimizations, and key configuration parameters—to prevent and mitigate OOM errors in large‑scale data processing.

Big Data Technology & Architecture

Dec 23, 2019

In Spark, out‑of‑memory (OOM) problems mainly arise from two situations: memory overflow during the map phase and memory overflow after shuffle operations. Both stem from how Spark allocates memory within each executor.

Spark Memory Model : An executor's memory is divided into three regions: execution memory (shuffle, join, sort, and aggregation buffers), storage memory (cached and persisted RDDs and broadcast variables), and other memory (user data structures and Spark's internal metadata). Before Spark 1.6.0 the execution and storage regions had fixed sizes, set by spark.shuffle.memoryFraction (default 0.2) and spark.storage.memoryFraction (default 0.6). Since Spark 1.6.0 the two regions share one pool and can borrow from each other, which improves overall utilization. Off‑heap memory was also introduced so that large objects can be kept outside the JVM heap: through StorageLevel.OFF_HEAP (which required Tachyon in early releases) or, in later releases, through spark.memory.offHeap.enabled together with spark.memory.offHeap.size.

OOM typically occurs in the execution region because storage overflow merely evicts old cached data without crashing the application.
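The split described above can be estimated with simple arithmetic. The sketch below is only an approximation of the unified model: it assumes the Spark 2.x defaults that are not stated in this article (300 MB of reserved memory, spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5), and MemoryRegions and MemoryEstimate are illustrative names, not Spark APIs.

```scala
// Rough per-executor estimate of the unified memory regions.
// Assumed defaults (Spark 2.x): 300 MB reserved, spark.memory.fraction = 0.6,
// spark.memory.storageFraction = 0.5. All names here are illustrative.
final case class MemoryRegions(
    unifiedMb: Double,   // execution + storage pool
    storageMb: Double,   // part of the pool protected for cached blocks
    executionMb: Double, // remainder of the pool; the boundary is soft
    otherMb: Double)     // user data structures and internal metadata

object MemoryEstimate {
  val ReservedMb = 300.0

  def estimate(
      heapMb: Double,
      memoryFraction: Double = 0.6,
      storageFraction: Double = 0.5): MemoryRegions = {
    val usable = math.max(heapMb - ReservedMb, 0.0)
    val unified = usable * memoryFraction
    val storage = unified * storageFraction
    MemoryRegions(unified, storage, unified - storage, usable - unified)
  }

  def main(args: Array[String]): Unit =
    println(estimate(heapMb = 4096.0)) // a 4 GB executor heap
}
```

Because execution and storage can borrow from each other, the storage and execution figures are starting points rather than hard limits.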

Resolution Methods :

Excessive object creation in map operations : A map that generates many objects per record (e.g., rdd.map(x => for(i <- 1 to 10000) yield i.toString)) can exhaust execution memory. Shrink each task by calling rdd.repartition(...) with a higher partition count before the heavy map, so that every task processes fewer records. Do not use coalesce for this: without a shuffle it cannot increase the number of partitions.

Data skew : Uneven key distribution concentrates records in a few partitions, and the tasks that process those partitions can run out of memory. The remedy is similar: rebalance the data with repartition.

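Improper use of coalesce : Coalescing from many partitions to few (e.g., coalesce(10)) adds no shuffle boundary, so the upstream computation is also executed by only 10 tasks. Parallelism drops and memory usage per task multiplies, which can cause OOM. Use repartition(10) instead: it forces a shuffle, so the upstream stage keeps its original parallelism and only the output is written to 10 partitions.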
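The difference between the two calls can be checked on a local session. This is a minimal sketch, assuming Spark 2.x or later with SparkSession available; the object name, the partition counts, and the sample data are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object RepartitionVsCoalesce {
  // Returns (partitions after coalesce(400), partitions after repartition(400)).
  // Without a shuffle, coalesce cannot go above the current partition count.
  def widen(spark: SparkSession): (Int, Int) = {
    val source = spark.sparkContext.parallelize(1 to 100000, numSlices = 200)
    (source.coalesce(400).getNumPartitions, source.repartition(400).getNumPartitions)
  }

  // Both results have 10 partitions, but the map runs differently:
  // after coalesce(10) it runs inside 10 tasks, after repartition(10)
  // it still runs in 200 tasks and only the shuffled output has 10 partitions.
  def narrow(spark: SparkSession): (Int, Int) = {
    val source = spark.sparkContext.parallelize(1 to 100000, numSlices = 200)
    val expanded = source.map(x => (x, x.toString * 10)) // stand-in for a heavy map
    (expanded.coalesce(10).getNumPartitions, expanded.repartition(10).getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("repartition-vs-coalesce")
      .getOrCreate()
    try {
      println(widen(spark))
      println(narrow(spark))
    } finally spark.stop()
  }
}
```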

Shuffle‑stage overflow : Large shuffle files (from joins, reduceByKey, etc.) can overflow memory. Adjust the number of shuffle partitions via spark.default.parallelism (or spark.sql.shuffle.partitions for Spark SQL) or increase partitions in a custom partitioner.
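One way to apply these settings is sketched below. The values 400 and 800 are arbitrary placeholders, not recommendations, and the object name is hypothetical; the explicit numPartitions argument of reduceByKey overrides the default for a single operation.

```scala
import org.apache.spark.sql.SparkSession

object ShufflePartitions {
  def session(): SparkSession = SparkSession.builder()
    .master("local[4]")
    .appName("shuffle-partitions")
    .config("spark.default.parallelism", "400")    // RDD shuffles (placeholder value)
    .config("spark.sql.shuffle.partitions", "400") // Spark SQL shuffles (placeholder value)
    .getOrCreate()

  // Passes an explicit partition count to one shuffle operation.
  def explicitPartitions(spark: SparkSession, numPartitions: Int): Int = {
    val pairs = spark.sparkContext.parallelize(1 to 100000).map(i => (i % 1000, 1L))
    pairs.reduceByKey(_ + _, numPartitions).getNumPartitions
  }

  def main(args: Array[String]): Unit = {
    val spark = session()
    try println(explicitPartitions(spark, 800)) finally spark.stop()
  }
}
```

More partitions mean smaller shuffle blocks per task, at the cost of more tasks to schedule.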

Standalone mode resource imbalance : If --total-executor-cores and --executor-memory are set but --executor-cores is omitted, executors may have unequal core counts, causing some to run many tasks simultaneously and run out of memory. Specify --executor-cores or spark.executor.cores to balance resources.
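The same settings can be written as configuration properties. The sketch below assumes a hypothetical sizing of 32 cores in total with 4 cores and 8 GB per executor; the numbers and the object name are illustrative only.

```scala
import org.apache.spark.SparkConf

object StandaloneResources {
  // Hypothetical sizing, chosen only to make the arithmetic easy to follow.
  val TotalCores = 32
  val CoresPerExecutor = 4

  def conf(): SparkConf = new SparkConf(false) // false: ignore system properties
    .set("spark.cores.max", TotalCores.toString)            // --total-executor-cores
    .set("spark.executor.cores", CoresPerExecutor.toString) // --executor-cores
    .set("spark.executor.memory", "8g")                     // --executor-memory

  // With a fixed core count per executor, every executor runs at most
  // this many tasks at once and the heap is shared between them.
  def executors: Int = TotalCores / CoresPerExecutor

  def main(args: Array[String]): Unit = {
    println(conf().toDebugString)
    println(s"executors: $executors, concurrent tasks per executor: $CoresPerExecutor")
  }
}
```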

Shared objects to reduce OOM : Using immutable literals reduces object creation. For example, rdd.flatMap(x => for(i <- 1 to 1000) yield "key"+"value") reuses a single string from the JVM constant pool, whereas ("key","value") creates a new tuple each time.
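The same idea applies to any immutable value: allocate it once outside the loop and emit the same reference. The plain Scala sketch below needs no Spark; the object and method names are hypothetical, and reference equality (eq / ne) is used only to make the sharing visible.

```scala
object SharedObjects {
  // Allocates a new Tuple2 for every element.
  def freshTuples(n: Int): Seq[(String, String)] =
    for (_ <- 1 to n) yield ("key", "value")

  // Allocates one Tuple2 and reuses the same reference for every element.
  def sharedTuples(n: Int): Seq[(String, String)] = {
    val shared = ("key", "value")
    for (_ <- 1 to n) yield shared
  }

  def main(args: Array[String]): Unit = {
    val fresh = freshTuples(2)
    val shared = sharedTuples(2)
    println(fresh(0) eq fresh(1))   // distinct objects
    println(shared(0) eq shared(1)) // same object
  }
}
```

Sharing is only safe for immutable values, since every consumer sees the same instance.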

Performance‑Optimization Tips :

Replace many small map / filter / flatMap calls with a single mapPartitions to reduce intermediate RDD creation.

Use broadcast joins when one side of the join is small to avoid costly data shuffling.

Apply filter before join (predicate push‑down) to shrink shuffle data; Spark SQL already performs this automatically.

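Prefer combineByKey (or reduceByKey and aggregateByKey, which are built on it) over groupByKey followed by aggregation: values are combined on the map side first, which reduces shuffle volume.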

When memory is insufficient, persist with StorageLevel.MEMORY_AND_DISK_SER instead of plain cache() to spill to disk.

Co‑locate Spark and HBase clusters (or ensure Spark nodes cover all HBase regions) to minimize network transfer of large HFiles.
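Three of the tips above (combineByKey, a broadcast join, and serialized persistence) can be sketched on a local session. This is an illustration under assumptions, not the article's own code: the data is made up, the object name is hypothetical, and the join is written by hand on RDDs with sc.broadcast instead of relying on Spark SQL's planner.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object TuningTips {
  // Per-key average with combineByKey: (sum, count) pairs are merged
  // on the map side before anything is shuffled.
  def averages(spark: SparkSession): Map[String, Double] = {
    val scores = spark.sparkContext.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 10.0)))
    val sumCount = scores.combineByKey(
      (v: Double) => (v, 1L),
      (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1L),
      (x: (Double, Long), y: (Double, Long)) => (x._1 + y._1, x._2 + y._2))
    sumCount.mapValues { case (sum, n) => sum / n }.collect().toMap
  }

  // Map-side join: the small table is broadcast to every executor,
  // so the large RDD is never shuffled.
  def broadcastJoin(spark: SparkSession): Seq[(Int, String)] = {
    val sc = spark.sparkContext
    val small = sc.broadcast(Map(1 -> "one", 2 -> "two"))
    val large = sc.parallelize(Seq(1, 2, 3, 1))
    val joined = large.flatMap(k => small.value.get(k).map(v => (k, v)))
    // Serialized blocks that do not fit in memory are spilled to disk.
    joined.persist(StorageLevel.MEMORY_AND_DISK_SER)
    joined.collect().toSeq.sorted
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("tuning-tips").getOrCreate()
    try {
      println(averages(spark))
      println(broadcastJoin(spark))
    } finally spark.stop()
  }
}
```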

Key Configuration Parameters :

spark.driver.memory : increase it when the driver itself runs out of memory, for example while planning jobs over a very large number of partitions.

spark.rdd.compress : enable it to compress serialized RDD partitions, trading extra CPU time for less memory.

spark.serializer : set it to org.apache.spark.serializer.KryoSerializer for faster and more compact serialization, and register custom classes as needed:

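import org.apache.spark.{SparkConf, SparkContext}
// MyClass1 and MyClass2 stand for the application's own classes.
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)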

spark.memory.storageFraction : the share of the unified memory pool within which cached blocks are protected from eviction by execution (default 0.5). Lower it when execution dominates and raise it when cached data matters more.

spark.locality.wait : increase it to give tasks more time to wait for a slot close to their data, especially when partitions are large.

spark.speculation : enable it to launch duplicate copies of slow tasks on other executors, so that a single straggler does not block the job.
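The executor-side parameters above can be collected in one SparkConf. The values below are placeholders for illustration, not recommendations, and the object name is hypothetical. spark.driver.memory is left out on purpose: in client mode the driver JVM is already running when this code executes, so it is normally passed to spark-submit as --driver-memory.

```scala
import org.apache.spark.SparkConf

object TuningConf {
  def build(): SparkConf = new SparkConf(false) // false: ignore system properties
    .set("spark.rdd.compress", "true")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.memory.storageFraction", "0.3") // default 0.5; favors execution
    .set("spark.locality.wait", "6s")           // default 3s
    .set("spark.speculation", "true")

  def main(args: Array[String]): Unit =
    println(build().toDebugString)
}
```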

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
