Upgrading Spark from 1.4.1 to 1.6.1: Memory, Storage, and Operational Challenges

The article details the author’s experience upgrading a production Spark cluster from version 1.4.1 to 1.6.1. It covers memory‑spill failures, unified memory management, a BlockManager deadlock, YARN container kills, UI quirks, and Spark‑SQL compatibility issues, and proposes a concrete code‑level or configuration fix for most of them.

Architecture Digest

May 4, 2016

The author, a data‑architecture engineer at Didi, reports on more than a year of using Spark for batch and machine‑learning workloads and describes the practical problems encountered when upgrading the production cluster from Spark 1.4.1 to Spark 1.6.1 under YARN.

Memory‑spill issue: Large tasks can fail with the exception “Unable to acquire 163 bytes of memory, got 0”. The root cause lies in how Tungsten‑sort interacts with the spill mechanism: a Spillable collection (ExternalAppendOnlyMap or ExternalSorter) may hold memory without being registered as a MemoryConsumer, so the memory manager cannot ask it to spill and a later request from another consumer is left with nothing, which surfaces as an out‑of‑memory error. Two mitigation tactics are suggested, increasing the number of partitions or forcing a spill after a fixed number of elements via spark.shuffle.spill.numElementsForceSpillThreshold, but both have drawbacks. A more robust fix is to change Spillable to inherit from MemoryConsumer and to add a new configuration, spark.shuffle.spill.memoryForceSpillThreshold (default 640 MB), that forces a spill as soon as the collection's current memory exceeds the threshold.
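The memory‑based threshold can be illustrated with a small, self‑contained sketch. This is a toy model, not Spark's Spillable and not the author's patch: the class name ToySpillableBuffer and its methods are hypothetical, and "spilling" here simply drops the buffered records where a real implementation would write them to disk.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy stand-in for a spillable collection; NOT Spark's actual class.
// `memoryForceSpillThreshold` mirrors the proposed (non-upstream) setting
// spark.shuffle.spill.memoryForceSpillThreshold, expressed in bytes.
class ToySpillableBuffer(memoryForceSpillThreshold: Long) {
  private val inMemory = ArrayBuffer.empty[Array[Byte]]
  private var bytesInMemory = 0L
  private var spills = 0

  def spillCount: Int = spills
  def currentMemory: Long = bytesInMemory

  // Insert a record, then spill if the tracked size passed the threshold.
  def insert(record: Array[Byte]): Unit = {
    inMemory += record
    bytesInMemory += record.length
    if (bytesInMemory > memoryForceSpillThreshold) spill()
  }

  // A real implementation would write the data to disk; this one only
  // releases the in-memory copy and resets the accounting.
  private def spill(): Unit = {
    inMemory.clear()
    bytesInMemory = 0L
    spills += 1
  }
}
```

Unlike an element‑count threshold, the trigger here depends on the bytes actually held, so it behaves the same whether records are small or large.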

Unified memory management: Prior to Spark 1.6, storage and execution memory were isolated, so memory idle on one side could not be used by the other. Spark 1.6 introduces Unified Memory Management, in which the two share a single pool, by default 75 % of the executor heap that remains after a 300 MB reservation. However, the author observed that during shuffle the storage memory can monopolise the pool, causing delays. The proposed adjustment stops storage from borrowing execution memory while still allowing execution to borrow idle storage memory. It is enabled by a new setting, spark.unifiedMemory.useStaticStorageRegion, and implemented by changing maxStorageMemory and acquireStorageMemory in UnifiedMemoryManager:

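    // Patched UnifiedMemoryManager.maxStorageMemory: with the static region
    // enabled, storage is capped at its own region instead of growing into
    // unused execution memory.
    override def maxStorageMemory: Long = synchronized {
      if (useStaticStorageMemory) {
        storageRegionSize
      } else {
        maxMemory - onHeapExecutionMemoryPool.memoryUsed
      }
    }

    // Fragment from the patched acquireStorageMemory: storage may take back
    // only what brings its pool up to storageRegionSize.
    if (useStaticStorageMemory &&
        (storageRegionSize - storageMemoryPool.poolSize) < onHeapExecutionMemoryPool.memoryFree) {
      maxBorrowMemory = storageRegionSize - storageMemoryPool.poolSize
    }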
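For orientation, the sizes that storageRegionSize and maxMemory take under Spark 1.6's defaults can be worked out by hand. The sketch below only reproduces that arithmetic (300 MB reserved, spark.memory.fraction = 0.75, spark.memory.storageFraction = 0.5). The object and method names are illustrative, and a real executor reports a heap slightly smaller than its configured size, so actual values will differ a little.

```scala
// Default sizing rules of Spark 1.6 unified memory, as plain arithmetic.
// Names are illustrative; only the constants come from Spark 1.6 defaults.
object UnifiedMemorySizing {
  // Fixed reservation subtracted from the heap before any fraction applies.
  val ReservedBytes: Long = 300L * 1024 * 1024

  // Bytes shared by execution and storage (spark.memory.fraction).
  def unifiedPool(heapBytes: Long, memoryFraction: Double = 0.75): Long =
    ((heapBytes - ReservedBytes) * memoryFraction).toLong

  // Part of the pool set aside for storage (spark.memory.storageFraction).
  def storageRegion(heapBytes: Long,
                    memoryFraction: Double = 0.75,
                    storageFraction: Double = 0.5): Long =
    (unifiedPool(heapBytes, memoryFraction) * storageFraction).toLong
}
```

For a 4096 MB heap this gives a shared pool of 2847 MB and a storage region of about 1423 MB; with the static region enabled, cached blocks would be limited to the latter.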

BlockManager deadlock: When the cached data is large, BlockManager may deadlock because BlockInfo lacks proper read/write locks. The author suggests adding a global ConcurrentHashMap[BlockId, Long] that records which task holds the lock on each block, checking it before acquiring a lock, and releasing every lock associated with a task when that task completes.
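A minimal sketch of such a lock registry follows. It is an illustration under stated assumptions, not the author's patch: BlockLockRegistry and its methods are hypothetical names, a plain String stands in for Spark's BlockId, and the Long value is assumed to be the task attempt ID.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical registry of block locks keyed by block ID.
// String stands in for Spark's BlockId; the value is the owning task's ID.
class BlockLockRegistry {
  // java.lang.Long is used so that a missing entry is a real null
  // rather than an unboxed 0, which would look like task 0.
  private val owners = new ConcurrentHashMap[String, java.lang.Long]()

  // True if `taskId` now holds (or already held) the lock on `blockId`.
  def tryLock(blockId: String, taskId: Long): Boolean = {
    val previous = owners.putIfAbsent(blockId, Long.box(taskId))
    previous == null || previous.longValue == taskId
  }

  // Release one lock, but only if `taskId` is its current owner.
  def unlock(blockId: String, taskId: Long): Boolean =
    owners.remove(blockId, Long.box(taskId))

  // Release every lock held by a finished task; returns how many were released.
  def releaseAllForTask(taskId: Long): Int = {
    var released = 0
    val it = owners.entrySet().iterator()
    while (it.hasNext) {
      val entry = it.next()
      // Conditional remove: skip the entry if ownership changed meanwhile.
      if (entry.getValue.longValue == taskId &&
          owners.remove(entry.getKey, entry.getValue)) {
        released += 1
      }
    }
    released
  }
}
```

Calling releaseAllForTask from a task‑completion hook is what keeps a failed or killed task from leaving blocks locked forever, which is the situation the deadlock report describes.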

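YARN memory‑over‑commit kill: YARN kills a container whose total memory use exceeds the executor memory plus the configured overhead. Spark’s default overhead (10 % of executor memory, with a minimum of 384 MB) is often too small for this, so containers get killed. Raising spark.yarn.executor.memoryOverhead in proportion to the executor size (e.g., to 20 %; the setting itself takes an absolute value in MB) and reducing spark.memory.fraction mitigates the issue, though the author notes the need for better defaults.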
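Because the overhead setting is an absolute number of megabytes, a proportional value has to be computed per executor size. The helper below is a small sketch of that calculation. The object and method names are made up for illustration, while the 10 % factor and 384 MB floor are the Spark 1.6 defaults quoted above.

```scala
// Illustrative helper for sizing spark.yarn.executor.memoryOverhead.
object YarnOverhead {
  val MinOverheadMb = 384

  // Overhead in MB for a given ratio, never below the 384 MB floor.
  def overheadMb(executorMemoryMb: Int, ratio: Double): Int =
    math.max((ratio * executorMemoryMb).toInt, MinOverheadMb)

  // What Spark 1.6 picks when the setting is left unset.
  def defaultOverheadMb(executorMemoryMb: Int): Int =
    overheadMb(executorMemoryMb, 0.10)

  // The flag to pass to spark-submit for the chosen ratio.
  def confFlag(executorMemoryMb: Int, ratio: Double): String =
    s"--conf spark.yarn.executor.memoryOverhead=${overheadMb(executorMemoryMb, ratio)}"
}
```

For an 8192 MB executor the default works out to 819 MB, and a 20 % ratio to 1638 MB; the container YARN must grant grows by the same amount, so the larger overhead reduces how many executors fit on a node.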

UI and usability complaints: The author lists several UI annoyances (a worker list that shows only IP addresses, a misplaced worker list, duplicate SQL tabs) as well as noisy log messages at Spark‑SQL startup, and recommends source‑code tweaks such as redirecting that output to stderr.

Spark‑SQL pros and cons: Performance improves by more than 20 % in Spark 1.6 compared with 1.4, but syntax compatibility remains problematic. For example, expressions that mix double and decimal types cause errors (tracked as SPARK‑13772).

Conclusion: Upgrading to Spark 1.6 yields noticeable speed gains but introduces several memory‑related bugs that remain unresolved as of the 1.6.1 release. The author encourages the community to adopt the provided patches or submit PRs to address these shortcomings.
