
Spark Memory Management and Tuning Practices for Large-Scale Billing Systems

This article explains how Spark's memory management models and configuration parameters can be tuned to handle massive billing data efficiently, covering StaticMemoryManager vs UnifiedMemoryManager, storage and shuffle memory fractions, common OOM and file‑not‑found issues, and practical performance‑optimisation tips.

JD Tech

Background: The finance department runs billing and settlement between the logistics headquarters and its subsidiaries, which requires processing a very large volume of data every month. Spark's flexible DAG programming model and its support for multiple heterogeneous data sources improve development efficiency and make complex billing tasks feasible.

Business Overview: The system cleans, distributes, and stores billing documents, quote data, and master data. The billing engine applies formulas per business type and stores the results in a detail database for settlement.

Preliminary Knowledge – Spark Memory Management Modes: Spark versions before 1.6.0 use StaticMemoryManager; Spark 1.6.0 and later default to UnifiedMemoryManager. In Spark 1.6 through 2.x the static mode can be re-enabled by setting spark.memory.useLegacyMode to true (default false); the legacy mode was removed in Spark 3.0.

In StaticMemoryManager, heap space is divided into Storage and Shuffle regions. The Storage fraction is set by spark.storage.memoryFraction (default 0.6) and safety fraction spark.storage.safetyFraction (default 0.9), giving an effective storage memory of 0.54 of the heap.
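The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the 4096 MiB heap is an arbitrary example, the function name is hypothetical, and the shuffle safety fraction (spark.shuffle.safetyFraction, default 0.8) comes from the Spark 1.x defaults rather than from this article. Spark sizes these regions from the JVM's reported maximum memory, which is slightly below -Xmx, so real figures will be a little smaller.

```python
# Sketch of StaticMemoryManager region sizes (legacy mode).
# heap_mib approximates the JVM max memory; 4096 below is an arbitrary example.

def static_regions(heap_mib,
                   storage_fraction=0.6,   # spark.storage.memoryFraction
                   storage_safety=0.9,     # spark.storage.safetyFraction
                   shuffle_fraction=0.2,   # spark.shuffle.memoryFraction
                   shuffle_safety=0.8):    # spark.shuffle.safetyFraction (assumed default)
    """Return the usable storage and shuffle memory, in MiB, for a given heap."""
    storage = heap_mib * storage_fraction * storage_safety
    shuffle = heap_mib * shuffle_fraction * shuffle_safety
    return {"storage_mib": round(storage), "shuffle_mib": round(shuffle)}


if __name__ == "__main__":
    # 0.6 * 0.9 = 0.54 of the heap for storage, 0.2 * 0.8 = 0.16 for shuffle.
    print(static_regions(4096))
```

With the defaults, roughly 54% of the heap is usable for cached blocks and 16% for shuffle, and the two regions cannot borrow from each other.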

When many RDDs are persisted, increase spark.storage.memoryFraction. When shuffle‑heavy jobs dominate, lower spark.storage.memoryFraction and raise spark.shuffle.memoryFraction to avoid spilling.
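As a rough sketch of that trade-off, the helper below builds spark-submit arguments for a chosen pair of legacy fractions. The helper name and the 0.4 / 0.4 split are hypothetical, not values from the article, and the sum check is a sanity check of this sketch rather than something Spark enforces. On Spark 1.6 through 2.x these two fractions only take effect when the legacy mode is enabled, which is why the flag is included.

```python
# Sketch: build spark-submit arguments that shift heap from storage to shuffle.
# The function name and the example fractions are illustrative assumptions.

def legacy_fraction_flags(storage_fraction, shuffle_fraction):
    """Return spark-submit --conf arguments for the two legacy fractions."""
    for value in (storage_fraction, shuffle_fraction):
        if not 0 < value < 1:
            raise ValueError("fractions must be between 0 and 1")
    # Sanity check of this sketch only: keep some heap for user code and objects.
    if storage_fraction + shuffle_fraction >= 1:
        raise ValueError("storage + shuffle fractions leave no heap for user code")
    return [
        "--conf", "spark.memory.useLegacyMode=true",
        "--conf", f"spark.storage.memoryFraction={storage_fraction}",
        "--conf", f"spark.shuffle.memoryFraction={shuffle_fraction}",
    ]


if __name__ == "__main__":
    # Hypothetical shuffle-heavy profile: less cache, more shuffle memory.
    print(" ".join(legacy_fraction_flags(0.4, 0.4)))
```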

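UnifiedMemoryManager merges Storage Memory and Execution Memory into a single region (Spark Memory) whose size is set by spark.memory.fraction. Storage and execution share this region dynamically, each borrowing from the other when it has free space, while User Memory remains a separate part of the heap.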
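A minimal sketch of how the unified layout works out for a given heap follows. The numbers used are the Spark documentation defaults, not figures from this article: 300 MiB of reserved memory, spark.memory.fraction = 0.6 (Spark 2.0 and later; 1.6 used 0.75), and spark.memory.storageFraction = 0.5. The function name and the 4096 MiB heap are illustrative.

```python
# Sketch of UnifiedMemoryManager region sizes.
# Defaults are taken from the Spark docs (assumption: Spark 2.0+ values).

RESERVED_MIB = 300  # fixed amount set aside before the fractions are applied


def unified_regions(heap_mib, memory_fraction=0.6, storage_fraction=0.5):
    """Return Spark Memory, the protected storage share, and User Memory in MiB."""
    usable = heap_mib - RESERVED_MIB
    if usable <= 0:
        raise ValueError("heap is smaller than the reserved memory")
    spark_memory = usable * memory_fraction      # spark.memory.fraction
    storage_floor = spark_memory * storage_fraction  # spark.memory.storageFraction
    return {
        "spark_memory_mib": round(spark_memory),
        "storage_floor_mib": round(storage_floor),
        "user_memory_mib": round(usable - spark_memory),
    }


if __name__ == "__main__":
    print(unified_regions(4096))
```

The storage figure here is a floor rather than a cap: cached blocks within it are not evicted by execution, but storage can grow beyond it while execution memory is idle.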

Key Parameters (default values):

spark.yarn.executor.memoryOverhead (default: executorMemory * 0.10, with a minimum of 384 MB): Off‑heap memory allocated to each executor on YARN. The driver has its own setting, spark.yarn.driver.memoryOverhead.

spark.shuffle.memoryFraction (default: 0.2): Proportion of executor heap used for shuffle aggregation. Applies to StaticMemoryManager (legacy mode) only.

spark.shuffle.io.maxRetries (default: 3): Number of automatic retries when a shuffle file fetch fails due to I/O errors.

spark.shuffle.io.retryWait (default: 5s): Wait interval between successive shuffle file fetch retries.

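A commonly observed issue is “file not found” errors caused by insufficient executor off‑heap memory: when an executor runs out of memory and is lost, the shuffle files it wrote become unavailable, so the tasks that try to fetch them fail and the job can crash. Increasing spark.yarn.executor.memoryOverhead to at least 1 GB mitigates these OOM failures and improves performance.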
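To see why the default can fall short of 1 GB, the sketch below applies the default formula from the parameter list (10% of executor memory, minimum 384 MB) and adds the result to the heap to get the per-executor memory request. The function names and executor sizes are illustrative assumptions, and YARN's rounding of requests to its allocation increment is not modeled.

```python
# Sketch: default executor memory overhead and the resulting container request.
# Names and example sizes are illustrative; YARN allocation rounding is ignored.

def default_overhead_mib(executor_memory_mib, factor=0.10, minimum_mib=384):
    """Default spark.yarn.executor.memoryOverhead: max(384, 10% of executor memory)."""
    return max(minimum_mib, int(executor_memory_mib * factor))


def container_request_mib(executor_memory_mib, overhead_mib=None):
    """Heap plus overhead, using the default overhead unless one is given."""
    if overhead_mib is None:
        overhead_mib = default_overhead_mib(executor_memory_mib)
    return executor_memory_mib + overhead_mib


if __name__ == "__main__":
    for heap in (2048, 4096, 8192):
        print(heap, default_overhead_mib(heap), container_request_mib(heap))
    # Explicit 1 GB overhead, as recommended above, for a 4 GB executor.
    print(container_request_mib(4096, overhead_mib=1024))
```

Under the default formula, any executor with less than about 10 GB of heap gets under 1 GB of overhead, so the setting has to be raised explicitly for smaller executors.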

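Frequent GC pauses can be reduced by adjusting spark.memory.fraction and the JVM young generation size (-Xmn). Raising spark.shuffle.io.maxRetries and spark.shuffle.io.retryWait does not shorten the pauses themselves, but it lets shuffle fetches survive a pause on the remote executor instead of failing. Monitoring the Spark Web UI for spill events helps fine‑tune spark.shuffle.memoryFraction.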
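The retry settings can be reasoned about with simple arithmetic: the maximum delay added by retrying is maxRetries * retryWait, which is 15 seconds with the defaults. The sketch below compares that window with an observed GC pause. The function names and the 40-second pause are hypothetical, and the comparison is only a rough heuristic, since real fetches also depend on network timeouts.

```python
# Sketch: how long shuffle fetch retries keep trying before giving up.
# Function names and the example pause length are illustrative assumptions.

def fetch_retry_window_s(max_retries=3, retry_wait_s=5):
    """Maximum delay caused by retrying: maxRetries * retryWait, in seconds."""
    return max_retries * retry_wait_s


def outlasts_pause(pause_s, max_retries=3, retry_wait_s=5):
    """Rough heuristic: do the retries span a remote pause of pause_s seconds?"""
    return fetch_retry_window_s(max_retries, retry_wait_s) > pause_s


if __name__ == "__main__":
    print(fetch_retry_window_s())        # defaults: 3 retries, 5 s apart
    print(outlasts_pause(40))            # hypothetical 40 s pause, defaults
    print(outlasts_pause(40, 10, 10))    # hypothetical raised settings
```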

Conclusion: Proper tuning of Spark memory parameters helps deliver accurate, timely billing for billions of records and maintains high throughput during peak loads, while monitoring and logging support reliable issue diagnosis.
