Effective Spark GC Tuning: Experiments, Results, and Best Practices
This article walks through a Spark job’s garbage‑collection tuning workflow, presents step‑by‑step experiments with different JVM options and collectors, compares performance under tight and normal memory conditions, and offers practical recommendations for choosing the optimal GC strategy in big‑data workloads.
GC Tuning Process for a Spark Job and Result Comparison
1. Spark Job Tuning Process
1.1 Tuning Workflow
(1) Collect data: record GC frequency and duration by adding the JVM options -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps. The logs appear in each executor's stdout file on the worker nodes, not in the driver log.
(2) Check for excessive GC: multiple full GCs occurring before a single task completes indicate there is not enough memory for task execution.
(3) If there are many young GCs but few full GCs, give Eden more memory: set -Xmn = 4/3 × E, where E is the estimated Eden size (the 4/3 factor accounts for the two survivor spaces).
(4) When the old generation is close to full: ① reduce spark.memory.fraction to shrink the RDD cache space; ② shrink the young generation by lowering -Xmn; ③ or adjust -XX:NewRatio (the default of 2 gives the old generation 2/3 of the heap).
(5) Enable G1GC: use -XX:+UseG1GC, and consider raising -XX:G1HeapRegionSize for large heaps.
(6) Estimate memory needs from the HDFS block size: a decompressed block typically occupies 2–3× its compressed size in memory.
(7) After each change, monitor how GC frequency and time respond.
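The sizing rules in steps (3) and (6) boil down to quick arithmetic. A minimal shell sketch, where the Eden estimate, block size, expansion factor, and task count are illustrative assumptions rather than measurements from the job in this article:

```shell
#!/bin/sh
# Step (3): size the young generation from the desired Eden size.
# -Xmn covers Eden plus the two survivor spaces, hence the 4/3 factor.
EDEN_MB=2048                  # assumed Eden estimate: 2 GB
XMN_MB=$((EDEN_MB * 4 / 3))
echo "-Xmn${XMN_MB}m"         # young-generation size to pass to the JVM

# Step (6): estimate task input memory from the HDFS block size.
# A compressed 128 MB block typically needs ~2-3x that once decompressed.
BLOCK_MB=128
DECOMP_FACTOR=3               # pessimistic 3x expansion (assumption)
TASKS_PER_EXECUTOR=4          # assumed concurrently running tasks
echo "~$((BLOCK_MB * DECOMP_FACTOR * TASKS_PER_EXECUTOR)) MB for concurrent task input"
```

The resulting -Xmn value would then be added to spark.executor.extraJavaOptions alongside the GC logging flags.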
1.2 Memory‑Constrained Scenario
To simulate memory pressure, a skewed key is introduced in Job 3.
(1) Original submission script
<code>nohup /app/hadoop/spark-2.4.3/bin/spark-submit --master spark://hadoop1:7077 --executor-memory 7g --driver-memory 4g --total-executor-cores 3 --conf spark.default.parallelism=9 --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" --class com.demo.Test_GC /app/hadoop/Spark/spark-test-1.0.0-SNAPSHOT.jar > test.log &</code>
Result: the default PS (Parallel Scavenge) + PO (Parallel Old) collector triggers many full GCs; the job fails after task retries.
(2) Increase spark.memory.fraction to 0.8
<code>nohup /app/hadoop/spark-2.4.3/bin/spark-submit ... --conf spark.memory.fraction=0.8 ...</code>
Result: storage memory is larger, but full GCs still occur and the job fails.
(3) Decrease spark.memory.fraction to 0.5
<code>nohup /app/hadoop/spark-2.4.3/bin/spark-submit ... --conf spark.memory.fraction=0.5 ...</code>
Result: fewer full GC events; the job eventually succeeds.
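The effect of spark.memory.fraction can be estimated directly: Spark's unified region (execution + storage) is roughly (heap − 300 MB reserved) × fraction. A sketch for the 7 GB executor used here, approximating the heap as the full 7 GB:

```shell
#!/bin/sh
# Unified (execution + storage) memory under spark.memory.fraction.
# Spark reserves ~300 MB of the heap before applying the fraction.
HEAP_MB=7168            # --executor-memory 7g
RESERVED_MB=300
usable=$((HEAP_MB - RESERVED_MB))
echo "fraction=0.6 (default): $((usable * 60 / 100)) MB unified"
echo "fraction=0.8:           $((usable * 80 / 100)) MB unified"
echo "fraction=0.5:           $((usable * 50 / 100)) MB unified"
```

Lowering the fraction hands the difference back to user memory, so fewer long-lived cached blocks accumulate in the old generation, which is consistent with the drop in full GCs observed above.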
(4) Increase -XX:NewRatio to 3
<code>nohup /app/hadoop/spark-2.4.3/bin/spark-submit ... -XX:NewRatio=3 ...</code>
Result: full GCs are less frequent; the job succeeds.
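-XX:NewRatio=N sets old:young = N:1, so NewRatio=3 gives the old generation 3/4 of the heap. For the 7 GB executor heap (illustrative arithmetic, ignoring JVM rounding):

```shell
#!/bin/sh
# Generation split for -XX:NewRatio=3 (old : young = 3 : 1).
HEAP_MB=7168
NEW_RATIO=3
young=$((HEAP_MB / (NEW_RATIO + 1)))
old=$((HEAP_MB - young))
echo "young gen: ${young} MB"   # 1792 MB
echo "old gen:   ${old} MB"     # 5376 MB
```

Compared with the default NewRatio=2 (old gen ≈ 4779 MB here), the old generation gains roughly 600 MB, leaving more room for long-lived cached data before a full GC triggers.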
(5) Switch to CMS collector
<code>nohup /app/hadoop/spark-2.4.3/bin/spark-submit ... -XX:+UseConcMarkSweepGC -XX:+UseParNewGC ...</code>
Result: more storage memory, fewer full GCs; the job succeeds.
(6) Switch to G1 collector
<code>nohup /app/hadoop/spark-2.4.3/bin/spark-submit ... -XX:+UseG1GC ...</code>
Result: GC time drops markedly; the job succeeds.
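If G1 needs further tuning beyond -XX:+UseG1GC, a pause-time target, region size, and concurrent-marking threshold can be added to the same extraJavaOptions string. A hedged sketch based on the original submission script; the three G1 values are illustrative starting points, not the settings used in this experiment:

```shell
# Illustrative G1 options (the G1 flag values are assumptions):
nohup /app/hadoop/spark-2.4.3/bin/spark-submit \
  --master spark://hadoop1:7077 \
  --executor-memory 7g --driver-memory 4g --total-executor-cores 3 \
  --conf spark.default.parallelism=9 \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails \
          -XX:+PrintGCTimeStamps -XX:+UseG1GC -XX:MaxGCPauseMillis=200 \
          -XX:G1HeapRegionSize=16m -XX:InitiatingHeapOccupancyPercent=35" \
  --class com.demo.Test_GC \
  /app/hadoop/Spark/spark-test-1.0.0-SNAPSHOT.jar > test.log &
```

-XX:MaxGCPauseMillis trades throughput for shorter pauses, and lowering -XX:InitiatingHeapOccupancyPercent starts concurrent marking earlier, which helps heaps that fill with cached RDD blocks.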
1.3 Normal‑Memory Scenario
Without data skew, the same job is run with executor memory of 20 GB, 4 GB, and 2 GB under the PS+PO, CMS, and G1 collectors.
(1) 20 G memory
Overall execution times are similar; G1 shows lower total GC time but higher CPU usage, so its advantage is limited at 20 GB.
(2) 4 G memory
PS+PO and CMS outperform G1.
(3) 2 G memory
PS+PO shows the most stable total runtime, better than CMS and G1.
1.4 Summary
(1) In memory‑tight cases, frequent Full GC can be mitigated by:
Reducing spark.memory.fraction.
Increasing -XX:NewRatio to allocate more old‑gen space.
Using CMS or G1 collectors.
(2) In normal‑memory cases, choose collector based on allocated memory:
≤20 GB: the default PS+PO is generally best.
Large-scale, multi-core jobs (tens to hundreds of GB of memory): consider G1.
CMS offers a middle ground.
(3) Bigger memory allocation is not always beneficial; estimate required memory by caching an RDD and checking Spark’s Storage UI.
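One way to do that estimate is to cache a representative RDD from spark-shell, force it with an action, and read its in-memory size from the Storage tab of the Spark UI. A minimal sketch that assumes a running cluster; the input path is a placeholder, not a real dataset from this article:

```shell
# Requires the cluster from this article to be running; the HDFS path is hypothetical.
/app/hadoop/spark-2.4.3/bin/spark-shell --master spark://hadoop1:7077 <<'EOF'
val rdd = sc.textFile("hdfs:///data/sample_input")  // hypothetical input path
rdd.cache()
rdd.count()   // an action forces the cache to materialize
// Now open the Spark UI "Storage" tab and read "Size in Memory" for this RDD.
EOF
```

Sizing executors to the measured footprint, rather than simply maximizing memory, avoids the long full-GC pauses that very large heaps can introduce.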
Optimization Recommendations
1. Treat Spark tuning as a lifecycle process.
2. Prioritize code optimization, resource allocation, data skew handling, then GC tuning; early GC tweaks may be ineffective.
3. Excessive RDD creation/destruction can cause GC pressure; also watch for competition between execution memory and cached RDDs.
4. For code and resource tuning, see “Spark Memory Model and Optimization”.
Reference
Apache Spark tuning: https://www.iteblog.com/archives/2494.html
Data Thinking Notes
Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.