Effective Spark GC Tuning: Experiments, Results, and Best Practices
This article walks through a Spark job’s garbage‑collection tuning workflow, presents step‑by‑step experiments with different JVM options and collectors, compares performance under tight and normal memory conditions, and offers practical recommendations for choosing the optimal GC strategy in big‑data workloads.
GC Tuning Process for a Spark Job and Result Comparison
1. Spark Job Tuning Process
1.1 Tuning Workflow
(1) Collect data: record GC frequency and duration by adding the JVM options -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps. The logs appear in each executor's stdout file on the worker nodes, not in the driver log.
(2) Check for excessive GC: multiple full GCs occurring before a single task completes indicate there is not enough memory for task execution.
(3) If there are many young GCs but few full GCs, give Eden more memory: set -Xmn = 4/3 × E, where E is the estimated Eden size (the 4/3 factor accounts for the two survivor spaces).
(4) When the old generation is close to full: ① reduce spark.memory.fraction to shrink the RDD cache space; ② shrink the young generation by lowering -Xmn; ③ or adjust -XX:NewRatio (the default of 2 gives the old generation 2/3 of the heap).
(5) Enable G1GC: use -XX:+UseG1GC, and consider raising -XX:G1HeapRegionSize for large heaps.
(6) Estimate memory needs from the HDFS block size: a decompressed block typically occupies 2–3× its compressed size in memory.
(7) After each change, monitor how GC frequency and time respond.
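The sizing rules in steps (3) and (6) boil down to quick arithmetic. A minimal shell sketch, where the Eden estimate, block size, expansion factor, and task count are illustrative assumptions rather than measurements from the job in this article:

```shell
#!/bin/sh
# Step (3): size the young generation from the desired Eden size.
# -Xmn covers Eden plus the two survivor spaces, hence the 4/3 factor.
EDEN_MB=2048                  # assumed Eden estimate: 2 GB
XMN_MB=$((EDEN_MB * 4 / 3))
echo "-Xmn${XMN_MB}m"         # young-generation size to pass to the JVM

# Step (6): estimate task input memory from the HDFS block size.
# A compressed 128 MB block typically needs ~2-3x that once decompressed.
BLOCK_MB=128
DECOMP_FACTOR=3               # pessimistic 3x expansion (assumption)
TASKS_PER_EXECUTOR=4          # assumed concurrently running tasks
echo "~$((BLOCK_MB * DECOMP_FACTOR * TASKS_PER_EXECUTOR)) MB for concurrent task input"
```

The resulting -Xmn value would then be added to spark.executor.extraJavaOptions alongside the GC logging flags.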
1.2 Memory‑Constrained Scenario
To simulate memory pressure, a skewed key is introduced in Job 3.
(1) Original submission script
<code>nohup /app/hadoop/spark-2.4.3/bin/spark-submit --master spark://hadoop1:7077 --executor-memory 7g --driver-memory 4g --total-executor-cores 3 --conf spark.default.parallelism=9 --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" --class com.demo.Test_GC /app/hadoop/Spark/spark-test-1.0.0-SNAPSHOT.jar > test.log &</code>
Result: the default PS (Parallel Scavenge) + PO (Parallel Old) collector triggers many full GCs; the job fails after task retries.
(2) Increase spark.memory.fraction to 0.8
<code>nohup /app/hadoop/spark-2.4.3/bin/spark-submit ... --conf spark.memory.fraction=0.8 ...</code>
Result: storage memory is larger, but full GCs still occur and the job fails.
(3) Decrease spark.memory.fraction to 0.5
<code>nohup /app/hadoop/spark-2.4.3/bin/spark-submit ... --conf spark.memory.fraction=0.5 ...</code>
Result: fewer full GC events; the job eventually succeeds.
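The effect of spark.memory.fraction can be estimated directly: Spark's unified region (execution + storage) is roughly (heap − 300 MB reserved) × fraction. A sketch for the 7 GB executor used here, approximating the heap as the full 7 GB:

```shell
#!/bin/sh
# Unified (execution + storage) memory under spark.memory.fraction.
# Spark reserves ~300 MB of the heap before applying the fraction.
HEAP_MB=7168            # --executor-memory 7g
RESERVED_MB=300
usable=$((HEAP_MB - RESERVED_MB))
echo "fraction=0.6 (default): $((usable * 60 / 100)) MB unified"
echo "fraction=0.8:           $((usable * 80 / 100)) MB unified"
echo "fraction=0.5:           $((usable * 50 / 100)) MB unified"
```

Lowering the fraction hands the difference back to user memory, so fewer long-lived cached blocks accumulate in the old generation, which is consistent with the drop in full GCs observed above.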
(4) Increase -XX:NewRatio to 3
<code>nohup /app/hadoop/spark-2.4.3/bin/spark-submit ... -XX:NewRatio=3 ...</code>
Result: full GCs are less frequent; the job succeeds.
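-XX:NewRatio=N sets old:young = N:1, so NewRatio=3 gives the old generation 3/4 of the heap. For the 7 GB executor heap (illustrative arithmetic, ignoring JVM rounding):

```shell
#!/bin/sh
# Generation split for -XX:NewRatio=3 (old : young = 3 : 1).
HEAP_MB=7168
NEW_RATIO=3
young=$((HEAP_MB / (NEW_RATIO + 1)))
old=$((HEAP_MB - young))
echo "young gen: ${young} MB"   # 1792 MB
echo "old gen:   ${old} MB"     # 5376 MB
```

Compared with the default NewRatio=2 (old gen ≈ 4779 MB here), the old generation gains roughly 600 MB, leaving more room for long-lived cached data before a full GC triggers.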
(5) Switch to CMS collector
<code>nohup /app/hadoop/spark-2.4.3/bin/spark-submit ... -XX:+UseConcMarkSweepGC -XX:+UseParNewGC ...</code>
Result: more storage memory, fewer full GCs; the job succeeds.
(6) Switch to G1 collector
<code>nohup /app/hadoop/spark-2.4.3/bin/spark-submit ... -XX:+UseG1GC ...</code>
Result: GC time drops markedly; the job succeeds.
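If G1 needs further tuning beyond -XX:+UseG1GC, a pause-time target, region size, and concurrent-marking threshold can be added to the same extraJavaOptions string. A hedged sketch based on the original submission script; the three G1 values are illustrative starting points, not the settings used in this experiment:

```shell
# Illustrative G1 options (the G1 flag values are assumptions):
nohup /app/hadoop/spark-2.4.3/bin/spark-submit \
  --master spark://hadoop1:7077 \
  --executor-memory 7g --driver-memory 4g --total-executor-cores 3 \
  --conf spark.default.parallelism=9 \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails \
          -XX:+PrintGCTimeStamps -XX:+UseG1GC -XX:MaxGCPauseMillis=200 \
          -XX:G1HeapRegionSize=16m -XX:InitiatingHeapOccupancyPercent=35" \
  --class com.demo.Test_GC \
  /app/hadoop/Spark/spark-test-1.0.0-SNAPSHOT.jar > test.log &
```

-XX:MaxGCPauseMillis trades throughput for shorter pauses, and lowering -XX:InitiatingHeapOccupancyPercent starts concurrent marking earlier, which helps heaps that fill with cached RDD blocks.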
1.3 Normal‑Memory Scenario
Without data skew, the same job is run with executor memory of 20 GB, 4 GB, and 2 GB under the PS+PO, CMS, and G1 collectors.
(1) 20 G memory
Overall execution times are similar; G1 shows lower total GC time but higher CPU usage, so its advantage is limited at 20 GB.
(2) 4 G memory
PS+PO and CMS outperform G1.
(3) 2 G memory
PS+PO shows the most stable total runtime, better than CMS and G1.
1.4 Summary
(1) In memory‑tight cases, frequent Full GC can be mitigated by:
Reducing spark.memory.fraction.
Increasing -XX:NewRatio to allocate more old‑gen space.
Using CMS or G1 collectors.
(2) In normal‑memory cases, choose collector based on allocated memory:
≤20 GB: the default PS+PO is generally best.
Large-scale, multi-core jobs (tens to hundreds of GB of memory): consider G1.
CMS offers a middle ground.
(3) Bigger memory allocation is not always beneficial; estimate required memory by caching an RDD and checking Spark’s Storage UI.
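One way to do that estimate is to cache a representative RDD from spark-shell, force it with an action, and read its in-memory size from the Storage tab of the Spark UI. A minimal sketch that assumes a running cluster; the input path is a placeholder, not a real dataset from this article:

```shell
# Requires the cluster from this article to be running; the HDFS path is hypothetical.
/app/hadoop/spark-2.4.3/bin/spark-shell --master spark://hadoop1:7077 <<'EOF'
val rdd = sc.textFile("hdfs:///data/sample_input")  // hypothetical input path
rdd.cache()
rdd.count()   // an action forces the cache to materialize
// Now open the Spark UI "Storage" tab and read "Size in Memory" for this RDD.
EOF
```

Sizing executors to the measured footprint, rather than simply maximizing memory, avoids the long full-GC pauses that very large heaps can introduce.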
Optimization Recommendations
1. Treat Spark tuning as a lifecycle process.
2. Prioritize code optimization, resource allocation, data skew handling, then GC tuning; early GC tweaks may be ineffective.
3. Excessive RDD creation/destruction can cause GC pressure; also watch for competition between execution memory and cached RDDs.
4. For code and resource tuning, see “Spark Memory Model and Optimization”.
Reference
Apache Spark tuning: https://www.iteblog.com/archives/2494.html
Data Thinking Notes
Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.