Mastering HotSpot CMS GC: Common Scenarios, Root Causes, and Optimization Strategies
This comprehensive guide explains the fundamentals of HotSpot CMS garbage collection, identifies nine typical GC problem scenarios, analyzes their root causes, and provides practical tuning strategies, code examples, and diagnostic tools to help Java engineers optimize performance and avoid costly pauses.
1. Introduction
This article focuses on the "CMS + ParNew" combination in HotSpot VM, summarizing usage scenarios, source code analysis, and troubleshooting methods. It assumes a certain level of expertise and may contain many technical terms.
1.1 Scope
The article is organized into four major steps: knowledge building, evaluation criteria, scenario analysis, and summary.
2. GC Basics
2.1 Basic Concepts
GC: Three meanings – Garbage Collection (noun), Garbage Collector (noun), Garbage Collecting (verb).
Mutator: The application threads, which allocate objects and modify references, and so create garbage.
TLAB: Thread‑Local Allocation Buffer, a per‑thread allocation area in Eden that avoids lock contention on allocation.
Card Table: A data structure that records the state of memory cards (clean or dirty) so that cross‑generation references can be found without scanning the whole Old generation.
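The card table idea can be illustrated with a toy model. This is a minimal sketch, not HotSpot's implementation: the class name CardTableSketch, the CLEAN/DIRTY values, and the offset-based addressing are assumptions made for illustration; only the 512-byte card size mirrors HotSpot's default.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy card table: one byte per 512-byte card of a simulated Old generation. */
final class CardTableSketch {
    static final int CARD_SHIFT = 9;         // 2^9 = 512-byte cards
    static final byte CLEAN = 0;             // illustrative values, not HotSpot's encoding
    static final byte DIRTY = 1;

    private final byte[] cards;

    CardTableSketch(long oldGenBytes) {
        long cardSize = 1L << CARD_SHIFT;
        cards = new byte[(int) ((oldGenBytes + cardSize - 1) >>> CARD_SHIFT)];
    }

    /** Write barrier: called after a reference field at this Old-gen offset is updated. */
    void markDirty(long fieldOffset) {
        cards[(int) (fieldOffset >>> CARD_SHIFT)] = DIRTY;
    }

    /** Young GC scans only the dirty cards for Old-to-Young references. */
    List<Integer> dirtyCards() {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < cards.length; i++) {
            if (cards[i] == DIRTY) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        CardTableSketch table = new CardTableSketch(4096);
        table.markDirty(100);    // falls in card 0
        table.markDirty(1030);   // falls in card 2
        System.out.println(table.dirtyCards());  // prints [0, 2]
    }
}
```

The point of the design is that a Young GC visits a few dirty cards instead of the whole Old generation, at the cost of one extra store per reference write.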
2.2 JVM Memory Layout
The Java heap consists of the Young and Old generations. Metaspace holds class metadata in native memory, outside the heap. Direct (off‑heap) memory is released via Cleaner.
2.3 Object Allocation
Free list allocation – uses extra storage to record the addresses of free chunks.
Bump pointer allocation – moves a pointer forward by the object size; fast, but it needs a contiguous free region.
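Bump pointer allocation fits in a few lines. The sketch below is illustrative only: BumpPointerSketch and its fields are hypothetical names, addresses are plain long values rather than real memory, and 8-byte alignment is assumed.

```java
/** Toy bump-pointer allocator over a simulated contiguous region. */
final class BumpPointerSketch {
    private final long end;   // first address past the region
    private long top;         // next free address

    BumpPointerSketch(long start, long size) {
        this.top = start;
        this.end = start + size;
    }

    /** Returns the start address of the new object, or -1 if the region is exhausted. */
    long allocate(long bytes) {
        long aligned = (bytes + 7) & ~7L;   // round up to 8-byte alignment
        if (end - top < aligned) {
            return -1;                      // a real VM would refill the TLAB or trigger GC here
        }
        long result = top;
        top += aligned;                     // the "bump"
        return result;
    }

    public static void main(String[] args) {
        BumpPointerSketch region = new BumpPointerSketch(0, 32);
        System.out.println(region.allocate(10));  // prints 0
        System.out.println(region.allocate(8));   // prints 16
        System.out.println(region.allocate(16));  // prints -1, only 8 bytes remain
    }
}
```

Allocation is a compare and an add, which is why it is fast; the price is that freed space in the middle of the region cannot be reused without compaction or copying.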
2.4 Collection Algorithms
Mark‑Sweep: Two phases – marking reachable objects, then sweeping dead ones.
Mark‑Compact: After marking, live objects are compacted to eliminate fragmentation.
Copying: The space is split into two halves; live objects are copied from one half to the other.
Time complexity depends on the size of the live objects L and the heap capacity H: marking and copying scale with L, while sweeping and compaction scale with H.
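A toy Mark‑Sweep over an in-memory object graph shows where the two cost terms come from, taking L as the live set and H as the whole heap. This is a sketch with hypothetical names (MarkSweepSketch, Obj); a real collector works on raw memory and mark bits, not Java lists.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

final class MarkSweepSketch {
    static final class Obj {
        final String name;
        final List<Obj> refs = new ArrayList<>();
        boolean marked;

        Obj(String name) {
            this.name = name;
        }
    }

    /** Mark phase: visits only reachable objects, so its cost grows with the live set L. */
    static void mark(Collection<Obj> roots) {
        Deque<Obj> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Obj o = pending.pop();
            if (o.marked) {
                continue;            // already visited, also terminates cycles
            }
            o.marked = true;
            pending.addAll(o.refs);
        }
    }

    /** Sweep phase: walks every object in the heap, so its cost grows with the capacity H. */
    static List<String> sweep(List<Obj> heap) {
        List<String> freed = new ArrayList<>();
        for (Iterator<Obj> it = heap.iterator(); it.hasNext();) {
            Obj o = it.next();
            if (o.marked) {
                o.marked = false;    // reset for the next cycle
            } else {
                freed.add(o.name);
                it.remove();
            }
        }
        return freed;
    }
}
```

For example, with a root a referencing b, and an unreachable cycle between c and d, mark visits two objects and sweep frees c and d after walking all four.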
2.5 Collectors
Generational collectors (e.g., ParNew, CMS) and region‑based collectors (e.g., G1, ZGC, Shenandoah); G1 is region‑based but also generational.
CMS and G1 are the most widely used in production.
2.6 Common Tools
Command‑line: jps, jinfo, jstat, jstack, jmap, jcmd (JDK tools), plus the third‑party vjtools, arthas, and greys.
GUI: JConsole, VisualVM, JProfiler, MAT.
For DirectByteBuffer, System.gc() is often used to trigger reclamation.
3. How to Determine Whether a GC Issue Exists
Key indicators:
Latency (max pause time) should be below the service's TP9999.
Throughput (percentage of time spent in mutator code) should be ≥ 99.99%.
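The throughput indicator can be approximated from inside the JVM with the standard java.lang.management API. This is a rough sketch: the class and method names are made up for illustration, and the accumulated collection time reported by GarbageCollectorMXBean only approximates pause time (for some collectors it may include concurrent work), so GC logs remain the more reliable source.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcThroughputSketch {
    /** Percentage of wall-clock time not attributed to GC. */
    static double throughputPercent(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 100.0;
        }
        return 100.0 * (uptimeMillis - gcMillis) / uptimeMillis;
    }

    /** Sums the accumulated collection time of all collectors in this JVM. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();   // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("throughput = %.4f%%%n", throughputPercent(totalGcMillis(), uptime));
    }
}
```

As a worked figure, 10 ms of GC in 100 s of uptime gives exactly 99.99%, which is the boundary named above.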
Four analysis methods are recommended:
Temporal analysis – identify which metric deviates first.
Probability analysis – use historical data to infer likely causes.
Experiment analysis – reproduce the issue in a controlled environment.
Counter‑factual analysis – verify whether removing a symptom changes the outcome.
4. Common Scenarios and Solutions
4.1 Scenario 1 – Space Shock During Dynamic Expansion
Symptom: Frequent GC at startup despite plenty of free space; the GC cause is Allocation Failure.
Root Cause: -Xms is smaller than -Xmx, so the heap starts small and expands on demand, and each expansion is preceded by a GC.
Solution: Set -Xms and -Xmx to the same value; otherwise tune -XX:MinHeapFreeRatio and -XX:MaxHeapFreeRatio to control the expansion and shrink thresholds.
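The effect of the two free-ratio thresholds can be sketched as a simple decision after each GC. This is a simplified model, not the HotSpot code: HeapResizeSketch and afterGc are hypothetical names, the -Xmx cap and any shrink damping are ignored, and 40 and 70 are the usual defaults of -XX:MinHeapFreeRatio and -XX:MaxHeapFreeRatio.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

final class HeapResizeSketch {
    enum Decision { EXPAND, KEEP, SHRINK }

    /** Decides whether the committed heap should grow or shrink, based on the free percentage after GC. */
    static Decision afterGc(long usedBytes, long committedBytes, int minFreeRatio, int maxFreeRatio) {
        double freePercent = 100.0 * (committedBytes - usedBytes) / committedBytes;
        if (freePercent < minFreeRatio) {
            return Decision.EXPAND;   // too little headroom left
        }
        if (freePercent > maxFreeRatio) {
            return Decision.SHRINK;   // too much memory committed for the live data
        }
        return Decision.KEEP;
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // init roughly reflects -Xms and max roughly reflects -Xmx; either may be -1 if undefined.
        System.out.println("init=" + heap.getInit()
                + " committed=" + heap.getCommitted()
                + " max=" + heap.getMax());
        System.out.println(afterGc(heap.getUsed(), heap.getCommitted(), 40, 70));
    }
}
```

When init and max print the same value, the heap cannot resize and this scenario does not apply.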
4.2 Scenario 2 – Explicit GC (System.gc())
Symptom: GC occurs without obvious memory pressure; the GC cause is System.gc().
Root Cause: By default, System.gc() forces a full stop‑the‑world GC, which can be costly.
Solution: Keep System.gc() when needed (e.g., to release DirectByteBuffer), but consider adding -XX:+ExplicitGCInvokesConcurrent or -XX:+ExplicitGCInvokesConcurrentAndUnloadsClasses to make it concurrent and avoid long pauses.
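Before changing these flags it helps to confirm what the running JVM actually uses. A minimal sketch using the HotSpot-specific com.sun.management.HotSpotDiagnosticMXBean follows; the class and method names are illustrative, and on a non-HotSpot JVM, or for a flag the JVM does not know, the helper simply reports "unavailable".

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

final class ExplicitGcFlagsSketch {
    /** Returns the current value of a HotSpot flag, or "unavailable" if this JVM does not expose it. */
    static String flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean == null ? "unavailable" : bean.getVMOption(name).getValue();
        } catch (RuntimeException e) {
            // Unknown flag or a JVM without this MXBean.
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        for (String flag : new String[] {"DisableExplicitGC", "ExplicitGCInvokesConcurrent"}) {
            System.out.println(flag + " = " + flagValue(flag));
        }
    }
}
```

The same values can be read from outside the process with jinfo -flag, which is often more convenient in production.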
4.3 Scenario 3 – Metaspace OOM
Symptom: Metaspace usage grows continuously and GC cannot reclaim it.
Root Cause: Classes keep being loaded (often via dynamic class loading) and are retained by their class loaders.
Solution: Fix class‑loader leaks, monitor class‑loading metrics, and optionally set fixed -XX:MetaspaceSize and -XX:MaxMetaspaceSize values.
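Class-loading metrics are available through the standard management beans. The sketch below is one possible way to sample them; MetaspaceWatchSketch is a hypothetical name, and the pool name "Metaspace" is what HotSpot 8 and later report, so the lookup returns -1 on JVMs that name it differently.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

final class MetaspaceWatchSketch {
    /** Bytes currently used in the Metaspace pool, or -1 if no such pool is reported. */
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1 : usage.getUsed();
            }
        }
        return -1;
    }

    /** Number of classes currently loaded. */
    static int loadedClasses() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    public static void main(String[] args) {
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        System.out.println("metaspace used = " + metaspaceUsedBytes() + " bytes");
        System.out.println("loaded = " + cl.getLoadedClassCount()
                + ", total loaded = " + cl.getTotalLoadedClassCount()
                + ", unloaded = " + cl.getUnloadedClassCount());
    }
}
```

Sampled periodically, a loaded-class count that keeps rising while the unloaded count stays flat points toward a class-loader leak rather than an undersized Metaspace.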
4.4 Scenario 4 – Premature Promotion
Symptom: Objects are promoted to the Old generation too early, so the Old generation fills quickly and Old GC runs frequently, even though most promoted objects die soon afterwards.
Root Cause: A small Young/Eden space or a high allocation rate makes Young GC frequent, so objects reach the tenuring threshold, or overflow the Survivor space, after only a few cycles.
Solution: Increase the Young generation size (e.g., adjust -Xmn or -XX:NewRatio), and tune -XX:MaxTenuringThreshold to delay promotion.
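-XX:MaxTenuringThreshold is only an upper bound: HotSpot also lowers the effective threshold when the Survivor space fills up. The sketch below models that age-table calculation as I understand it, under stated assumptions: the names are hypothetical, TargetSurvivorRatio is passed in explicitly (its default is 50), and the age table has 16 entries.

```java
final class TenuringThresholdSketch {
    static final int TABLE_SIZE = 16;   // ages 1..15 are tracked

    /**
     * @param bytesByAge bytesByAge[i] is the total size of surviving objects of age i (index 0 unused)
     * @return the age at which objects will be promoted at the next Young GC
     */
    static int computeThreshold(long[] bytesByAge, long survivorCapacity,
                                int targetSurvivorRatio, int maxTenuringThreshold) {
        long desired = survivorCapacity * targetSurvivorRatio / 100;
        long total = 0;
        int age = 1;
        while (age < TABLE_SIZE) {
            total += age < bytesByAge.length ? bytesByAge[age] : 0;
            if (total > desired) {
                break;   // survivors up to this age no longer fit the target
            }
            age++;
        }
        return Math.min(age, maxTenuringThreshold);
    }

    public static void main(String[] args) {
        // 10 MB Survivor space, 50% target: 3 MB at age 1 plus 3 MB at age 2 exceeds 5 MB.
        long mb = 1024L * 1024L;
        long[] ages = {0, 3 * mb, 3 * mb, 1 * mb};
        System.out.println(computeThreshold(ages, 10 * mb, 50, 6));  // prints 2
    }
}
```

In the example, objects are promoted at age 2 even though the configured maximum is 6, which is why enlarging the Young generation (and with it the Survivor spaces) usually matters more than raising the maximum alone.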
4.5 Scenario 5 – Frequent CMS Old GC
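Symptom: CMS runs often, reducing throughput.
Root Cause: The CMS background thread polls shouldConcurrentCollect() and starts a cycle based on Old generation occupancy, allocation failures, or time intervals.
Solution: First find out what is filling the Old generation, for example by comparing heap dumps taken before and after a cycle. Check that -XX:CMSInitiatingOccupancyFraction is not set too low, since a lower value starts cycles earlier and more often, and add -XX:+UseCMSInitiatingOccupancyOnly so that the JVM does not start cycles earlier based on its own heuristics. Alternatively, switch to a collector with lower pause impact (e.g., G1, ZGC).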
4.6 Scenario 6 – Long CMS Old GC Pauses
Symptom: Individual CMS pauses exceed 1 s, sometimes up to 8 s.
Root Cause: The stop‑the‑world phases (Initial Mark and Final Remark) take long due to reference processing, class unloading, or large survivor sets.
Solution: Enable parallel reference processing (-XX:+ParallelRefProcEnabled), disable class unloading (-XX:-CMSClassUnloadingEnabled) if it is not needed, and use -XX:+PrintReferenceGC to see which reference types dominate the pause.
4.7 Scenario 7 – Memory Fragmentation & Collector Degradation
Symptom: CMS degrades to a single‑threaded Full GC (MSC, Mark‑Sweep‑Compact) with very long pauses.
Root Causes: Promotion failures due to fragmentation, incremental‑collection guarantee failures, or concurrent‑mode failures.
Solutions: Enable compaction at full GC with -XX:+UseCMSCompactAtFullCollection and control its frequency with -XX:CMSFullGCsBeforeCompaction=n. Reduce -XX:CMSInitiatingOccupancyFraction and use -XX:+UseCMSInitiatingOccupancyOnly to trigger earlier collections. Use -XX:+CMSScavengeBeforeRemark to perform a Young GC before the remark phase.
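Why fragmentation causes a promotion failure even when plenty of memory is free can be shown with a toy free list. This is an illustration only: FragmentationSketch is a hypothetical name, chunk sizes are arbitrary numbers, and a first-fit search stands in for CMS's real free-list allocator.

```java
final class FragmentationSketch {
    /** First-fit search over free chunk sizes; returns the chunk index, or -1 if no chunk is large enough. */
    static int firstFit(long[] freeChunks, long request) {
        for (int i = 0; i < freeChunks.length; i++) {
            if (freeChunks[i] >= request) {
                return i;
            }
        }
        return -1;   // in CMS terms: promotion failure
    }

    static long totalFree(long[] freeChunks) {
        long sum = 0;
        for (long chunk : freeChunks) {
            sum += chunk;
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] fragmented = {64, 128, 96, 32};
        long[] compacted = {totalFree(fragmented)};      // one 320-unit chunk after compaction
        System.out.println(totalFree(fragmented));        // prints 320
        System.out.println(firstFit(fragmented, 200));    // prints -1
        System.out.println(firstFit(compacted, 200));     // prints 0
    }
}
```

The same 320 units of free space fail a 200-unit request when split into four chunks and satisfy it after compaction, which is the trade-off behind the compaction flags above: compaction costs pause time but restores large contiguous chunks.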
4.8 Scenario 8 – Off‑Heap (Direct) Memory OOM
Symptom: Process RES exceeds -Xmx, swap usage rises, and GC times increase.
Root Causes: Unreleased DirectByteBuffer or Unsafe.allocateMemory allocations. Native code (via JNI) that allocates memory without freeing it.
Diagnostic Steps: Enable Native Memory Tracking with -XX:NativeMemoryTracking=detail and run jcmd PID VM.native_memory detail to see off‑heap usage. Monitor java.nio.Bits.totalCapacity (NIO) or io.netty.util.internal.PlatformDependent.DIRECT_MEMORY_COUNTER (Netty). Use gperftools or BTrace to locate native allocations.
Solutions: Ensure proper release of DirectByteBuffers, avoid -XX:+DisableExplicitGC if you rely on System.gc() for cleanup, and fix native leaks.
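For the NIO part of the picture, the standard BufferPoolMXBean exposes the same direct-buffer accounting without reflection. A minimal sketch with an illustrative class name follows; note that it only covers buffers allocated through java.nio, not memory obtained via Unsafe, JNI, or allocators that bypass the NIO counters.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

final class DirectMemoryWatchSketch {
    /** Bytes currently held by NIO direct buffers, or -1 if the "direct" pool is not reported. */
    static long directMemoryUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directMemoryUsed();
        ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20);   // 1 MiB off-heap
        long after = directMemoryUsed();
        System.out.println("direct memory grew by " + (after - before) + " bytes");
        System.out.println("buffer capacity = " + buffer.capacity());  // keeps the buffer reachable
    }
}
```

If this figure stays small while the process RES keeps growing, the leak is more likely in native code, and Native Memory Tracking or gperftools is the better next step.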
4.9 Scenario 9 – JNI‑Induced GC Stalls
Symptom: GC logs show GCLocker Initiated GC and pauses.
Root Cause: Native code accesses Java objects via GetPrimitiveArrayCritical or similar, which blocks GC until the array is released.
Solution: Minimize the duration of critical sections, add -XX:+PrintJNIGCStalls to identify offending threads, and consider upgrading to JDK 14 (fixes JDK‑8048556).
5. Summary and Best Practices
5.1 SOP: Establish GC standards, retain crash dumps, perform causal analysis, and use the 5‑Why method to pinpoint root causes.
5.2 Root‑Cause Fishbone: Visual diagram (omitted) helps eliminate unrelated factors.
5.3 Tuning Advice: Balance latency, throughput, and capacity; use control‑variable experiments; prefer proven solutions over blind parameter changes.
5.4 Common Pitfalls: Over‑reliance on System.gc(); leaving biased locking enabled under heavy lock contention, where bias revocation adds safepoint pauses (consider -XX:-UseBiasedLocking); and forgetting -XX:+AlwaysPreTouch for large heaps.
6. References
Garbage Collection Algorithms and Implementations – Nakamura & Aikawa.
The Garbage Collection Handbook – Jones, Hosking, Moss.
Deep Dive into the JVM – Zhou Zhimin.
HotSpot GC Tuning Guide.
Shipilev’s One‑Page Blog.
OpenJDK 15 project.
Java Community Process (JCP).
A Generational Mostly‑concurrent Garbage Collector – Printezis & Detlefs.
Java Memory Management White Paper.
Stuff Happens: Understanding Causation in Policy and Strategy – AA Hill.
Sanyou's Java Diary
Passionate about technology, though not great at solving problems; eager to share, never tire of learning!