How to Diagnose Frequent Full GC in Java Interviews
This article explains the root‑cause analysis and step‑by‑step troubleshooting process for frequent Full GC events in Java applications, covering trigger mechanisms, impact assessment, common causes, monitoring tools, heap‑dump analysis, and both short‑term fixes and long‑term architectural improvements.
Full GC Overview
Full GC is triggered when the Old Generation or Metaspace runs out of space. The garbage collector (e.g., G1, CMS, Serial Old) performs a full-heap collection, stops all application threads (stop-the-world, STW), consumes significant CPU and memory, and can cause latency spikes or service crashes.
Typical Trigger Scenarios
Old Generation Exhaustion: Large objects allocated directly into Old Gen, or rapid promotion of survivor objects, fill Old Gen beyond ~90% (a minimal reproduction appears after this list).
Metaspace Overflow: Excessive dynamic class generation (e.g., Groovy scripts, CGLIB proxies) exceeds the Metaspace limit.
Explicit System.gc(): Full GC calls initiated by application code or frameworks.
GC Algorithm Failures: CMS Concurrent Mode Failure, or a G1 evacuation failure (to-space exhausted) that forces a fallback Full GC.
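The Old Generation exhaustion scenario is easy to reproduce locally. Below is a minimal sketch (the class name and heap sizes are illustrative, not from this article): run it with a deliberately small heap, e.g. java -Xms64m -Xmx64m -Xmn16m -XX:+UseParallelGC -verbose:gc FullGcDemo, and the log fills with Full GC entries as the retained allocations saturate Old Gen.

import java.util.ArrayList;
import java.util.List;

public class FullGcDemo {
    // Strong references keep every allocation reachable, so no Full GC can reclaim them.
    private static final List<byte[]> RETAINED = new ArrayList<>();

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            // 1 MB arrays overflow the small young generation and are promoted into
            // Old Gen almost immediately; the run eventually ends in
            // java.lang.OutOfMemoryError: Java heap space.
            RETAINED.add(new byte[1024 * 1024]);
            Thread.sleep(10); // slow the loop so the GC log stays readable
        }
    }
}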
Impact of Frequent Full GC
Service availability drops as STW pauses accumulate.
CPU consumption creates a feedback loop: during long pauses requests back up, the backlog produces a burst of new objects once threads resume, and the extra allocation pressure drives GC frequency even higher.
In distributed systems a single node’s Full GC can cascade into a cluster‑wide outage (death spiral).
Root‑Cause Classification
Local cache over-allocation: Old Gen >80% occupied by an unbounded ConcurrentHashMap or @Cacheable entries. Remediation: Replace with Caffeine/Guava (TTL, size bound) or externalize to Redis; shard large keys.
Message bloat: Kafka messages >512 KB create large temporary objects. Remediation: Send only IDs, enable Snappy/LZ4 compression, split large payloads.
Database query explosion: Unpaginated SELECT * returns multi-megabyte result sets. Remediation: Enforce pagination, use cursor streaming, select only required columns.
ThreadLocal leakage: ThreadLocal values persist because thread-pool threads are reused. Remediation: Always call remove() in a finally block (see the sketch after this list), or use TransmittableThreadLocal where context must cross pooled threads.
Reflection/ASM abuse: Massive dynamic class generation fills Metaspace. Remediation: Cache reflective Method/Constructor objects, limit class-loader creation, close GroovyClassLoader instances after use.
Improper JVM parameters: Undersized young generation, overly aggressive G1 pause targets, etc. Remediation: Tune -XX:NewRatio, -XX:SurvivorRatio, -XX:MaxGCPauseMillis, -XX:InitiatingHeapOccupancyPercent.
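To make the ThreadLocal case concrete, here is a minimal sketch assuming a pooled executor (the names CONTEXT and handle are illustrative): because the worker thread outlives the task, a value that is never removed stays reachable for the whole lifetime of the reused thread.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalCleanup {
    private static final ThreadLocal<byte[]> CONTEXT = new ThreadLocal<>();

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        pool.submit(() -> {
            CONTEXT.set(new byte[10 * 1024 * 1024]); // 10 MB of per-request state
            try {
                handle();
            } finally {
                CONTEXT.remove(); // without this line, the 10 MB stays attached to the pooled thread
            }
        });
        pool.shutdown();
    }

    private static void handle() {
        // ... business logic that reads CONTEXT.get() ...
    }
}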
Four‑Step Investigation Process
Data Collection: Enable detailed GC logging (JDK 8: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/jvm/gc.log; JDK 9+: -Xlog:gc*:file=/var/log/jvm/gc.log) and configure automatic heap dumps with -XX:+HeapDumpOnOutOfMemoryError, or capture one on demand with jmap -dump:live,format=b,file=heap.hprof <pid>.
GC Log Analysis: Identify the trigger type via keywords such as [Full GC (Ergonomics)] or [Full GC (Metadata GC Threshold)]; measure STW duration; evaluate how much memory each Full GC actually reclaims (an annotated example follows this list).
Heap Dump Inspection : Use MAT or Arthas to examine the dominator tree, locate large retained objects, trace GC roots, and analyze class‑loader statistics for Metaspace issues.
Root‑Cause Validation : Correlate findings with code (e.g., static cache size, missing ThreadLocal cleanup), reproduce the scenario, and verify that the remedial action reduces Full GC frequency.
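To illustrate step 2, here is the shape of a JDK 8 ParallelGC Full GC log line (the numbers are invented for the example):

[Full GC (Ergonomics) [PSYoungGen: 10240K->0K(76288K)] [ParOldGen: 169472K->169100K(175104K)] 179712K->169100K(251392K), [Metaspace: 20800K->20800K(1067008K)], 0.5210041 secs]

The reading that matters: Old Gen drops from 169472K to only 169100K against a 175104K capacity, meaning the half-second pause reclaimed almost nothing. That points at live retained data (a cache or a leak) rather than garbage, and sends the investigation on to step 3.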
Tool Recommendations
GC log analysis: GCeasy, jstat, Prometheus + Grafana.
Heap dump analysis: MAT (Dominator Tree, Leak Suspects) or Arthas (heapdump command).
Online diagnostics in containers: jstat -gc <pid> 1s, jmap -dump:live,file=/tmp/heap.hprof <pid>, sidecar containers for log collection (a sample jstat reading follows this list).
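For quick triage with jstat -gc, the columns that matter most are OC/OU (Old Gen capacity and usage, in KB) and FGC/FGCT (Full GC count and cumulative seconds). An illustrative reading (values invented for the example):

jstat -gc 12345 1s
 ...  OC        OU        ...  FGC   FGCT
 ...  175104.0  169100.5  ...  48    25.31

OU pinned near OC while FGC climbs on every sample means each Full GC is reclaiming almost nothing, which is exactly the pattern that warrants taking a heap dump.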
Emergency Fixes (Minutes‑to‑Hours)
Adjust JVM flags: increase the young generation, raise Metaspace limits, or set -XX:PretenureSizeThreshold so large objects go straight to Old Gen without churning the survivor spaces (note: this flag is honored only by the Serial and ParNew young collectors).
Temporarily clear static caches via admin endpoints or restart the service.
Apply rate limiting or circuit breaking to reduce object-creation bursts (a minimal throttling sketch follows this list).
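For the rate-limiting stopgap, a minimal sketch assuming Guava (com.google.guava:guava) is on the classpath; the endpoint and the permit budget are illustrative:

import com.google.common.util.concurrent.RateLimiter;

public class ExportThrottle {
    // Allow at most 50 heavy export requests per second on this instance.
    private static final RateLimiter LIMITER = RateLimiter.create(50.0);

    public byte[] exportReport(String reportId) {
        if (!LIMITER.tryAcquire()) {
            // Shed load instead of building multi-megabyte result sets under GC pressure.
            throw new IllegalStateException("export temporarily throttled, retry later");
        }
        return doExport(reportId);
    }

    private byte[] doExport(String reportId) {
        return new byte[0]; // placeholder for the real report generation
    }
}

Shedding requests outright is crude, but it caps allocation pressure within minutes while the underlying cause is fixed.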
Short‑Term Optimizations (1‑3 Days)
Replace raw ConcurrentHashMap caches with Caffeine/Guava (TTL, maximumSize); see the sketch after this list.
Externalize large caches to Redis or Memcached.
Introduce pagination or cursor‑based streaming for bulk DB queries.
Compress large messages (Snappy/LZ4) and trim payloads.
Ensure proper ThreadLocal cleanup (remove() in try-finally, or TransmittableThreadLocal).
Cache reflective Method or MethodHandle objects instead of performing the lookup on every call.
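For the first item, a minimal sketch assuming Caffeine (com.github.ben-manes.caffeine:caffeine) is on the classpath; the key type, TTL, and size bound are illustrative:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;

public class UserCache {
    private final Cache<Long, String> cache = Caffeine.newBuilder()
            .maximumSize(10_000)                      // hard cap on entry count
            .expireAfterWrite(Duration.ofMinutes(10)) // TTL so stale entries are evicted
            .build();

    public String getUser(long id) {
        // Load on miss; evicted entries are simply reloaded on the next access.
        return cache.get(id, key -> loadFromDb(key));
    }

    private String loadFromDb(long id) {
        return "user-" + id; // placeholder for the real lookup
    }
}

Unlike a raw ConcurrentHashMap, the bounded, expiring cache gives Old Gen a predictable ceiling: evicted entries become garbage promptly instead of accumulating until a Full GC.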
Mid‑Term Optimizations (Weeks‑Months)
Capacity planning and load testing to size heap, young generation, and Metaspace.
Adopt multi‑level caching (local + distributed) with sharding for large keys.
Migrate batch jobs to streaming frameworks (Flink, Spark Streaming) to avoid full‑dataset loading.
Standardize monitoring dashboards (Full GC frequency, Old Gen usage, request latency) and alert thresholds.
Automate heap‑dump collection on alert via sidecar or APM integration (SkyWalking, Pinpoint).
Kubernetes‑Specific Heap Dump Procedure
Enter the container: kubectl exec -it <pod> -- /bin/bash.
Run the heap dump against the main Java process (usually PID 1): jmap -dump:live,format=b,file=/tmp/heap.hprof 1 (on JDK 9+ images that ship without jmap, jcmd 1 GC.heap_dump /tmp/heap.hprof works as well; note that the live option itself triggers a Full GC).
Copy the dump to the host: kubectl cp <namespace>/<pod>:/tmp/heap.hprof ./heap.hprof.
Analyze the dump locally with MAT to avoid consuming container resources.
Conclusion
Frequent Full GC is a symptom of mismatched resource usage and application design rather than a JVM bug. By following a systematic observation‑analysis‑verification loop—collecting high‑quality GC logs and heap dumps, analyzing them with the right tools, and validating hypotheses against the code—engineers can pinpoint the exact cause (e.g., oversized static cache, uncontrolled message size, unpaginated DB access, ThreadLocal leakage, Metaspace bloat) and apply targeted fixes that restore performance and prevent future incidents.