Understanding and Solving GC Spikes in High‑Throughput Java Services
This article explains what a GC spike (Garbage Collection spike) is, analyzes its typical causes (large short‑lived objects, memory leaks, heap misconfiguration), presents a real‑world high‑concurrency case study, and details the step‑by‑step JVM tuning and architectural strategies that eliminated latency spikes and raised service availability from 95% to over 99.99%.
GC Spike Definition
A GC spike (sometimes called a "GC glitch" or "burr" in Chinese‑language sources) is a sudden, sharp increase in response time or CPU usage caused by a long Stop‑The‑World (STW) pause during garbage collection.
Typical Causes
Short‑lived large objects: Frequent allocation of large arrays or collections fills the young generation, leading to frequent Minor GCs with increasing pause times.
Memory leaks: Unreleased static collections, unclosed resources, or improper ThreadLocal usage fill the old generation, triggering long Full GCs.
Massive long‑living small objects: In the case study, ~500 MB of index data consists of millions of tiny objects that stay alive for a long time, causing repeated copying in the young generation.
Heap configuration: Too small or too large heap sizes affect GC frequency and pause length.
GC parameters: Misconfigured MaxGCPauseMillis, promotion thresholds, or collector choice can exacerbate pauses.
External factors: Heavy synchronous logging, batch jobs processing huge data sets, and similar workloads increase GC pressure.
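To make the memory‑leak cause concrete, here is a minimal sketch of the classic static‑collection leak (the class and method names are my own illustration, not code from the article): entries are added to a static list but never evicted, so they stay reachable forever, accumulate in the old generation, and eventually force long Full GCs.

```java
import java.util.ArrayList;
import java.util.List;

public class LeakyCache {
    // A static collection lives as long as the class does, so anything
    // added here is never eligible for collection unless it is removed.
    static final List<byte[]> CACHE = new ArrayList<>();

    static void handleRequest(int payloadBytes) {
        // Each request parks its payload "for later", but nothing ever
        // evicts entries -- this is the leak.
        CACHE.add(new byte[payloadBytes]);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            handleRequest(1024);
        }
        // ~1 MB is now unreclaimable even though no request still needs it.
        System.out.println("retained entries: " + CACHE.size());
    }
}
```

The same pattern appears with ThreadLocal on pooled threads: the thread outlives the request, so values set on it are retained until explicitly removed.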
Impact of GC Spikes
Severe latency jitter (P99/P999 spikes) degrading user experience.
Throughput drop as CPU is consumed by GC rather than business logic.
Upstream timeouts and cascading failures in distributed systems.
Case Study: High‑Concurrency Service (System A)
System A processes >100k QPS (up to 400k during promotions) with millisecond‑level latency requirements. Every 15 minutes it performs a full‑index reload (~500 MB). During a promotion, upstream services reported intermittent TimeoutException alerts. Correlation of monitoring data showed periodic response‑time spikes matching Full GC timestamps.
Root‑Cause Analysis
The index consists of many small objects (each a few KB) allocated in the Eden space. Because they survive for a long time, they are first copied to a Survivor space and later promoted to the old generation. This double copy pushes the Object Copy phase of Young GC above 200 ms, causing STW pauses and request timeouts.
Young GC frequency is high but usually fast; a few outliers have dramatically longer pause times.
Each long pause coincides with the index copy‑and‑promote cycle.
Optimization Process
1. Promote Early
Adjust the promotion threshold so the large index jumps to the old generation after the first copy.
-Xms12g -Xmx12g
-XX:MetaspaceSize=512m
-XX:MaxMetaspaceSize=512m
-XX:+UseG1GC
-XX:G1HeapRegionSize=16M
-XX:MaxGCPauseMillis=100
-XX:InitiatingHeapOccupancyPercent=45
-XX:+HeapDumpOnOutOfMemoryError
-XX:MaxDirectMemorySize=1g

Experiments:
-XX:MaxTenuringThreshold=0 – forces promotion after the first Young GC.
-XX:InitialTenuringThreshold=1 – similar effect.
-XX:+AlwaysTenure – all surviving objects are promoted directly.
Verification showed the index flow changed from Eden → S0 → Old to a direct Eden → Old path, cutting the copy count from 2 to 1 and halving the pause.
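The effect of lowering the tenuring threshold can be captured in a toy model (my own illustration, not code from the article): a long‑lived object is copied between survivor spaces once per Young GC it survives, up to the tenuring threshold, plus one final copy into the old generation at promotion.

```java
public class CopyCount {
    // Toy model of young-generation copying: an object that eventually
    // tenures is copied once per surviving Young GC until its age reaches
    // the tenuring threshold, plus one final copy at promotion.
    static int copiesBeforePromotion(int tenuringThreshold) {
        return tenuringThreshold + 1;
    }

    public static void main(String[] args) {
        // The article's observed path Eden -> S0 -> Old (age 1 at promotion):
        System.out.println(copiesBeforePromotion(1)); // 2 copies
        // With -XX:MaxTenuringThreshold=0, Eden -> Old directly:
        System.out.println(copiesBeforePromotion(0)); // 1 copy
    }
}
```

This matches the measured result: halving the copy count roughly halved the pause.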
2. Direct Old‑Generation Allocation Attempt
Tried -XX:PretenureSizeThreshold and an increased -XX:G1HeapRegionSize so the index would be allocated directly in the old generation as a humongous object. Neither worked: G1 ignores PretenureSizeThreshold, and the index is composed of millions of tiny objects rather than a single large array, so no individual object is big enough to qualify as humongous.
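G1 only treats an allocation as humongous when a single object is at least half a region; with the article's -XX:G1HeapRegionSize=16M setting, that means 8 MB or more per object. A quick sketch of the arithmetic (class and method names are mine) shows why few‑KB index objects never qualify:

```java
public class HumongousCheck {
    // Region size matching the article's -XX:G1HeapRegionSize=16M flag.
    static final long REGION_BYTES = 16L * 1024 * 1024;

    // G1 allocates an object as humongous when it is >= half a region.
    static boolean isHumongous(long objectBytes) {
        return objectBytes >= REGION_BYTES / 2;
    }

    public static void main(String[] args) {
        long bigArray = 9L * 1024 * 1024; // a single 9 MB array
        long indexEntry = 4L * 1024;      // a typical few-KB index object
        System.out.println(isHumongous(bigArray));   // true
        System.out.println(isHumongous(indexEntry)); // false
    }
}
```

Because the 500 MB index is spread across millions of such small objects, none of them crosses the 8 MB threshold, and all of them start life in Eden.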
3. Switch to Low‑Pause Collector (ZGC)
Enabled ZGC (-XX:+UnlockExperimentalVMOptions -XX:+UseZGC; experimental on JDK 11). STW pauses essentially disappeared, raising the success rate to 99.5%. However, during the index reload, allocation briefly outpaced the concurrent collector and triggered an Allocation Stall, causing a short latency spike.
4. Gray‑Release + Active Pre‑Heat Strategy
During a controlled traffic cut‑off, the service loads the new index and then creates a large number of temporary objects to exhaust Eden, forcing an immediate Young GC that promotes the index to the old generation. After traffic resumes, the young generation is empty, so subsequent GCs are fast.
public boolean switchIndex(String indexPath) {
    try {
        // Load the new index during the traffic cut-off, then swap the reference.
        MyIndex newIndex = loadIndex(indexPath);
        this.index = newIndex;
        // Pre-heat: allocate 10,000 short-lived 1 MiB char arrays (~10 GiB of
        // transient allocation) to fill Eden and force a Young GC, which
        // promotes the freshly loaded index to the old generation before
        // traffic resumes.
        for (int i = 0; i < 10000; i++) {
            char[] tempArr = new char[524288]; // 512 Ki chars = 1 MiB
        }
        return true;
    } catch (Exception e) {
        // Don't swallow the cause silently; record it before reporting failure.
        log.error("index switch failed", e); // assumes a logger field, e.g. SLF4J
        return false;
    }
}

Results
G1GC + default: availability ≈ 95% (baseline).
-XX:MaxTenuringThreshold=0 (or InitialTenuringThreshold=1, or +AlwaysTenure): availability ≈ 98%.
ZGC + default: availability ≈ 99.5%.
G1GC + gray‑release + Eden pre‑heat: availability ≈ 99.995% and latency spikes vanished.
Conclusion
By systematically analyzing GC logs, identifying the double‑copy pattern of massive long‑living objects, and applying a combination of JVM tuning, collector migration, and controlled traffic‑cut‑off with active pre‑heat, the service eliminated GC‑induced latency spikes and achieved near‑perfect availability in a high‑throughput, low‑latency environment.