How We Eliminated GC Pauses in a 100k QPS Service: Deep Dive into JVM Tuning
This article details the step-by-step investigation and the JVM-level optimizations (early tenuring, parameter tuning, ZGC migration, and an Eden pre-heat trick) that raised a high-concurrency, low-latency system's success rate from 95% to 99.995% during massive index switches.
Background
A high-concurrency service (≈100 k QPS, peak >400 k QPS) requires millisecond-level response times. Each periodic index refresh (every 15 minutes) loads a ~0.5 GB in-memory index, and these refreshes were causing intermittent upstream request time-outs.
Root Cause
GC logs showed occasional long-duration Young-Generation GC (YGC) pauses of up to 200 ms. These pauses occurred in the Object Copy phase: copying the large index object graph from Eden to the Survivor/Old regions is a Stop-The-World (STW) operation that blocks request processing for its full duration.
GC Log Investigation
Visualization of GC events revealed many short YGCs (blue dots) and a few outlier YGCs with long duration. The long pauses align with the index copy operation.
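As a rough sketch of that investigation, the Object Copy sub-phase timings can be pulled straight out of a JDK 8-style G1 log produced with -XX:+PrintGCDetails (the sample line below is illustrative, not taken from the actual service):

```shell
# Illustrative sample of a G1 "Object Copy" sub-phase line (JDK 8 log format):
printf '   [Object Copy (ms): Min: 0.1, Avg: 0.3, Max: 200.5, Diff: 200.4]\n' > gc.log
# Extract the per-event maxima to spot the outlier copy pauses:
grep "Object Copy" gc.log | grep -o "Max: [0-9.]*"
```

Plotting these maxima against wall-clock time is what exposes their alignment with the 15-minute index refreshes.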
Optimization Strategies
Typical mitigations such as adding machines, shrinking the index, or using off‑heap memory were either ineffective or introduced unacceptable overhead. The focus shifted to JVM tuning to reduce the impact of index copying.
1. Promote Index Early to Old Generation
By lowering the tenuring threshold, the index can skip the Survivor spaces. The following parameters were tested:
MaxTenuringThreshold=1 – promote after a single young‑generation collection.
InitialTenuringThreshold=1 – same effect for objects created at startup.
AlwaysTenure – force every object to be promoted directly.
With these settings the object flow changed from Eden → Survivor → Old (two copies) to Eden → Old (one copy), halving the pause time.
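As a sketch, the corresponding launch flags look like this (heap sizes and the jar name are placeholders; note that +AlwaysTenure subsumes the threshold flags, so in practice you would choose one of the two approaches):

```shell
# Option A: promote survivors after a single young collection
java -Xms8g -Xmx8g \
     -XX:MaxTenuringThreshold=1 -XX:InitialTenuringThreshold=1 \
     -jar service.jar

# Option B: promote every young-GC survivor straight to the Old generation
java -Xms8g -Xmx8g -XX:+AlwaysTenure -jar service.jar
```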
```java
public boolean switchIndex(String indexPath) {
    try {
        // 1. Load the new index (upstream traffic is paused during the switch)
        MyIndex newIndex = loadIndex(indexPath);
        // 2. Swap the live reference to the freshly loaded index
        this.index = newIndex;
        // 3. Eden pre-heat: burn through Eden with short-lived allocations
        //    (10,000 x ~1 MB char arrays) to force a young GC while still paused
        for (int i = 0; i < 10000; i++) {
            char[] tempArr = new char[524288];
        }
        // 4. Notify upstream that the switch is complete
        return true;
    } catch (Exception e) {
        // Swallowed here for brevity; log and alert in production
        return false;
    }
}
```
2. Direct Allocation to Old Generation
The PretenureSizeThreshold option does not affect G1GC, and adjusting G1HeapRegionSize did not change the allocation path because the index consists of millions of small objects.
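For reference, the combinations tried here would be expressed as the flags below (the jar name is a placeholder). PretenureSizeThreshold is honored only by the Serial/ParNew young collectors; under G1, only objects larger than half a region are treated as humongous and allocated outside the young generation, a bar that millions of small index objects never cross:

```shell
# Has no effect under G1 (honored by Serial/ParNew young collectors only):
java -XX:+UseG1GC -XX:PretenureSizeThreshold=1m -jar service.jar

# Raising the region size moves the humongous bar (1/2 region = 16 MB here),
# but the small index objects still allocate in Eden:
java -XX:+UseG1GC -XX:G1HeapRegionSize=32m -jar service.jar
```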
3. GC Tuning Parameters
Additional G1GC parameters were examined:
MaxGCPauseMillis – target maximum pause time.
ParallelGCThreads – number of threads for parallel phases of YGC/FGC.
ConcGCThreads – threads for concurrent marking.
These settings offered little improvement because the copy time is dominated by the sheer size of the index.
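A sketch of how these were exercised (values illustrative, not prescriptive; the jar name is a placeholder):

```shell
# MaxGCPauseMillis: pause-time goal (best effort, not a hard guarantee)
# ParallelGCThreads: worker threads for the STW parallel phases of YGC/FGC
# ConcGCThreads: threads used for concurrent marking
java -XX:+UseG1GC -XX:MaxGCPauseMillis=50 \
     -XX:ParallelGCThreads=8 -XX:ConcGCThreads=2 \
     -jar service.jar
```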
4. Switch to ZGC (JDK 11)
ZGC uses colored pointers and read barriers, turning the copy phase into a mostly concurrent operation. After migration, pause times dropped dramatically and the success rate rose to >99.5%, with only occasional Allocation Stall spikes.
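On JDK 11, ZGC is still marked experimental and must be unlocked explicitly; a minimal launch sketch (heap size and jar name are placeholders):

```shell
# ZGC on JDK 11 (experimental at that release)
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC \
     -Xms8g -Xmx8g -Xlog:gc -jar service.jar
```

Allocation Stall events show up in the -Xlog:gc output when mutator threads outrun the collector; giving ZGC more headroom (a larger heap) or more concurrent GC threads typically relieves them.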
Final Technique – Eden Pre‑heat
To guarantee that the index is fully promoted before traffic resumes, a lightweight "pre‑heat" loop is executed while the service is in a paused (flow‑blocked) state. The loop allocates many temporary objects, deliberately exhausting Eden space and forcing an immediate YGC that moves the index to the Old generation. After the pre‑heat completes, traffic is restored and subsequent YGCs complete in a few milliseconds.
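One way to verify the pre-heat actually did its job is to check the unified GC log (-Xlog:gc, JDK 9+ with G1) for a young pause inside the switch window before unblocking traffic; the sample line below is illustrative:

```shell
# Illustrative unified-logging line for a young pause (JDK 9+ G1 format):
printf '[2.345s][info][gc] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 512M->24M(1024M) 3.2ms\n' > gc.log
# Count young pauses recorded since the log was rotated at switch start:
grep -c "Pause Young" gc.log
```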
Results
Combining early tenuring, ZGC, and the Eden pre-heat technique eliminated long GC pauses during index switches. The system's success rate improved from 95% → 98% → 99.5% → 99.995% while latency stayed stable at roughly 100 k QPS.
Key Takeaways
Identify the exact GC phase causing pauses (Object Copy in YGC).
Use tenuring parameters to reduce copy count.
Consider modern low‑STW collectors such as ZGC.
When necessary, force a controlled GC (Eden pre‑heat) before traffic resumes.
These JVM‑level optimizations resolved the instability without code changes, additional hardware, or architectural redesign.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.