How We Eliminated GC‑Induced Pauses in a 100k QPS Service
This article details a step‑by‑step investigation of a high‑concurrency, low‑latency system whose instability was traced to long‑lasting Young‑GC pauses during massive index swaps, and explains how targeted JVM parameter tweaks, GC‑log analysis, and a lightweight Eden‑pre‑heat technique finally achieved near‑perfect availability.
Background and Problem Statement
The team operated a high‑throughput service (≈100 k QPS, spikes >400 k QPS) that required millisecond‑level response times. Frequent time‑out errors appeared during index hot‑swaps, and initial checks showed no traffic spikes, CPU overload, or external dependency issues.
Root‑Cause Investigation
Log analysis revealed that each index swap triggered a long‑lasting Young‑GC (YGC) pause caused by the Object Copy phase, where a ~0.5 GB index was copied from Eden to Survivor/Old generations. The pause (up to 200 ms) stalled all request‑handling threads, leading to upstream TimeoutException errors.
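To put the 200 ms figure in perspective, here is a rough back-of-the-envelope sketch. It assumes requests arrive at a uniform rate, which the article does not state, and the class and method names are purely illustrative.

```java
/** Rough estimate of how many requests pile up behind one stop-the-world pause. */
final class PauseImpact {
    /** Requests arriving while all handler threads are stopped (uniform arrival assumed). */
    static long requestsArrivingDuringPause(long qps, long pauseMillis) {
        return qps * pauseMillis / 1000;
    }

    public static void main(String[] args) {
        System.out.println(requestsArrivingDuringPause(100_000, 200)); // 20000 at the baseline load
        System.out.println(requestsArrivingDuringPause(400_000, 200)); // 80000 at the peak load
    }
}
```

Under that assumption, a single 200 ms pause at the baseline load delays on the order of 20,000 requests, which is consistent with the upstream timeouts described above.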
GC‑Log Deep Dive
Using an internal ATP visualizer, the team identified patterns:
Frequent short YGCs (blue dots) – normal.
Occasional long YGCs (red dots) – correlated with index swaps.
Each long YGC was followed by a sharp increase in Old‑gen usage (purple line).
Further inspection showed that the long YGCs always occurred in pairs: the first promoted the new index to Survivor, the second promoted it to Old, both incurring heavy copy costs.
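The team's ATP visualizer is internal, but the same "find the long YGCs" step can be approximated with a few lines of code. The sketch below assumes the JDK 9+ unified logging format produced by -Xlog:gc; the sample line in the comment is illustrative rather than taken from the team's logs, and GcPauseScanner is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Flags long young-generation pauses in unified GC log lines (assumed -Xlog:gc format). */
final class GcPauseScanner {
    // Assumed line shape (illustrative):
    // [2.000s][info][gc] GC(2) Pause Young (Normal) (G1 Evacuation Pause) 6144M->1024M(12288M) 187.412ms
    private static final Pattern YOUNG_PAUSE =
            Pattern.compile("GC\\((\\d+)\\) Pause Young.*?(\\d+(?:\\.\\d+)?)ms\\s*$");

    /** Returns "gcId:pauseMs" for every young pause at or above thresholdMs. */
    static List<String> findLongPauses(List<String> lines, double thresholdMs) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            Matcher m = YOUNG_PAUSE.matcher(line);
            if (m.find()) {
                double pauseMs = Double.parseDouble(m.group(2));
                if (pauseMs >= thresholdMs) {
                    hits.add(m.group(1) + ":" + pauseMs);
                }
            }
        }
        return hits;
    }
}
```

Comparing the flagged GC ids with index-swap timestamps would then show whether long pauses arrive in pairs, as the team observed.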
Regular Optimization Ideas (Rejected)
Typical remedies such as adding machines, shrinking the index, or moving it to off‑heap memory were unsuitable because the index size could not be reduced and the workload required real‑time access.
Targeted JVM Parameter Tuning
Given the constraints, the focus shifted to JVM flags that could reduce copy overhead:
-Xms12g
-Xmx12g
-XX:MetaspaceSize=512m
-XX:MaxMetaspaceSize=512m
-XX:+UseG1GC
-XX:G1HeapRegionSize=16M
-XX:MaxGCPauseMillis=100
-XX:InitiatingHeapOccupancyPercent=45
-XX:+HeapDumpOnOutOfMemoryError
-XX:MaxDirectMemorySize=1g
Key parameters explored:
MaxTenuringThreshold: caps how many YGC cycles an object can survive before promotion. Setting it to 1 already forced the index to skip most Survivor cycles.
InitialTenuringThreshold: sets the starting value of the tenuring threshold and had a similar effect; experiments confirmed the index path was Eden → S0 → Old with the default G1 behavior.
AlwaysTenure: promotes every surviving object to Old at its first YGC, achieving the same reduction in copy steps.
Experiments showed that forcing MaxTenuringThreshold=0 or AlwaysTenure reduced the number of copy phases from two to one, halving the pause duration and raising success rates from 95 % to 98 %.
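When experimenting with tenuring flags it helps to confirm what the running JVM actually uses, since a mistyped or overridden flag can go unnoticed. The following is a minimal sketch that assumes a HotSpot-based JVM, where com.sun.management.HotSpotDiagnosticMXBean is available; TenuringFlagCheck is a hypothetical name and not part of the team's code.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

/** Prints the effective values of the tenuring-related flags (HotSpot JVMs only). */
final class TenuringFlagCheck {
    /** Returns the effective value of a HotSpot flag such as "MaxTenuringThreshold". */
    static String effectiveValue(String flagName) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Throws IllegalArgumentException if the JVM does not know the flag.
        return bean.getVMOption(flagName).getValue();
    }

    public static void main(String[] args) {
        String[] flags = {"MaxTenuringThreshold", "InitialTenuringThreshold", "AlwaysTenure"};
        for (String flag : flags) {
            System.out.println(flag + " = " + effectiveValue(flag));
        }
    }
}
```

Running it with, for example, -XX:MaxTenuringThreshold=0 on the command line should print that value back, which makes it easy to tell whether a test run exercised the intended configuration.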
Attempted Direct Old‑Gen Allocation
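The team tried PretenureSizeThreshold and G1HeapRegionSize to allocate the index directly into Old, but neither helped. G1 does not honor PretenureSizeThreshold, and it allocates directly into Old only humongous objects, that is, objects larger than half a region. The index is composed of many small objects, none of which qualify, so no benefit was observed.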
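The reason region size alone cannot help follows from G1's humongous-object rule: only an individual object larger than half a region bypasses the young generation. The sketch below illustrates that rule with the 16M region size from the flag list; the 256-byte "index node" is a made-up size for illustration, and HumongousCheck is a hypothetical name.

```java
/** Illustrates G1's half-region rule for humongous (directly old-allocated) objects. */
final class HumongousCheck {
    /** G1 treats an allocation as humongous when it is larger than half a region. */
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes > regionBytes / 2;
    }

    public static void main(String[] args) {
        long region = 16L << 20;                              // -XX:G1HeapRegionSize=16M
        System.out.println(isHumongous(256, region));         // false: a small index node (assumed size)
        System.out.println(isHumongous(1L << 20, region));    // false: a 1 MiB array
        System.out.println(isHumongous(9L << 20, region));    // true: a single 9 MiB object
    }
}
```

Because the rule applies per object rather than per data structure, a 0.5 GB index built from small objects is still allocated in Eden like any other short-lived data.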
Accelerating the Copy Process
Adjusting the pause-time target and GC thread counts (MaxGCPauseMillis, ParallelGCThreads, ConcGCThreads) yielded negligible gains because the copy time was dominated by the sheer volume of data.
Switching to ZGC (JDK 11)
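ZGC relocates objects concurrently with the application, which reduced STW pauses dramatically. (In JDK 11, ZGC is still experimental and is enabled with -XX:+UnlockExperimentalVMOptions -XX:+UseZGC.) After migration, the service's success rate climbed to 99.5 %, though occasional Allocation Stall events, in which a thread has to wait for the collector to free memory, still caused minor spikes.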
Index‑No‑Feel Switch via Eden Pre‑Heat
To eliminate the remaining pauses, the team added a lightweight "pre-heat" step to the gray-release (canary) workflow: while the instance is out of rotation, it deliberately fills Eden with temporary objects, forcing the YGCs that move the newly loaded index fully into Old before traffic resumes. The added code is:
public boolean switchIndex(String indexPath) {
    try {
        // 1. Load new index (traffic paused)
        MyIndex newIndex = loadIndex(indexPath);
        // 2. Switch index
        this.index = newIndex;
        // 3. Eden pre-heat: allocate many short-lived objects.
        //    Each char[524288] is about 1 MiB (2 bytes per char), so the loop
        //    allocates roughly 10 GB in total and forces YGCs along the way.
        for (int i = 0; i < 10000; i++) {
            char[] tempArr = new char[524288];
        }
        // 4. Notify completion
        return true;
    } catch (Exception e) {
        return false;
    }
}
This forces the index to be copied to Old while traffic is paused, after which subsequent YGCs complete in milliseconds, eliminating observable latency spikes.
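A fixed iteration count gives no feedback on whether a collection actually ran. One possible variation, sketched below, stops as soon as the JVM reports the desired number of additional collections. This is not the team's code: EdenPreheat, preheat, totalCollections and sink are hypothetical names, and summing the counts of all collectors is a deliberately coarse signal that does not distinguish young from old collections.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Pre-heat sketch: allocate garbage until the JVM reports enough new collections. */
final class EdenPreheat {
    // Written on every iteration so each allocation has a visible side effect.
    static volatile char[] sink;

    /** Sums collection counts over all collectors; an undefined count (-1) is treated as 0. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionCount());
        }
        return total;
    }

    /**
     * Allocates 1 MiB char arrays until at least targetCollections further
     * collections have run, or maxArrays arrays have been allocated.
     * Returns true if the target was reached.
     */
    static boolean preheat(int targetCollections, int maxArrays) {
        long start = totalCollections();
        try {
            for (int i = 0; i < maxArrays; i++) {
                sink = new char[524_288]; // same array size as the original loop
                // Check every 16 arrays to keep the MXBean overhead small.
                if ((i & 15) == 15 && totalCollections() - start >= targetCollections) {
                    return true;
                }
            }
            return false;
        } finally {
            sink = null; // do not keep the last array alive
        }
    }
}
```

A caller could, for instance, invoke preheat(2, 20_000) after swapping the index and keep the instance out of rotation if it returns false. Note that the arrays only fill Eden when they are smaller than half a G1 region, which holds for the 16M region size used here.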
Final Results
Combining the JVM flag tweaks, ZGC migration, and the pre‑heat gray‑release strategy raised the system’s stable success rate to >99.995 %, effectively achieving “index‑no‑feel” swaps even under 10⁵ QPS load.
Conclusion
The case demonstrates that for ultra‑high‑throughput, low‑latency Java services, deep GC‑log analysis, precise JVM tuning, and controlled release workflows can together eradicate GC‑induced instability without adding hardware or changing business logic.
dbaplus Community
