Eliminating GC Pauses: Achieving 99.995% Uptime in a 100k QPS Java Service
Facing frequent timeouts in a high‑throughput Java service, we traced instability to long GC pauses during massive index swaps, then systematically tuned JVM parameters, explored G1, ZGC, and custom Eden‑pre‑heating techniques, ultimately achieving near‑perfect availability without adding hardware.
1. Introduction
Online discussions about JVM tuning often claim that most cases don’t need any tuning, or that the real problem lies in business code. While modern JDK defaults are excellent, extreme performance and stability requirements can still demand targeted JVM adjustments.
2. Problem Background
Team A operates a high‑concurrency system (100k QPS, with peaks above 400k QPS) with millisecond‑level response time requirements. During index switches the system experienced frequent timeouts and reduced stability, with the request success rate dropping to about 95%. After optimization the success rate reached 99.995%.
3. Investigation Process
3.1 Initial Analysis
Logs showed only synchronous request timeouts; CPU and load were normal, ruling out traffic spikes. No external services or locks were involved, and the request flow consisted solely of in‑memory calculations.
3.2 Root Cause Identification
System A loads an index (a large in‑memory data structure) and periodically replaces the old index with a new one.
During the failure window a hot data publish (index switch) occurred. The index is ~0.5 GB, and a young GC (YGC) that runs while the freshly loaded index still sits in the young generation has to copy all of it. GC logs revealed a long YGC Object Copy phase (≈200 ms), during which all application threads are paused.
These long pauses directly caused upstream TimeoutExceptions.
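To get a feel for why a ≈200 ms pause is enough to cause upstream timeouts, the arithmetic can be sketched as follows. This is a rough illustration only: the class and method names are hypothetical, and it assumes requests arrive at a uniform rate.

```java
// Rough estimate of how many requests arrive while all application
// threads are stopped. Names are illustrative, not from the original system.
final class PauseImpact {
    private PauseImpact() {}

    /** Requests arriving during a stop-the-world pause, assuming a uniform arrival rate. */
    public static long requestsArrivingDuringPause(long qps, long pauseMillis) {
        return qps * pauseMillis / 1000;
    }

    public static void main(String[] args) {
        // Normal load from the article: 100k QPS, 200 ms Object Copy pause.
        System.out.println(requestsArrivingDuringPause(100_000, 200)); // 20000
        // Peak load from the article: 400k QPS.
        System.out.println(requestsArrivingDuringPause(400_000, 200)); // 80000
    }
}
```

Every one of those requests waits at least for the remainder of the pause, which is far beyond a millisecond‑level latency budget.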
4. Optimization Process
4.1 Common GC Tuning Ideas
Typical solutions (increase heap, add machines, use off‑heap memory) were unsuitable: the index size cannot be reduced, the algorithm does not support incremental updates, and off‑heap serialization overhead is prohibitive.
4.2 Detailed GC Log Analysis
Key JVM parameters:
-Xms12g
-Xmx12g
-XX:MetaspaceSize=512m
-XX:MaxMetaspaceSize=512m
-XX:+UseG1GC
-XX:G1HeapRegionSize=16M
-XX:MaxGCPauseMillis=100
-XX:InitiatingHeapOccupancyPercent=45
-XX:+HeapDumpOnOutOfMemoryError
-XX:MaxDirectMemorySize=1g
Visualization showed frequent short YGCs (blue dots) and occasional long YGCs (outlier blue points) that coincided with index switches.
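When experimenting with flags like these, it helps to confirm what the running JVM actually uses. The following is a minimal sketch, assuming a HotSpot JVM where com.sun.management.HotSpotDiagnosticMXBean is available. The class name GcFlagDump and the choice of flags to print are illustrative.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Prints the effective value of a few GC-related flags on a HotSpot JVM.
final class GcFlagDump {
    private GcFlagDump() {}

    /** Returns the current value of a VM flag, or a marker string if the JVM does not know it. */
    public static String flagValue(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            return bean.getVMOption(name).getValue();
        } catch (IllegalArgumentException e) {
            // Thrown when no VM option with this name exists.
            return "<unknown flag>";
        }
    }

    public static void main(String[] args) {
        String[] flags = {
            "UseG1GC", "G1HeapRegionSize", "MaxGCPauseMillis",
            "InitiatingHeapOccupancyPercent", "MaxTenuringThreshold"
        };
        for (String flag : flags) {
            System.out.println(flag + " = " + flagValue(flag));
        }
    }
}
```

Running it with the same options as the service shows whether a setting took effect or was silently overridden.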
4.3 Targeted JVM Parameter Tweaks
4.3.1 Promote Index Early to Old Generation
Setting MaxTenuringThreshold=1 promotes objects to the old generation once they have survived a single young‑generation GC. Experiments showed that G1 was already tenuring the index data this early, effectively behaving as if the threshold were 1.
4.3.2 Force Immediate Tenuring (Threshold 0)
Setting MaxTenuringThreshold=0 copies the index directly from Eden to the old generation, halving pause time.
4.3.3 InitialTenuringThreshold
Setting InitialTenuringThreshold=1 yields the same effect as the previous tweak.
4.3.4 AlwaysTenure
Enabling AlwaysTenure forces every object to be promoted, reducing the index copy count from two to one.
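The "two copies versus one" reasoning behind these tenuring tweaks can be written down as a small model. This is a simplified sketch, not how the collector is implemented: it assumes an object is copied to a survivor space on each young GC until its age reaches the threshold, and is then copied once more into the old generation. It ignores adaptive tenuring and survivor overflow. All names are hypothetical.

```java
// Simplified model of how often a long-lived object is copied before it
// settles in the old generation, as a function of the tenuring threshold.
final class TenuringModel {
    private TenuringModel() {}

    /** Copies until the object rests in the old generation (threshold survivor copies plus one promotion). */
    public static int copiesUntilOld(int tenuringThreshold) {
        if (tenuringThreshold < 0) {
            throw new IllegalArgumentException("threshold must be >= 0");
        }
        return tenuringThreshold + 1;
    }

    /** Total bytes the collector copies for a live data set of the given size. */
    public static long bytesCopied(long liveBytes, int tenuringThreshold) {
        return liveBytes * copiesUntilOld(tenuringThreshold);
    }

    public static void main(String[] args) {
        long indexBytes = 512L * 1024 * 1024; // ~0.5 GB index, as in the article
        System.out.println(copiesUntilOld(1));          // 2
        System.out.println(copiesUntilOld(0));          // 1
        System.out.println(bytesCopied(indexBytes, 1)); // 1073741824
        System.out.println(bytesCopied(indexBytes, 0)); // 536870912
    }
}
```

Under this model, a threshold of 0 (or AlwaysTenure) halves the bytes copied for the index compared with a threshold of 1.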
4.3.5 Direct Allocation to Old Generation (PretenureSizeThreshold)
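For G1 this flag has no effect. It would not help in any case: the index consists of millions of small objects rather than a few large ones, so no single object would exceed a size threshold, and they are all still allocated in Eden.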
4.3.6 G1HeapRegionSize
Increasing the region size did not change the copy path; the index remained allocated in Eden.
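The reason neither a size threshold nor the region size changes the copy path can be illustrated with G1's humongous‑object rule: only objects larger than roughly half a region bypass Eden. The sketch below is an assumption‑laden illustration (the exact boundary handling is simplified, and the class name is hypothetical), but it shows that small index objects stay far below the cut‑off for any region size.

```java
// Illustrates why millions of small objects are never treated as humongous
// by G1, whatever the region size. The "more than half a region" rule is a
// simplification of the actual boundary handling.
final class HumongousCheck {
    private HumongousCheck() {}

    private static final long MB = 1024L * 1024;

    /** True if a single object of this size would be allocated outside Eden as a humongous object. */
    public static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes > regionBytes / 2;
    }

    public static void main(String[] args) {
        long smallObject = 64; // a typical small index node, size assumed for illustration
        for (long regionMb : new long[] {1, 4, 16, 32}) {
            System.out.println(regionMb + "M region, 64-byte object humongous? "
                    + isHumongous(smallObject, regionMb * MB));
        }
        // Only a single very large object crosses the cut-off.
        System.out.println(isHumongous(9 * MB, 16 * MB)); // true
    }
}
```

Because the index is an aggregate of small objects, the total index size never enters this check. Only the size of each individual object does.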
4.3.7 Accelerate Copy Speed
Parameters such as MaxGCPauseMillis, ParallelGCThreads, and ConcGCThreads were already optimal and did not yield noticeable gains.
4.3.8 Upgrade to JDK 11 ZGC
ZGC reduces STW pauses by performing concurrent object relocation. Tests showed improved stability (success rate 99.5%) but occasional Allocation Stalls still caused minor spikes.
4.4 Problem Review
The system must simultaneously satisfy low latency, extreme memory pressure, and very high concurrency. Much like the trade‑off described by the CAP theorem, only two of these three constraints could be met at once; the original configuration could not meet all three.
4.5 “Index‑less” Switch – Gray Release + Eden Pre‑heat
We introduced a three‑step gray release: during the switch we pause traffic, load the new index, then deliberately fill the Eden space with temporary objects to force a YGC that moves the index to the old generation before traffic resumes.
public boolean switchIndex(String indexPath) {
    try {
        // 1. Load new index (traffic paused)
        MyIndex newIndex = loadIndex(indexPath);
        // 2. Switch index reference
        this.index = newIndex;
        // 3. Eden pre‑heat – allocate many temporary objects.
        //    Each char[524288] is about 1 MB and becomes garbage immediately,
        //    so the loop fills Eden and triggers YGCs.
        for (int i = 0; i < 10000; i++) {
            char[] tempArr = new char[524288];
        }
        // 4. Notify upstream that switch is complete
        return true;
    } catch (Exception e) {
        // Loading or switching failed; report failure to the caller
        return false;
    }
}
This ensures that the large index is fully promoted before traffic resumes, making subsequent YGCs fast (millisecond‑level).
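The fixed count of 10000 allocations in the loop above is tuned to one heap layout. A possible variation, sketched here under stated assumptions, keeps allocating until the JVM reports that collections have actually happened. The class EdenPreheat and its methods are hypothetical and not part of the original system. For simplicity the count sums all collectors, so it does not distinguish young from old collections. With the 16M region size listed earlier, arrays of about 1 MB stay below G1's humongous cut‑off and land in Eden.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Allocates short-lived arrays until the JVM reports a minimum number of
// collections, instead of relying on a fixed iteration count.
final class EdenPreheat {
    private EdenPreheat() {}

    // Volatile sink so every allocation is observably stored somewhere.
    private static volatile char[] sink;

    /** Sum of collection counts over all collectors (young and old alike). */
    public static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if the collector does not report a count
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /**
     * Allocates ~1 MB arrays until at least minCollections collections are
     * observed or maxAllocations is reached. Returns the collections observed.
     */
    public static long preheat(int minCollections, int maxAllocations) {
        long before = totalGcCount();
        for (int i = 0; i < maxAllocations && totalGcCount() - before < minCollections; i++) {
            sink = new char[524_288];
        }
        sink = null; // drop the last array so it can be reclaimed
        return totalGcCount() - before;
    }

    public static void main(String[] args) {
        long observed = preheat(2, 20_000);
        System.out.println("collections observed during pre-heat: " + observed);
    }
}
```

In a switch routine like the one above, the call would replace step 3, and the return value could be logged or checked before traffic is resumed.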
5. Summary
Through systematic JVM tuning (adjusting tenuring thresholds, using AlwaysTenure, upgrading to ZGC) and finally a gray release with Eden pre‑heating, we moved the long GC pauses caused by massive index copies out of the request path. The service now maintains >99.995% availability even under 100k+ QPS and frequent GB‑scale index switches.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
