
How We Eliminated GC‑Induced Pauses in a 100k QPS Service

This article walks step by step through the investigation of a high‑concurrency, low‑latency service whose time‑outs were traced to long Young‑GC pauses during hot swaps of a large in‑heap index. It explains how GC‑log analysis, targeted JVM parameter tuning, a move to ZGC, and a lightweight Eden pre‑heat step raised the stable success rate above 99.995 %.

dbaplus Community

Background and Problem Statement

The team operated a high‑throughput service (≈100 k QPS, spikes >400 k QPS) that required millisecond‑level response times. Frequent time‑out errors appeared during index hot‑swaps, and initial checks showed no traffic spikes, CPU overload, or external dependency issues.

Root‑Cause Investigation

Log analysis revealed that each index swap triggered a long Young‑GC (YGC) pause dominated by the Object Copy phase, in which the ~0.5 GB index was copied out of Eden into Survivor space and then into the Old generation. The stop‑the‑world pause (up to 200 ms) stalled all request‑handling threads, so upstream callers saw TimeoutException errors.

GC‑Log Deep Dive

Using an internal ATP visualizer, the team identified patterns:

Frequent short YGCs (blue dots) – normal.

Occasional long YGCs (red dots) – correlated with index swaps.

Each long YGC was followed by a sharp increase in Old‑gen usage (purple line).

Further inspection showed that the long YGCs always occurred in pairs: the first copied the new index from Eden to Survivor, the second promoted it from Survivor to Old, and both incurred the full copy cost.
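The pattern described above can also be spotted without a visualizer by scanning the GC log for young pauses above a threshold. The following is a minimal sketch, assuming JDK 9+ unified logging (-Xlog:gc) and its usual "Pause Young ... 187.412ms" line shape; the class name, the 100 ms threshold, and the sample lines are made up for illustration and are not taken from the team's logs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Flags young-generation pauses in a unified-logging GC log that exceed a threshold. */
final class LongYoungPauseFinder {
    // Assumed line shape (JDK 9+ -Xlog:gc), for example:
    // [12.345s][info][gc] GC(42) Pause Young (Normal) (G1 Evacuation Pause) 6144M->2048M(12288M) 187.412ms
    private static final Pattern YOUNG_PAUSE = Pattern.compile(
            "GC\\((\\d+)\\) Pause Young.*?(\\d+)M->(\\d+)M\\(\\d+M\\) (\\d+(?:\\.\\d+)?)ms");

    /** Returns the log lines whose young pause lasted at least thresholdMs milliseconds. */
    static List<String> findLongPauses(List<String> logLines, double thresholdMs) {
        List<String> hits = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = YOUNG_PAUSE.matcher(line);
            // group(4) is the pause duration in milliseconds
            if (m.find() && Double.parseDouble(m.group(4)) >= thresholdMs) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Illustrative, made-up sample lines; real input would be read from the GC log file.
        List<String> sample = Arrays.asList(
                "[10.001s][info][gc] GC(40) Pause Young (Normal) (G1 Evacuation Pause) 6100M->2000M(12288M) 7.903ms",
                "[12.345s][info][gc] GC(41) Pause Young (Normal) (G1 Evacuation Pause) 6144M->2560M(12288M) 187.412ms",
                "[14.870s][info][gc] GC(42) Pause Young (Normal) (G1 Evacuation Pause) 6656M->2570M(12288M) 176.020ms");
        for (String hit : findLongPauses(sample, 100.0)) {
            System.out.println(hit);
        }
    }
}
```

Two flagged pauses in quick succession right after a swap would correspond to the paired long YGCs described above. On JDK 8 the log format differs, so the pattern would need to be adapted.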

Regular Optimization Ideas (Rejected)

Typical remedies such as adding machines, shrinking the index, or moving it to off‑heap memory were unsuitable because the index size could not be reduced and the workload required real‑time access.

Targeted JVM Parameter Tuning

Given the constraints, the focus shifted to JVM flags that could reduce copy overhead:

-Xms12g
-Xmx12g
-XX:MetaspaceSize=512m
-XX:MaxMetaspaceSize=512m
-XX:+UseG1GC
-XX:G1HeapRegionSize=16M
-XX:MaxGCPauseMillis=100
-XX:InitiatingHeapOccupancyPercent=45
-XX:+HeapDumpOnOutOfMemoryError
-XX:MaxDirectMemorySize=1g

Key parameters explored:

MaxTenuringThreshold: the upper bound on how many YGC cycles an object can survive in Survivor space before promotion (default 15). Setting it to 1 already forced the index to skip most Survivor cycles.

InitialTenuringThreshold: the starting value of the adaptive tenuring threshold; it had a similar effect. Experiments confirmed that with the default G1 behavior the index path was Eden → S0 → Old.

AlwaysTenure: promotes every surviving object directly to Old at its first YGC, achieving the same reduction in copy steps.

Experiments showed that forcing -XX:MaxTenuringThreshold=0 or -XX:+AlwaysTenure reduced the number of copy phases from two to one, halving the pause duration and raising the success rate from 95 % to 98 %.
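The effect of the threshold on copy count can be worked through with a small model. This is a simplified sketch, not the JVM's actual logic: it assumes a fixed tenuring threshold, ignores adaptive lowering and Survivor overflow, and the class and method names are hypothetical.

```java
/** Simplified model of how often a long-lived object is copied before it rests in Old. */
final class TenuringModel {
    /**
     * Counts the young-GC copies of an object that stays live, assuming a fixed
     * tenuring threshold (no adaptive lowering, no Survivor overflow).
     */
    static int copiesUntilOld(int tenuringThreshold) {
        int age = 0;    // a freshly allocated object in Eden has age 0
        int copies = 0;
        while (true) {
            copies++;   // every young GC copies the live object once
            if (age >= tenuringThreshold) {
                return copies; // this copy lands in Old
            }
            age++;      // this copy lands in Survivor and the object ages by one
        }
    }

    public static void main(String[] args) {
        for (int threshold : new int[] {0, 1, 15}) {
            System.out.println("MaxTenuringThreshold=" + threshold
                    + " -> copies=" + copiesUntilOld(threshold));
        }
    }
}
```

Under this model a threshold of 1 gives the two copies seen in the paired long YGCs (Eden → Survivor → Old), and a threshold of 0 gives a single copy straight to Old, which matches the reduction from two copy phases to one.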

Attempted Direct Old‑Gen Allocation

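The team tried PretenureSizeThreshold and G1HeapRegionSize to allocate the index directly into Old, but neither helped. G1 does not honor PretenureSizeThreshold, and it allocates outside the young generation only for humongous objects (roughly half a region or larger), whereas the index is composed of many small objects.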
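Why region size does not help here comes down to simple arithmetic: G1 bypasses the young generation only for objects of about half a region or more. The sketch below illustrates this with the 16 MB region size from the flag list; the class name, the 64‑byte example object, and the exact boundary comparison are assumptions for illustration.

```java
/** Illustrates G1's size test for allocating an object outside the young generation. */
final class HumongousCheck {
    /**
     * Returns true if an object of objectBytes would be treated as humongous for
     * the given region size. The boundary (>= half a region) is an assumption;
     * the point is the order of magnitude, not the exact edge case.
     */
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes >= regionBytes / 2;
    }

    public static void main(String[] args) {
        long regionBytes = 16L * 1024 * 1024;     // -XX:G1HeapRegionSize=16M
        long smallIndexNode = 64;                 // hypothetical small index object
        long hugeBuffer = 12L * 1024 * 1024;      // hypothetical single 12 MiB array

        System.out.println("small node humongous:  " + isHumongous(smallIndexNode, regionBytes));
        System.out.println("12 MiB buffer humongous: " + isHumongous(hugeBuffer, regionBytes));
    }
}
```

An index built from many small objects never crosses the threshold, whatever the region size, so each of its objects starts life in Eden.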

Accelerating the Copy Process

Adjusting the pause‑time goal and GC thread counts (MaxGCPauseMillis, ParallelGCThreads, ConcGCThreads) yielded negligible gains because the copy time was dominated by the sheer volume of data.

Switching to ZGC (JDK 11)

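ZGC relocates objects concurrently with application threads, which reduced STW pauses dramatically. After migration, the service’s success rate climbed to 99.5 %, though occasional Allocation Stall events (threads blocked because memory is being allocated faster than ZGC can reclaim it) still caused minor spikes.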
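On JDK 11, ZGC is still experimental and is enabled with -XX:+UnlockExperimentalVMOptions -XX:+UseZGC. After changing collectors it is worth confirming which one the running JVM actually uses. The following is a minimal sketch using the standard management API; the class name is hypothetical, and the reported collector names vary by JDK version.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Lists the garbage collectors the running JVM reports through JMX. */
final class ActiveCollectors {
    static List<String> names() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // With G1 this typically prints names such as "G1 Young Generation";
        // with ZGC the names are expected to contain "ZGC".
        for (String name : names()) {
            System.out.println(name);
        }
    }
}
```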

Imperceptible (“No‑Feel”) Index Switch via Eden Pre‑Heat

To eliminate the remaining pauses, the team added a lightweight “pre‑heat” step to the gray‑release workflow: while traffic to the instance is paused, the swap code deliberately fills Eden with temporary objects, forcing YGCs that move the newly loaded index fully into Old before traffic resumes. The added code is:

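public boolean switchIndex(String indexPath) {
    try {
        // 1. Load the new index (traffic to this instance is paused)
        MyIndex newIndex = loadIndex(indexPath);
        // 2. Switch to the new index
        this.index = newIndex;
        // 3. Eden pre-heat: allocate many short-lived objects.
        //    Each char[524288] is about 1 MiB (2 bytes per char), so 10,000
        //    iterations churn through roughly 10 GB and force several young
        //    collections while no requests are being served.
        for (int i = 0; i < 10000; i++) {
            char[] tempArr = new char[524288]; // garbage as soon as the iteration ends
        }
        // 4. Notify completion
        return true;
    } catch (Exception e) {
        // A failed load or switch is reported to the caller as false
        return false;
    }
}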

This forces the index to be copied to Old while traffic is still paused, so the expensive copy no longer coincides with live requests. Subsequent YGCs complete in milliseconds, eliminating observable latency spikes.
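Whether the pre‑heat actually triggered collections can be checked by comparing GC counts before and after it runs. The sketch below is a standalone variant under stated assumptions, not the team's code: the class and method names are hypothetical, and the volatile sink field is an added precaution so that the JIT compiler cannot optimize the otherwise unused allocations away.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Standalone Eden pre-heat that reports how many collections it triggered. */
final class EdenPreheat {
    // Written on every iteration so the allocations are observable and not eliminated.
    private static volatile char[] sink;

    /** Sum of collection counts over all collectors the JVM exposes. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the count is undefined
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Allocates short-lived char arrays and returns the payload bytes requested. */
    static long preheat(int arrays, int charsPerArray) {
        long bytes = 0;
        for (int i = 0; i < arrays; i++) {
            sink = new char[charsPerArray]; // previous array becomes garbage
            bytes += (long) charsPerArray * Character.BYTES;
        }
        sink = null; // release the last array as well
        return bytes;
    }

    public static void main(String[] args) {
        long before = totalGcCount();
        long bytes = preheat(10_000, 524_288); // same sizes as the loop in switchIndex
        long after = totalGcCount();
        System.out.println("allocated ~" + (bytes >> 20) + " MiB, collections triggered: "
                + (after - before));
    }
}
```

If the reported number of collections is zero, the pre‑heat volume is too small for the configured heap and the index is probably still sitting in the young generation when traffic resumes.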

Final Results

Combining the JVM flag tweaks, the ZGC migration, and the pre‑heat gray‑release strategy raised the system’s stable success rate to >99.995 %, making index swaps effectively imperceptible (“index‑no‑feel”) even under 10⁵ QPS load.

Conclusion

The case demonstrates that for ultra‑high‑throughput, low‑latency Java services, deep GC‑log analysis, precise JVM tuning, and controlled release workflows can together eradicate GC‑induced instability without adding hardware or changing business logic.
