Eliminating GC Pauses: Achieving 99.995% Uptime in a 100k QPS Java Service

Facing frequent timeouts in a high‑throughput Java service, we traced instability to long GC pauses during massive index swaps, then systematically tuned JVM parameters, explored G1, ZGC, and custom Eden‑pre‑heating techniques, ultimately achieving near‑perfect availability without adding hardware.

Alibaba Cloud Developer

1. Introduction

Online discussions about JVM tuning often claim that most cases don’t need any tuning, or that the real problem lies in business code. While modern JDK defaults are excellent, extreme performance and stability requirements can still demand targeted JVM adjustments.

2. Problem Background

Team A operates a high-concurrency system (on the order of 100k QPS, with peaks above 400k QPS) with millisecond-level response time requirements. During index switches the system experienced frequent timeouts and stability drops, with the success rate falling to about 95%. After optimization it holds at 99.995%.

3. Investigation Process

3.1 Initial Analysis

Logs showed only synchronous request timeouts; CPU and load were normal, ruling out traffic spikes. No external services or locks were involved, and the request flow consisted solely of in‑memory calculations.

3.2 Root Cause Identification

System A loads an index (a large in-memory data structure) and periodically replaces the old index with a new one.

During the failure window a hot data publish (index switch) occurred. The index is ~0.5 GB, and a freshly loaded index is live data that the young collector has to copy out of Eden. GC logs revealed a long Object Copy phase during young GC (YGC), about 200 ms. Because a YGC is stop-the-world, all application threads were paused for that long.

These long pauses directly caused upstream TimeoutExceptions.
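One way to confirm this kind of correlation without reading raw GC logs is to sample the JVM's own GC counters around a suspect operation. The following is a minimal sketch, not the tooling used in the original investigation: `GcTimeProbe` and its method names are hypothetical, and it relies only on the standard `GarbageCollectorMXBean` API. The reported time covers every thread in the JVM, and for concurrent collectors it may include concurrent work rather than pure pause time, so treat the number as a rough signal.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcTimeProbe {
    /** Accumulated collection time in ms, summed over all collectors the JVM reports. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = bean.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Runs the action and returns how much reported GC time accumulated meanwhile. */
    static long gcMillisDuring(Runnable action) {
        long before = totalGcTimeMillis();
        action.run();
        return totalGcTimeMillis() - before;
    }
}
```

Wrapping the index switch in `gcMillisDuring(...)` and logging the result next to the request timeout count would show whether the two move together.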

4. Optimization Process

4.1 Common GC Tuning Ideas

Typical solutions (increase heap, add machines, use off‑heap memory) were unsuitable: the index size cannot be reduced, the algorithm does not support incremental updates, and off‑heap serialization overhead is prohibitive.

4.2 Detailed GC Log Analysis

Key JVM parameters:

-Xms12g
-Xmx12g
-XX:MetaspaceSize=512m
-XX:MaxMetaspaceSize=512m
-XX:+UseG1GC
-XX:G1HeapRegionSize=16M
-XX:MaxGCPauseMillis=100
-XX:InitiatingHeapOccupancyPercent=45
-XX:+HeapDumpOnOutOfMemoryError
-XX:MaxDirectMemorySize=1g
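When experimenting with flags like these, it helps to check what the running JVM actually resolved them to, since defaults and ergonomics can differ from what the command line suggests. The sketch below is an illustration under the assumption of a HotSpot JVM; `EffectiveFlags` is a hypothetical helper name, while `HotSpotDiagnosticMXBean.getVMOption` is the standard HotSpot management call.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

final class EffectiveFlags {
    /** Returns the current value of a HotSpot flag, e.g. "MaxTenuringThreshold". */
    static String flagValue(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Throws IllegalArgumentException if the flag does not exist in this JVM.
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        String[] names = {
            "UseG1GC", "MaxGCPauseMillis",
            "InitiatingHeapOccupancyPercent", "MaxTenuringThreshold"
        };
        for (String name : names) {
            System.out.println(name + " = " + flagValue(name));
        }
    }
}
```

Printing these values at startup makes it easier to tell, for each experiment in the next section, which setting was really in effect.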

Visualization showed frequent short YGCs (blue dots) and occasional long YGCs (outlier blue points) that coincided with index switches.

4.3 Targeted JVM Parameter Tweaks

4.3.1 Promote Index Early to Old Generation

Setting MaxTenuringThreshold=1 promotes objects to the old generation once they have survived a single young-generation GC. Experiments showed no gain from this: G1 was already tenuring the index data that early, effectively using a threshold of 1.

4.3.2 Force Immediate Tenuring (Threshold 0)

Setting MaxTenuringThreshold=0 copies the index directly from Eden to the old generation, halving pause time.

4.3.3 InitialTenuringThreshold

Setting InitialTenuringThreshold=1 yields the same effect as the previous tweak.

4.3.4 AlwaysTenure

Enabling AlwaysTenure forces every object to be promoted, reducing the index copy count from two to one.
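The "two copies versus one" reasoning behind the tenuring experiments can be written down as a small model. This is a simplified sketch with hypothetical names (`TenuringModel`, `copiesUntilTenured`), assuming a long-lived object that is copied to a survivor space on each young GC until its age reaches the threshold and is then copied once more into the old generation. It ignores survivor overflow and G1's adaptive threshold, both of which can promote objects earlier.

```java
final class TenuringModel {
    /**
     * Number of times a long-lived object is copied before it rests in the
     * old generation: one survivor copy per young GC while its age is below
     * the threshold, plus the final copy into the old generation.
     */
    static int copiesUntilTenured(int tenuringThreshold) {
        if (tenuringThreshold < 0 || tenuringThreshold > 15) {
            throw new IllegalArgumentException("threshold must be in [0, 15]");
        }
        return tenuringThreshold + 1;
    }

    /** Total bytes moved by the collector for liveBytes of such data. */
    static long bytesCopied(long liveBytes, int tenuringThreshold) {
        return liveBytes * copiesUntilTenured(tenuringThreshold);
    }
}
```

Under this model a 0.5 GB index costs about 1 GB of copying with a threshold of 1 and about 0.5 GB with a threshold of 0, which is consistent with the pause time roughly halving.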

4.3.5 Direct Allocation to Old Generation (PretenureSizeThreshold)

For G1 this flag has no effect. It would not help here in any case: the index is not one large object but millions of small ones, and small objects are always allocated in Eden.

4.3.6 G1HeapRegionSize

Increasing the region size did not change the copy path; the index remained allocated in Eden.
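The reason region size does not help follows from how G1 decides what may bypass Eden: only "humongous" objects, roughly those larger than half a region, are allocated outside the young generation. The check below is a sketch of that rule as a plain function; `G1Regions` is a hypothetical name, and the exact boundary behaviour at precisely half a region is an assumption rather than something the article states.

```java
final class G1Regions {
    /** True when a single object of objectBytes is larger than half a G1 region. */
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes > regionBytes / 2;
    }

    public static void main(String[] args) {
        long region = 16L << 20; // 16M, as in -XX:G1HeapRegionSize=16M
        // A small index entry: far below the limit, so it goes to Eden.
        System.out.println(isHumongous(64, region));
        // A single 9 MB array: above half a region, so it bypasses Eden.
        System.out.println(isHumongous(9L << 20, region));
    }
}
```

Raising the region size only raises this limit, so an index made of small objects stays in Eden regardless.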

4.3.7 Accelerate Copy Speed

Parameters such as MaxGCPauseMillis, ParallelGCThreads, and ConcGCThreads were already optimal and did not yield noticeable gains.

4.3.8 Upgrade to JDK 11 ZGC

ZGC reduces STW pauses by performing concurrent object relocation. Tests showed improved stability (success rate 99.5%) but occasional Allocation Stalls still caused minor spikes.

4.4 Problem Review

The system must simultaneously satisfy low latency, extreme memory pressure, and very high concurrency. Much like the three properties in the CAP theorem for distributed systems, these constraints pull against each other: the original configuration could meet any two of them, but not all three at once.

4.5 “Index‑less” Switch – Gray Release + Eden Pre‑heat

We introduced a three‑step gray release: during the switch we pause traffic, load the new index, then deliberately fill the Eden space with temporary objects to force a YGC that moves the index to the old generation before traffic resumes.

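public boolean switchIndex(String indexPath) {
    try {
        // 1. Load new index (traffic paused)
        MyIndex newIndex = loadIndex(indexPath);
        // 2. Switch index reference (index is assumed to be a volatile field,
        //    so request threads see the new index once traffic resumes)
        this.index = newIndex;
        // 3. Eden pre-heat: allocate many temporary objects to force young GCs.
        //    Each char[524288] is about 1 MB, below half of the 16M region size,
        //    so it is allocated in Eden; 10000 of them total roughly 10 GB.
        for (int i = 0; i < 10000; i++) {
            char[] tempArr = new char[524288];
        }
        // 4. Notify upstream that switch is complete
        return true;
    } catch (Exception e) {
        // Loading or switching failed: report it so upstream keeps traffic away
        return false;
    }
}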

This guarantees that the large index is fully promoted before traffic resumes, making subsequent YGCs fast (millisecond‑level).
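A fixed allocation count works when the heap layout is known, but the pre-heat can also stop as soon as the JVM reports that a collection has happened. The variant below is a sketch of that idea, not the original implementation: `EdenPreheater`, `preheat`, and their parameters are hypothetical, and the GC count is summed over all collectors, so it does not distinguish young from old collections. The array length should stay below G1's humongous limit, otherwise the arrays would not land in Eden.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

final class EdenPreheater {
    // Publishing each array through a static field makes it escape,
    // so the JIT cannot drop the allocation as dead code.
    static volatile Object sink;

    private static final List<GarbageCollectorMXBean> GC_BEANS =
            ManagementFactory.getGarbageCollectorMXBeans();

    /** Collections reported so far, summed over all collectors. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean bean : GC_BEANS) {
            long count = bean.getCollectionCount(); // -1 when undefined
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /**
     * Allocates short-lived char arrays until at least `collections` GCs have
     * been observed or `maxAllocations` is reached.
     * Returns true if the requested number of GCs was observed.
     */
    static boolean preheat(int collections, int maxAllocations, int arrayLength) {
        long target = totalGcCount() + collections;
        try {
            for (int i = 0; i < maxAllocations; i++) {
                sink = new char[arrayLength];
                if (totalGcCount() >= target) {
                    return true;
                }
            }
            return false;
        } finally {
            sink = null; // let the last array be collected
        }
    }
}
```

With settings like those in this article, a call such as `preheat(2, 10000, 524288)` would keep the same 1 MB arrays and the same upper bound as the original loop while returning early once two collections have run. Requiring more than one collection is a conservative choice for cases where the tenuring threshold is above 0.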

5. Summary

Through systematic JVM tuning—adjusting tenuring thresholds, using AlwaysTenure, upgrading to ZGC, and finally applying a gray‑release with Eden pre‑heating—we eliminated the long GC pauses caused by massive index copies. The service now maintains >99.995% availability even under 100k+ QPS and frequent GB‑scale index switches.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
