Understanding and Solving GC Spikes in High‑Throughput Java Services

This article explains what a GC spike (garbage-collection spike) is, analyzes its typical causes — large short-lived objects, memory leaks, and heap misconfiguration — walks through a real-world high-concurrency case study, and details the step-by-step JVM tuning and architectural strategies that eliminated latency spikes and raised service availability from roughly 95% to over 99.99%.


GC Spike Definition

A GC spike (sometimes called a GC glitch) is a sudden, sharp increase in response time or CPU usage caused by a long Stop-The-World (STW) pause during garbage collection.

Typical Causes

Short-lived large objects: Frequent allocation of large arrays or collections fills the young generation, leading to frequent Minor GCs with growing pause times.

Memory leaks: Unreleased static collections, unclosed resources, or improper ThreadLocal usage fill the old generation, triggering long Full GCs.

Massive long-living small objects: In the case study, ~500 MB of index data consists of millions of tiny objects that stay alive for a long time, causing repeated copying in the young generation.

Heap configuration: A heap that is too small or too large affects both GC frequency and pause length.

GC parameters: A misconfigured MaxGCPauseMillis, promotion threshold, or collector choice can exacerbate pauses.

External factors: Heavy synchronous logging, batch jobs processing huge data sets, and similar workloads increase GC pressure.
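As a concrete illustration of the memory-leak cause above, here is a minimal sketch of the classic unbounded static-cache anti-pattern; the class name and payload sizes are invented for the example:

```java
import java.util.HashMap;
import java.util.Map;

public class LeakDemo {
    // Anti-pattern: a static cache that is only ever written to. Entries are
    // never evicted, so every payload stays strongly reachable forever,
    // slowly filling the old generation until long Full GCs kick in.
    static final Map<String, byte[]> CACHE = new HashMap<>();

    static void handleRequest(String requestId) {
        // 1 MB cached per request with no eviction policy
        CACHE.put(requestId, new byte[1024 * 1024]);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            handleRequest("req-" + i);
        }
        // 100 MB is now pinned in the heap; no collector can reclaim it.
        System.out.println(CACHE.size()); // prints 100
    }
}
```

A bounded cache (for example, an LRU map with a size limit) or explicit removal on request completion avoids this growth.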

Impact of GC Spikes

Severe latency jitter (P99/P999 spikes) degrading user experience.

Throughput drop as CPU is consumed by GC rather than business logic.

Upstream timeouts and cascading failures in distributed systems.

Case Study: High‑Concurrency Service (System A)

System A processes >100k QPS (up to 400k during promotions) with millisecond‑level latency requirements. Every 15 minutes it performs a full‑index reload (~500 MB). During a promotion, upstream services reported intermittent TimeoutException alerts. Correlation of monitoring data showed periodic response‑time spikes matching Full GC timestamps.

Root‑Cause Analysis

The index consists of many small objects (each a few KB) allocated in Eden. Because they remain live for a long time, they are first copied to a Survivor space and later promoted to the old generation. This double copying inflates the Object Copy phase of Young GC to over 200 ms, producing STW pauses long enough to time out requests.

Young GCs are frequent but usually fast; a few outliers show dramatically longer pause times.

Each long pause coincides with the index copy‑and‑promote cycle.

Optimization Process

1. Promote Early

Adjust the promotion threshold so the large index jumps to the old generation after the first copy.

-Xms12g -Xmx12g
-XX:MetaspaceSize=512m
-XX:MaxMetaspaceSize=512m
-XX:+UseG1GC
-XX:G1HeapRegionSize=16M
-XX:MaxGCPauseMillis=100
-XX:InitiatingHeapOccupancyPercent=45
-XX:+HeapDumpOnOutOfMemoryError
-XX:MaxDirectMemorySize=1g

Experiments:

-XX:MaxTenuringThreshold=0 – forces promotion after the first Young GC.
-XX:InitialTenuringThreshold=1 – similar effect.
-XX:+AlwaysTenure – every object surviving a Young GC is promoted directly.

Verification showed the index flow changed from Eden → S0 → Old to a direct Eden → Old path, cutting the copy count from 2 to 1 and halving the pause.
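One way to perform this kind of verification is JDK 9+ unified GC logging, which can print the survivor age table after every Young GC (a sketch; the log file name is arbitrary):

-Xlog:gc*:file=gc.log:time,uptime,level,tags
-Xlog:gc+age=trace

With the lowered tenuring threshold in effect, entries at age 2 or higher should disappear from the age table, because every surviving object is promoted after its first collection.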

2. Direct Old‑Generation Allocation Attempt

Tried -XX:PretenureSizeThreshold and an increased -XX:G1HeapRegionSize so the index would be allocated directly in the old generation as a humongous object. Neither worked: -XX:PretenureSizeThreshold applies to the Serial and ParNew collectors and is ignored by G1, and G1's humongous allocation only applies to a single object larger than half a region — the index is composed of millions of tiny objects, not one large array.

3. Switch to Low‑Pause Collector (ZGC)

Enabled ZGC on JDK 11 (-XX:+UnlockExperimentalVMOptions -XX:+UseZGC; ZGC was still experimental in that release). STW pauses effectively disappeared, raising the success rate to 99.5%. However, during index reload an Allocation Stall appeared when the allocation rate outpaced ZGC's concurrent collection (the single-generation ZGC in JDK 11 has no Eden), causing a brief latency spike.

4. Gray‑Release + Active Pre‑Heat Strategy

During a controlled traffic cut‑off, the service loads the new index and then creates a large number of temporary objects to exhaust Eden, forcing an immediate Young GC that promotes the index to the old generation. After traffic resumes, the young generation is empty, so subsequent GCs are fast.

public boolean switchIndex(String indexPath) {
    try {
        MyIndex newIndex = loadIndex(indexPath);
        this.index = newIndex;
        // Pre-heat: allocate ~10 GB of short-lived 1 MB char arrays to fill
        // Eden and force a Young GC while traffic is still cut off, so the
        // freshly loaded index is promoted to the old generation immediately.
        // Keeping a live reference in `sink` discourages the JIT from
        // eliminating the allocations as dead code.
        char[] sink = null;
        for (int i = 0; i < 10_000; i++) {
            sink = new char[524_288]; // 524,288 chars = 1 MB
        }
        return true;
    } catch (Exception e) {
        // Swallowing the exception keeps the old index serving traffic;
        // in production, log the failure here.
        return false;
    }
}
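A quick way to confirm that a pre-heat loop of this kind really triggers a collection is to read the collector MX beans before and after the allocations. This is a minimal sketch, assuming a default-sized heap much smaller than the ~10 GB of churn; the class and method names are invented for the example:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class PreheatDemo {
    // Sum of collection counts across all registered collectors.
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) { // -1 means the count is undefined for this bean
                total += count;
            }
        }
        return total;
    }

    // Runs the same pre-heat loop as switchIndex and reports whether any GC ran.
    static boolean preheatTriggersGc() {
        long before = totalGcCount();
        char[] sink = null;               // live reference so the JIT keeps the allocations
        for (int i = 0; i < 10_000; i++) {
            sink = new char[524_288];     // 1 MB per iteration, ~10 GB total churn
        }
        return sink != null && totalGcCount() > before;
    }

    public static void main(String[] args) {
        System.out.println(preheatTriggersGc());
    }
}
```

Running this under the production flags (heap size, collector) gives a cheap smoke test of the pre-heat step before wiring it into the gray-release flow.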

Results

G1GC + default: availability ≈ 95% (baseline).

-XX:MaxTenuringThreshold=0 (or InitialTenuringThreshold=1, or +AlwaysTenure): availability ≈ 98%.

ZGC + default: availability ≈ 99.5%.

G1GC + gray‑release + Eden pre‑heat: availability ≈ 99.995% and latency spikes vanished.

Conclusion

By systematically analyzing GC logs, identifying the double‑copy pattern of massive long‑living objects, and applying a combination of JVM tuning, collector migration, and controlled traffic‑cut‑off with active pre‑heat, the service eliminated GC‑induced latency spikes and achieved near‑perfect availability in a high‑throughput, low‑latency environment.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: java, performance, zgc, g1gc, jvm-tuning, gc, high-concurrency
Written by

Tech Freedom Circle

Crazy Maker Circle (Tech Freedom Architecture Circle): a community of tech enthusiasts, experts, and high‑performance fans. Many top‑level masters, architects, and hobbyists have achieved tech freedom; another wave of go‑getters are hustling hard toward tech freedom.
