How to Diagnose and Fix JVM GC Pause Issues in High‑Concurrency Microservices

This article walks through a real‑world production case, showing how to systematically detect, analyze, and resolve severe JVM garbage‑collection pauses that caused service timeouts, by examining CPU, memory, GC logs, adjusting collector settings, and aligning JVM threads with Kubernetes resource limits.

dbaplus Community
dbaplus Community
dbaplus Community
How to Diagnose and Fix JVM GC Pause Issues in High‑Concurrency Microservices

The article presents a step‑by‑step troubleshooting case of a high‑concurrency microservice built with Spring Boot, Micrometer for metrics, and Datadog for monitoring. The service suffered frequent GC pauses exceeding 400 ms, with a peak of 546 ms, leading to request timeouts (1 s) and degraded business performance.

System Background

Technology stack: Spring Boot, Micrometer (metrics), Datadog (monitoring). Micrometer supports many back‑ends such as Datadog, Prometheus, etc.

Problem Symptoms

Maximum GC pause time often > 400 ms.

Peak pause reached 546 ms on 2020‑02‑04 09:20:00.

Service timeout threshold is 1 s, so long pauses cause failures.

Target GC pause: max < 200 ms, average < 100 ms.

GC pause monitoring chart
GC pause monitoring chart

Investigation Process

1. System Resource Analysis – CPU

CPU usage was low (≈7 %); load average was 4.92, indicating no CPU saturation.

CPU load chart
CPU load chart

2. JVM Memory Usage

GC logs showed a sharp drop in old_gen around 09:25, confirming a Full GC, but the long pauses occurred before that, when old generation was slowly increasing.

Heap usage chart
Heap usage chart

The memory pool name ps_old_gen indicates the default ParallelGC collector.

3. JVM Configuration Analysis

-Xmx4g -Xms4g

JDK version: 8

GC: default ParallelGC (no explicit -XX:+UseParallelGC)

Heap size: 4 GB (both max and initial)

Initial hypothesis: ParallelGC’s emphasis on throughput may cause long stop‑the‑world pauses.

4. First Optimization – Switch to G1GC

Attempted to enable G1GC with the following (incorrect) parameters:

# 参数配置错误,导致启动失败
-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMills=50ms

Errors identified:

Misspelled flag MaxGCPauseMillsMaxGCPauseMillis Value should be numeric (50), not "50ms".

Corrected configuration:

-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMillis=50

After redeploying, GC pause times dropped to < 50 ms, meeting the initial goal, but later a new spike of 1300 ms appeared, indicating the problem was not fully solved.

G1GC early effect chart
G1GC early effect chart

5. Register GC Event Listener (JMX)

Added a listener to capture GC events directly:

// Register listener for each memory pool
for (GarbageCollectorMXBean mbean : ManagementFactory.getGarbageCollectorMXBeans()) {
    if (!(mbean instanceof NotificationEmitter)) {
        continue; // not listenable
    }
    NotificationEmitter emitter = (NotificationEmitter) mbean;
    NotificationListener listener = getNewListener(mbean);
    emitter.addNotificationListener(listener, null, null);
}

Sample logged JSON:

{
  "duration":1869,
  "maxPauseMillis":1869,
  "promotedBytes":"139MB",
  "gcCause":"G1 Evacuation Pause",
  "collectionTime":27281,
  "gcAction":"end of minor GC",
  "afterUsage":{
    "G1 Old Gen":"1745MB",
    "Code Cache":"53MB",
    "G1 Survivor Space":"254MB",
    "Compressed Class Space":"9MB",
    "Metaspace":"81MB",
    "G1 Eden Space":"0"
  },
  "gcId":326,
  "collectionCount":326,
  "gcName":"G1 Young Generation",
  "type":"jvm.gc.pause"
}

Logs revealed that the long pauses were caused by young‑generation GC (not Full GC) with a pause of 1.869 s, far exceeding the 50 ms target.

6. Detailed GC Log Analysis

Collected GC logs with:

-Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps

Key observations:

Physical memory reported as 144 GB (free 58 GB), while the pod was limited to 4 CPU cores and 8 GB RAM.

Parallel GC threads defaulted to ~48 (5/8 of 72 logical CPUs visible to the JVM).

During a pause, 48 GC worker threads competed for only 4 CPU cores, causing massive context‑switch overhead.

7. Final Tuning – Align GC Threads with Container Limits

Added -XX:ParallelGCThreads=4 to match the pod’s 4‑core limit, keeping other flags:

-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:ParallelGCThreads=4
-Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps

After restart, GC pause times consistently stayed below 50 ms, confirming the fix.

Post‑tuning GC pause chart
Post‑tuning GC pause chart

Case Summary & Lessons Learned

Quantitative monitoring is essential; JVM tuning must be data‑driven.

In containerized environments, JVM‑visible CPU count may differ from the actual limits, leading to mismatched GC thread counts.

Adjust -XX:ParallelGCThreads (or equivalent) to the number of CPUs allocated to the pod.

Switching to G1GC can reduce pause latency, but proper thread configuration remains critical.

Using JMX listeners and detailed GC logs provides deep insight into pause causes.

By following a systematic approach—observing symptoms, forming hypotheses, validating with logs and metrics, and applying targeted configuration changes—the team resolved the severe GC pause issue and restored service reliability.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

JVMKubernetesperformance tuningSpring Bootg1gcgcParallelGCThreads
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.