How to Diagnose and Fix JVM GC Pauses in High‑Concurrency Microservices

This article walks through a real‑world production case, detailing how to systematically detect, analyze, and resolve severe JVM garbage‑collection pauses in a high‑concurrency Spring Boot microservice, covering resource analysis, JVM flag tuning, G1GC migration, JMX listeners, and GC‑log investigation.

IT Services Circle

Introduction

This article walks through a real‑world production case to systematically diagnose and resolve JVM garbage‑collection (GC) performance problems in a high‑concurrency microservice built with Spring Boot.

System Background

The service runs as a microservice with the following stack:

Application framework: Spring Boot

Metrics collection: Micrometer

Monitoring system: Datadog

Micrometer supports many back‑ends such as AppOptics, Atlas, Dynatrace, Elastic, Ganglia, Graphite, Humio, Influx, Instana, JMX, KairosDB, New Relic, Prometheus, SignalFx, Stackdriver, StatsD, Wavefront, etc.

Problem Symptoms

Problem Description

Monitoring revealed severe GC pauses on one node:

Maximum GC pause time frequently > 400 ms

Peak pause reached 546 ms on 2020‑02‑04 09:20:00

GC pause time chart

Business Impact

Service timeout: callers time out after 1 s, so long GC pauses put requests at risk of timing out

Performance requirement: max pause < 200 ms, average pause < 100 ms

Business impact: severe effect on customer trading strategies
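The latency targets above can be turned into a simple check over observed pause samples. The following is a minimal sketch, not the article's monitoring setup: the class name `GcPauseSlo` and the sample values are hypothetical, and only the 200 ms / 100 ms thresholds and the 546 ms peak come from the text.

```java
import java.util.Arrays;

/** Checks observed GC pause samples against the latency targets stated above. */
final class GcPauseSlo {
    static final long MAX_PAUSE_MS = 200;    // target from the article: max pause < 200 ms
    static final double AVG_PAUSE_MS = 100;  // target from the article: average pause < 100 ms

    /** Returns true only if both the maximum and the average pause are under target. */
    static boolean meetsTargets(long[] pausesMs) {
        if (pausesMs.length == 0) {
            return true; // no pauses observed, so nothing violates the targets
        }
        long max = Arrays.stream(pausesMs).max().getAsLong();
        double avg = Arrays.stream(pausesMs).average().getAsDouble();
        return max < MAX_PAUSE_MS && avg < AVG_PAUSE_MS;
    }

    public static void main(String[] args) {
        // Hypothetical samples; 546 ms is the peak pause reported in the article.
        long[] observed = {35, 80, 546};
        System.out.println("meets targets: " + meetsTargets(observed));
    }
}
```

In practice the samples would come from the metrics pipeline rather than a hard-coded array; the point is only that both conditions must hold at once.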

Investigation Process

Step 1 – System Resource Analysis

CPU Load

CPU usage was examined; the monitoring chart shows:

CPU load chart

Observed values: system load 4.92, CPU utilization ~7 %.

GC Memory Usage

Around 09:25, old_gen usage drops sharply, which indicates a Full GC. Around 09:20, however, old_gen usage rises gradually with no such drop, so the long pause at 09:20 was not caused by a Full GC.

Old generation memory chart
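The reasoning above (a sharp drop in old-generation usage suggests a Full GC, a gradual rise does not) can be sketched in code. This is an illustrative sketch only: the class `OldGenWatcher`, the 50 % drop threshold, and the sample values are assumptions, and the pool-name matching is a heuristic because pool names differ between collectors.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

final class OldGenWatcher {
    /** Returns the indexes where usage fell by more than dropFraction of the previous sample. */
    static List<Integer> sharpDrops(long[] usedBytes, double dropFraction) {
        List<Integer> drops = new ArrayList<>();
        for (int i = 1; i < usedBytes.length; i++) {
            long prev = usedBytes[i - 1];
            if (prev > 0 && (prev - usedBytes[i]) > prev * dropFraction) {
                drops.add(i);
            }
        }
        return drops;
    }

    /** Current old-generation usage in bytes, or -1 if no matching pool is found. */
    static long currentOldGenUsed() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // Names vary by collector, e.g. "PS Old Gen" (ParallelGC) or "G1 Old Gen" (G1).
            if (name.contains("Old Gen") || name.contains("Tenured")) {
                MemoryUsage usage = pool.getUsage();
                if (usage != null) {
                    return usage.getUsed();
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical samples in MB: gradual growth, then one sharp drop.
        long[] samples = {1000, 1200, 1400, 300, 350};
        System.out.println("sharp drops at indexes: " + sharpDrops(samples, 0.5));
        System.out.println("current old gen used (bytes): " + currentOldGenUsed());
    }
}
```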

Step 2 – JVM Configuration Analysis

Startup Parameters

-Xmx4g -Xms4g

JDK version: 8

GC: default ParallelGC

Heap size: 4 GB (initial and max)
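Settings like these can also be confirmed from inside the running process instead of from the launch script. The sketch below uses only the standard `java.lang.management` API; the class name `JvmConfigReport` is hypothetical. On JDK 8 with ParallelGC the collector beans are typically named "PS Scavenge" and "PS MarkSweep", while G1 reports "G1 Young Generation" and "G1 Old Generation".

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class JvmConfigReport {
    /** Names of the collectors the JVM is actually running with. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("JVM arguments: "
                + ManagementFactory.getRuntimeMXBean().getInputArguments());
        System.out.println("Collectors:    " + collectorNames());
        System.out.println("Max heap MiB:  " + maxHeapMiB());
    }
}
```

The reported maximum heap can be somewhat lower than the -Xmx value, depending on the collector, so treat it as an approximate cross-check.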

Initial Hypothesis

ParallelGC may be the root cause because it optimizes throughput at the expense of pause time.

First Optimization Attempt – Switch to G1GC

Why G1GC

Stability in JDK 8

Good latency control

Suitable for low‑latency workloads

Configuration

Initial (failed) config

# Parameter typo caused startup failure
-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMills=50ms

Errors:

Typo: MaxGCPauseMills → MaxGCPauseMillis

Value format: 50ms → 50

Corrected config

-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMillis=50

After redeployment the service started successfully and monitoring showed GC pauses mostly under 50 ms.
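Because a mistyped flag was the first obstacle here, it can help to read back the values the JVM actually applied. The following sketch assumes a HotSpot-based JVM, where `com.sun.management.HotSpotDiagnosticMXBean` is available; the class name `EffectiveFlags` is hypothetical, and on other JVMs the lookup simply returns an empty result.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.Optional;

final class EffectiveFlags {
    /** Returns the effective value of a HotSpot VM option, or empty if it cannot be read. */
    static Optional<String> vmOption(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return Optional.empty(); // not a HotSpot JVM
            }
            return Optional.ofNullable(bean.getVMOption(name).getValue());
        } catch (RuntimeException e) {
            // Unknown option names raise IllegalArgumentException.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        for (String flag : new String[] {"UseG1GC", "MaxGCPauseMillis", "ParallelGCThreads"}) {
            System.out.println(flag + " = " + vmOption(flag).orElse("<unavailable>"));
        }
    }
}
```

Running this inside the service (or in a small diagnostic endpoint) shows whether the intended settings took effect after redeployment.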

G1GC early effect chart

Unexpected “Easter Egg”

Later, a pause of 1300 ms appeared, and further monitoring showed long pauses recurring in the same pattern.

Long pause chart

Register GC Event Listener via JMX

Code to register a listener for each GarbageCollectorMXBean:

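import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;

// Register a listener on each garbage collector MXBean
for (GarbageCollectorMXBean mbean : ManagementFactory.getGarbageCollectorMXBeans()) {
    if (!(mbean instanceof NotificationEmitter)) {
        continue; // this bean does not support notifications
    }
    NotificationEmitter emitter = (NotificationEmitter) mbean;
    // getNewListener(...) is the article's own helper that builds the listener
    NotificationListener listener = getNewListener(mbean);
    // null filter and null handback: receive every notification from this bean
    emitter.addNotificationListener(listener, null, null);
}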

The listener prints detailed GC event JSON, revealing a young‑generation pause of 1.869 s with 48 GC worker threads.

{
  "duration":1869,
  "maxPauseMillis":1869,
  "promotedBytes":"139MB",
  "gcCause":"G1 Evacuation Pause",
  "collectionTime":27281,
  "gcAction":"end of minor GC",
  "afterUsage":{
    "G1 Old Gen":"1745MB",
    "Code Cache":"53MB",
    "G1 Survivor Space":"254MB",
    "Compressed Class Space":"9MB",
    "Metaspace":"81MB",
    "G1 Eden Space":"0"
  },
  "gcId":326,
  "collectionCount":326,
  "gcName":"G1 Young Generation",
  "type":"jvm.gc.pause"
}
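The source does not show the body of getNewListener. One possible shape, offered as a sketch rather than the author's implementation, uses HotSpot's `com.sun.management.GarbageCollectionNotificationInfo` to decode each notification. The class name `GcListenerSketch`, the `EVENTS` counter, and the plain-text output format are assumptions; the article's listener emits JSON with more fields.

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import com.sun.management.GcInfo;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicInteger;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;

final class GcListenerSketch {
    /** Number of GC notifications handled so far (illustrative only). */
    static final AtomicInteger EVENTS = new AtomicInteger();

    /** Hypothetical stand-in for the article's getNewListener helper. */
    static NotificationListener getNewListener(GarbageCollectorMXBean mbean) {
        return (Notification notification, Object handback) -> {
            if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                    .equals(notification.getType())) {
                return; // ignore anything that is not a GC notification
            }
            GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
                    .from((CompositeData) notification.getUserData());
            GcInfo gc = info.getGcInfo();
            EVENTS.incrementAndGet();
            System.out.printf("gcName=%s gcAction=%s gcCause=%s gcId=%d duration=%dms%n",
                    info.getGcName(), info.getGcAction(), info.getGcCause(),
                    gc.getId(), gc.getDuration());
        };
    }

    /** Registers a listener on every collector that supports notifications. */
    static int registerAll() {
        int registered = 0;
        for (GarbageCollectorMXBean mbean : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (mbean instanceof NotificationEmitter) {
                ((NotificationEmitter) mbean)
                        .addNotificationListener(getNewListener(mbean), null, null);
                registered++;
            }
        }
        return registered;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("listeners registered: " + registerAll());
        System.gc();       // request a collection so there is something to report
        Thread.sleep(500); // notifications are delivered asynchronously
        System.out.println("events seen: " + EVENTS.get());
    }
}
```

The gcName, gcAction, gcCause, gcId, and duration values printed here correspond to the fields of the same names in the JSON event above.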

GC Log Analysis

Enabling -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps produced logs showing a 1.87 s pause with 48 parallel GC threads, while the container was limited to 4 CPU cores.

GC log excerpt

The mismatch between JVM‑detected CPU count (≈72) and the pod limit (4 cores) caused massive thread contention.
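The figures of 72 CPUs and 48 GC threads are consistent with the heuristic HotSpot is commonly documented to use for the default ParallelGCThreads on most platforms: one thread per CPU up to 8, then 5/8 of a thread for each additional CPU. The sketch below reproduces that arithmetic; treat the formula as an assumption to verify on your own JVM (for example with -XX:+PrintFlagsFinal), and note that the class name `GcThreadDefaults` is hypothetical.

```java
final class GcThreadDefaults {
    /**
     * Approximates HotSpot's default ParallelGCThreads for a given CPU count:
     * all CPUs up to 8, then 5/8 of each CPU beyond the first 8.
     */
    static int defaultParallelGcThreads(int cpus) {
        return cpus <= 8 ? cpus : 8 + (cpus - 8) * 5 / 8;
    }

    public static void main(String[] args) {
        int visible = Runtime.getRuntime().availableProcessors();
        System.out.println("CPUs visible to this JVM: " + visible);
        System.out.println("estimated default GC threads: " + defaultParallelGcThreads(visible));
        // 72 host CPUs, as in the article: 8 + (72 - 8) * 5 / 8 = 48
        System.out.println("for 72 CPUs: " + defaultParallelGcThreads(72));
    }
}
```

With 48 workers competing for a 4-core quota, each worker gets only a small fraction of a core, which matches the contention described above.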

CPU load chart with pod limit

Final Solution – Limit GC Parallel Threads

Adding -XX:ParallelGCThreads=4 aligns GC workers with the pod’s CPU quota:

-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:ParallelGCThreads=4 -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps

After restart, GC pauses stayed within the 50 ms target.
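Rather than hard-coding 4, the thread count could be derived from the container's CPU quota. The sketch below assumes cgroup v1 file locations (`/sys/fs/cgroup/cpu/cpu.cfs_quota_us` and `cpu.cfs_period_us`); cgroup v2 hosts expose the quota differently, and the class name `CgroupCpuLimit` is hypothetical. It is a way to sanity-check the value passed to -XX:ParallelGCThreads, not something the article's service is stated to do.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class CgroupCpuLimit {
    /** Converts a CFS quota/period pair into a whole number of CPUs, capped at hostCpus. */
    static int cpusFromQuota(long quotaUs, long periodUs, int hostCpus) {
        if (quotaUs <= 0 || periodUs <= 0) {
            return hostCpus; // a quota of -1 means "no limit"
        }
        long cpus = (quotaUs + periodUs - 1) / periodUs; // round up to whole CPUs
        return (int) Math.max(1, Math.min(cpus, hostCpus));
    }

    /** Reads a single number from a file, returning fallback if it is missing or malformed. */
    static long readLong(Path path, long fallback) {
        try {
            return Long.parseLong(new String(Files.readAllBytes(path)).trim());
        } catch (IOException | NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Assumed cgroup v1 paths; adjust for your container runtime.
        long quota = readLong(Paths.get("/sys/fs/cgroup/cpu/cpu.cfs_quota_us"), -1);
        long period = readLong(Paths.get("/sys/fs/cgroup/cpu/cpu.cfs_period_us"), -1);
        int host = Runtime.getRuntime().availableProcessors();
        System.out.println("suggested -XX:ParallelGCThreads=" + cpusFromQuota(quota, period, host));
    }
}
```

For the pod in this case, a quota of 400000 µs over a 100000 µs period yields 4, the same value chosen above.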

Post‑tuning GC pause chart

Case Summary and Takeaways

Quantitative monitoring is essential for JVM performance tuning.

In containerized environments, JVM‑visible CPU cores must be reconciled with Kubernetes limits.

Switching to G1GC and setting ParallelGCThreads to match the CPU quota dramatically reduced pause times in this case.

Combining metric monitoring, JVM flag tuning, GC‑log analysis, and JMX listeners provides a systematic troubleshooting workflow.
