How to Diagnose and Fix JVM GC Pause Issues in High‑Concurrency Microservices
This article walks through a real‑world production case, showing how to systematically detect, analyze, and resolve severe JVM garbage‑collection pauses that caused service timeouts, by examining CPU, memory, GC logs, adjusting collector settings, and aligning JVM threads with Kubernetes resource limits.
The article presents a step‑by‑step troubleshooting case of a high‑concurrency microservice built with Spring Boot, Micrometer for metrics, and Datadog for monitoring. The service suffered frequent GC pauses exceeding 400 ms, with a peak of 546 ms, leading to request timeouts (1 s) and degraded business performance.
System Background
Technology stack: Spring Boot, Micrometer (metrics), Datadog (monitoring). Micrometer supports many back‑ends such as Datadog, Prometheus, etc.
Problem Symptoms
Maximum GC pause time often > 400 ms.
Peak pause reached 546 ms on 2020‑02‑04 09:20:00.
Service timeout threshold is 1 s, so long pauses cause failures.
Target GC pause: max < 200 ms, average < 100 ms.
Investigation Process
1. System Resource Analysis – CPU
CPU usage was low (≈7 %); load average was 4.92, indicating no CPU saturation.
2. JVM Memory Usage
GC logs showed a sharp drop in old_gen around 09:25, confirming a Full GC, but the long pauses occurred before that, when old generation was slowly increasing.
The memory pool name ps_old_gen indicates the default ParallelGC collector.
3. JVM Configuration Analysis
-Xmx4g -Xms4gJDK version: 8
GC: default ParallelGC (no explicit -XX:+UseParallelGC)
Heap size: 4 GB (both max and initial)
Initial hypothesis: ParallelGC’s emphasis on throughput may cause long stop‑the‑world pauses.
4. First Optimization – Switch to G1GC
Attempted to enable G1GC with the following (incorrect) parameters:
# 参数配置错误,导致启动失败
-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMills=50msErrors identified:
Misspelled flag MaxGCPauseMills → MaxGCPauseMillis Value should be numeric (50), not "50ms".
Corrected configuration:
-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMillis=50After redeploying, GC pause times dropped to < 50 ms, meeting the initial goal, but later a new spike of 1300 ms appeared, indicating the problem was not fully solved.
5. Register GC Event Listener (JMX)
Added a listener to capture GC events directly:
// Register listener for each memory pool
for (GarbageCollectorMXBean mbean : ManagementFactory.getGarbageCollectorMXBeans()) {
if (!(mbean instanceof NotificationEmitter)) {
continue; // not listenable
}
NotificationEmitter emitter = (NotificationEmitter) mbean;
NotificationListener listener = getNewListener(mbean);
emitter.addNotificationListener(listener, null, null);
}Sample logged JSON:
{
"duration":1869,
"maxPauseMillis":1869,
"promotedBytes":"139MB",
"gcCause":"G1 Evacuation Pause",
"collectionTime":27281,
"gcAction":"end of minor GC",
"afterUsage":{
"G1 Old Gen":"1745MB",
"Code Cache":"53MB",
"G1 Survivor Space":"254MB",
"Compressed Class Space":"9MB",
"Metaspace":"81MB",
"G1 Eden Space":"0"
},
"gcId":326,
"collectionCount":326,
"gcName":"G1 Young Generation",
"type":"jvm.gc.pause"
}Logs revealed that the long pauses were caused by young‑generation GC (not Full GC) with a pause of 1.869 s, far exceeding the 50 ms target.
6. Detailed GC Log Analysis
Collected GC logs with:
-Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStampsKey observations:
Physical memory reported as 144 GB (free 58 GB), while the pod was limited to 4 CPU cores and 8 GB RAM.
Parallel GC threads defaulted to ~48 (5/8 of 72 logical CPUs visible to the JVM).
During a pause, 48 GC worker threads competed for only 4 CPU cores, causing massive context‑switch overhead.
7. Final Tuning – Align GC Threads with Container Limits
Added -XX:ParallelGCThreads=4 to match the pod’s 4‑core limit, keeping other flags:
-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:ParallelGCThreads=4
-Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStampsAfter restart, GC pause times consistently stayed below 50 ms, confirming the fix.
Case Summary & Lessons Learned
Quantitative monitoring is essential; JVM tuning must be data‑driven.
In containerized environments, JVM‑visible CPU count may differ from the actual limits, leading to mismatched GC thread counts.
Adjust -XX:ParallelGCThreads (or equivalent) to the number of CPUs allocated to the pod.
Switching to G1GC can reduce pause latency, but proper thread configuration remains critical.
Using JMX listeners and detailed GC logs provides deep insight into pause causes.
By following a systematic approach—observing symptoms, forming hypotheses, validating with logs and metrics, and applying targeted configuration changes—the team resolved the severe GC pause issue and restored service reliability.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
