Diagnosing Excessive GC and CPU Spikes in a Kubernetes Java Pod
When a production pod suddenly hit 90% CPU and racked up dozens of young GCs plus a full GC within two hours, the author walks through a step-by-step investigation using top, thread-level monitoring, jstack, and stack analysis to pinpoint a Java-level memory issue and resolve it.
Scenario
On a Friday, while the author was writing documentation, the alert system reported CPU usage above 90%. Monitoring showed a single pod generating 61 young GCs and one full GC within two hours, which is unusually frequent and a serious warning sign.
Normal vs. Problematic GC Curves
Below are two monitoring screenshots: the first shows a typical JVM GC curve with minimal activity, and the second shows a spike of frequent GC events, including a full GC.
Detailed Analysis
The abnormal GC only occurred on one pod. After locating the pod in the monitoring system, the following steps were performed inside the container.
Run top to view Linux process resource usage. The Java process (PID 1) showed CPU usage of 130% (multi‑core), indicating the Java application was the culprit.
Run top -H -p <pid> to list threads of that process and identify the thread with the highest CPU consumption. The thread ID (tid) was 746.
Convert the decimal thread ID to hexadecimal, because jstack reports native thread IDs in hex (the nid field): printf "%x\n" 746 → 2ea.
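If no shell utility is at hand, the same conversion can be done from jshell; this is just an equivalent one-liner, not part of the original write-up:

```java
// jshell equivalent of: printf "%x\n" 746
// jstack prints native thread IDs in hex, e.g. nid=0x2ea
System.out.println(Integer.toHexString(746));   // prints: 2ea
```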
Capture a thread dump with jstack: jstack <pid> > gc.stack. The jstack tool snapshots all Java threads; a quick jstack <pid> | grep 2ea confirms the busy thread is present (its header line contains nid=0x2ea), while saving the full dump keeps the stack frames beneath that header for later analysis.
Since the container offered no convenient way to copy files out, the stack file was served via a simple HTTP server: python -m SimpleHTTPServer 8080, then downloaded from another machine, e.g. curl -O http://<ip>:8080/gc.stack.
Open the downloaded gc.stack locally and search for 2ea to locate the stack frames of the problematic thread.
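For containers that ship without jstack (or even Python), roughly the same information can be pulled from inside the JVM. This is not what the author did, just a minimal sketch using the standard ThreadMXBean API to rank threads by CPU time and print the busiest one's stack:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Rough in-process alternative to "top -H" + jstack: rank live threads by
// accumulated CPU time and print the stack of the busiest one.
public class BusyThreadDump {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long busiestId = -1;
        long maxCpuNanos = -1;
        for (long id : mx.getAllThreadIds()) {
            long cpu = mx.getThreadCpuTime(id); // -1 if the thread died or CPU timing is unsupported
            if (cpu > maxCpuNanos) {
                maxCpuNanos = cpu;
                busiestId = id;
            }
        }
        ThreadInfo info = mx.getThreadInfo(busiestId, Integer.MAX_VALUE);
        if (info != null) {
            System.out.printf("Busiest thread: %s (id=%d, ~%d ms CPU)%n",
                    info.getThreadName(), busiestId, maxCpuNanos / 1_000_000);
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```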
Trace the stack frames back to the implementation code. The issue originated in an asynchronous Excel export feature that reused a common list‑query API limited to 200 items per page, while the export request could involve tens of thousands of records.
The code performed nested loops and repeatedly created new lists, causing excessive object allocation and frequent GC, which eventually triggered a full GC and pod restart.
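The original code is not shown in the article, but the anti-pattern it describes looks roughly like the sketch below; names such as OrderQueryService, queryPage, and ExportRow are illustrative, not from the source.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the described anti-pattern: the async Excel export reuses a list-query
// API capped at 200 rows per page and accumulates every page (plus per-row copies)
// into ever-growing in-memory lists, driving heavy allocation and frequent GC.
public class ExcelExportJob {

    private static final int PAGE_SIZE = 200; // limit inherited from the shared list-query API

    public List<ExportRow> buildExport(OrderQueryService query, long totalRecords) {
        List<ExportRow> allRows = new ArrayList<>();
        long pages = (totalRecords + PAGE_SIZE - 1) / PAGE_SIZE; // tens of thousands of records => hundreds of pages
        for (int page = 1; page <= pages; page++) {
            List<Order> batch = query.queryPage(page, PAGE_SIZE);
            // A fresh intermediate list is created for every page and every row,
            // so allocation pressure grows with the export size.
            List<ExportRow> converted = new ArrayList<>();
            for (Order order : batch) {
                converted.add(ExportRow.from(order));
            }
            allRows.addAll(converted);
        }
        return allRows; // the entire export stays in memory until the file is written
    }

    interface OrderQueryService {
        List<Order> queryPage(int page, int pageSize);
    }

    record Order(long id, String customer) {}

    record ExportRow(long orderId, String customer) {
        static ExportRow from(Order o) { return new ExportRow(o.id(), o.customer()); }
    }
}
```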
The fix involved refactoring the export logic to avoid the large in‑memory list and to paginate properly, then redeploying the service. After the hotfix, GC frequency dropped dramatically and the pod stabilized.
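The actual patch is not shown either, but the general shape of such a refactor is to write each page straight to the output (for example via a streaming writer such as Apache POI's SXSSFWorkbook) so that at most one page of converted rows is alive at a time. A hedged sketch, with the same illustrative names as above:

```java
import java.util.List;

// Sketch of a streaming export: each page is written to the sheet as soon as it is
// fetched, so no more than PAGE_SIZE converted rows are held in memory at once.
public class StreamingExcelExportJob {

    private static final int PAGE_SIZE = 200;

    public void export(OrderQueryService query, SheetWriter sheet) {
        int page = 1;
        while (true) {
            List<Order> batch = query.queryPage(page, PAGE_SIZE);
            if (batch.isEmpty()) {
                break; // no more records; the export is complete
            }
            for (Order order : batch) {
                sheet.writeRow(order.id(), order.customer()); // row goes straight to the output
            }
            page++;
        }
        sheet.flush();
    }

    interface OrderQueryService {
        List<Order> queryPage(int page, int pageSize);
    }

    /** Hypothetical wrapper over a streaming writer, e.g. POI's SXSSFWorkbook. */
    interface SheetWriter {
        void writeRow(Object... cells);
        void flush();
    }

    record Order(long id, String customer) {}
}
```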
Conclusion
The incident highlights the importance of staying calm during alerts, isolating the problematic pod, and using low‑level JVM tools to trace thread activity. Checking whether all pods are affected helps decide if scaling or a restart is needed. In this case, only one pod was impacted, and a targeted code fix resolved the issue.
Author: 我再也不喝酒啦 Source: juejin.cn/post/7139202066362138654
