Troubleshooting High CPU Usage and Frequent GC in a Java Backend Pod
The article details a real‑world incident in which a Java backend pod exceeded 90% CPU usage and ran an unusually high number of young and full garbage collections. It walks through the step‑by‑step diagnostics, the stack analysis, and the code fix that resolved the performance problem.
While writing documentation on a Friday, the author received an alarm indicating CPU usage over 90% on a production platform. Monitoring the container revealed a pod that performed 61 young GCs and one full GC within two hours, a rare and serious issue.
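The GC counts quoted above came from the platform's monitoring. As a rough illustration of where such numbers originate, the following minimal sketch reads the cumulative collection counts from inside a JVM using the standard java.lang.management API. The class name GcCounts is an assumption made for this example, and the collector names it prints depend on the garbage collector in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

public class GcCounts {
    /** Returns cumulative collection counts since JVM start, keyed by collector name. */
    static Map<String, Long> snapshot() {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionCount() returns -1 when the count is undefined for a collector.
            counts.put(gc.getName(), gc.getCollectionCount());
        }
        return counts;
    }

    public static void main(String[] args) {
        // Taking two snapshots some time apart and subtracting gives the
        // number of collections in that window, which is what a GC graph plots.
        snapshot().forEach((name, count) ->
                System.out.println(name + ": " + count + " collections"));
    }
}
```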
Two monitoring graphs are shown: a normal JVM GC curve with minimal collections, and an abnormal curve where GC spikes dramatically, even triggering a full GC.
The investigation focused on the single problematic pod. Inside the pod, top showed per‑process resource usage: the Java process (PID 1) was at 130% CPU. Because top reports usage relative to a single core, a value above 100% means the process was keeping more than one core busy, which pointed to the Java application as the culprit.
The command top -H -p pid lists the threads of that process and identified the thread (TID) consuming the most CPU. The TID was then converted to hexadecimal with printf "%x\n" 746, which prints 2ea, because thread IDs appear in hex in the stack traces.
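The same decimal‑to‑hex conversion can be done in Java. This is only a sketch: the class and method names are made up for the example, and the "0x" prefix assumes a jstack output that prints the native thread ID as nid=0x…, as in the article. It is worth checking the format of your own dump, since it can differ between JDK versions.

```java
public class ThreadIdHex {
    /** Formats a decimal OS thread ID as a hex string with a 0x prefix. */
    static String toNid(int tid) {
        return "0x" + Integer.toHexString(tid);
    }

    public static void main(String[] args) {
        // 746 is the TID from the article; 746 = 2*256 + 14*16 + 10 = 0x2ea.
        System.out.println(toNid(746)); // prints 0x2ea
    }
}
```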
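Using jstack pid | grep 2ea >gc.stack, a thread dump was captured, filtered for the relevant thread, and saved to gc.stack. As written, grep keeps only the lines that contain 2ea; a context option such as grep -A 30 also keeps the stack frames that follow the thread header. Because the file was large, it was downloaded via a temporary Python HTTP server (python -m SimpleHTTPServer 8080, which is the Python 2 module; the Python 3 equivalent is python3 -m http.server 8080) and curl from a jump host.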
After retrieving the stack file locally, the author searched for the hex thread ID (2ea) and examined the stack trace, locating the implementation that performed asynchronous Excel export.
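Searching a large dump for one thread can also be scripted. The sketch below pulls out a single thread's entry from a saved dump, assuming the usual HotSpot text layout: a header line that starts with a double quote and contains nid=…, followed by the stack frames, ending at a blank line. The class name ThreadDumpFilter and the command‑line usage are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ThreadDumpFilter {
    /**
     * Returns the lines of the first thread entry whose header contains
     * nid=<nid> (for example "0x2ea"), up to the blank line that ends it.
     * Returns an empty list when no such thread is found.
     */
    static List<String> extractThread(List<String> dumpLines, String nid) {
        List<String> block = new ArrayList<>();
        String needle = "nid=" + nid + " ";
        boolean inBlock = false;
        for (String line : dumpLines) {
            if (!inBlock) {
                // The trailing space stops 0x2ea from matching 0x2eab.
                if (line.startsWith("\"") && (line + " ").contains(needle)) {
                    inBlock = true;
                    block.add(line);
                }
            } else if (line.trim().isEmpty()) {
                break; // a blank line separates thread entries
            } else {
                block.add(line);
            }
        }
        return block;
    }

    public static void main(String[] args) throws IOException {
        // Usage (hypothetical): java ThreadDumpFilter full-dump.txt 0x2ea
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        extractThread(lines, args[1]).forEach(System.out::println);
    }
}
```

This expects an unfiltered dump (plain jstack pid output) as input, since it relies on the blank lines between thread entries.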
The root cause was identified: the export function reused a common list‑query API limited to 200 items per page, while the export required tens of thousands of records per user, leading to nested loops and massive object allocation that triggered repeated GC and pod restarts.
The code was fixed to handle large data sets properly, the change was deployed urgently, and the issue was resolved.
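The article does not show the actual fix. As one possible shape for it, the sketch below exports a large result set page by page and hands each row to a writer as soon as it is fetched, so that at most one page of records is held in memory at a time. Everything here is hypothetical: PagedExport, the pageFetcher function standing in for the 200‑items‑per‑page list‑query API, and the rowSink standing in for the Excel writer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

public class PagedExport {
    static final int PAGE_SIZE = 200; // page limit mentioned in the article

    /**
     * Streams every record to rowSink one page at a time and returns the total.
     * pageFetcher.apply(pageIndex, pageSize) is a hypothetical stand-in for the
     * list-query API; it must return fewer than pageSize rows on the last page.
     */
    static <T> int export(BiFunction<Integer, Integer, List<T>> pageFetcher,
                          Consumer<T> rowSink) {
        int total = 0;
        for (int page = 0; ; page++) {
            List<T> rows = pageFetcher.apply(page, PAGE_SIZE);
            for (T row : rows) {
                rowSink.accept(row); // write immediately, do not accumulate
            }
            total += rows.size();
            if (rows.size() < PAGE_SIZE) {
                break; // a short or empty page marks the end of the data
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Fake data source with 450 records, served in pages.
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 450; i++) {
            data.add(i);
        }
        BiFunction<Integer, Integer, List<Integer>> fetcher = (page, size) ->
                data.subList(Math.min(page * size, data.size()),
                             Math.min((page + 1) * size, data.size()));
        Consumer<Integer> sink = row -> { /* append the row to the export file here */ };
        System.out.println(export(fetcher, sink) + " rows exported");
    }
}
```

Whether this fits depends on the real data access layer; a dedicated export query or a database cursor may be a better choice than reusing the paged API at all.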
In conclusion, the author reflects on the importance of staying calm during incidents, verifying service availability, and considering pod scaling or restarts when multiple pods are affected. The experience reinforced the value of diligent troubleshooting and continuous learning.
Selected Java Interview Questions
A professional Java tech channel sharing common knowledge to help developers fill gaps. Follow us!
