Why Did My Kubernetes Pod Trigger 61 GC Events? A Step‑by‑Step Debugging Guide
The author recounts a sudden CPU spike on a Kubernetes pod caused by excessive JVM garbage collection, walks through step‑by‑step diagnostics using top, thread inspection, jstack, and network file transfer, identifies a flawed Excel export loop, and shares the fix and lessons learned.
1. Scenario
The issue appeared on a Friday when a documentation task was interrupted by an alarm: CPU usage jumped above 90%. Monitoring showed a pod generating 61 young GC events and one full GC within two hours, a rare and severe problem.
Normal JVM monitoring curve
Below is a typical GC curve where GC events are infrequent.
Problematic JVM monitoring curve
The second image shows the abnormal spike with many GC events, including a full GC.
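The counts behind a curve like this can also be read from inside the JVM. The following is a minimal sketch using the standard java.lang.management API; the class name GcCountSketch is made up for illustration and is not part of the original service.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Illustrative helper (hypothetical name): reports how often each collector has run.
final class GcCountSketch {

    /** Sums the collection counts of all collectors; collectors reporting -1 (undefined) are skipped. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the collector does not expose a count
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // One line per collector (typically a young-generation and an old-generation collector).
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.println("total collections since JVM start: " + totalCollections());
    }
}
```

Sampling these counters at intervals and comparing the deltas gives the same kind of signal as the monitoring curve, without depending on the dashboard.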
2. Detailed Analysis
The abnormal GC occurred only on a single pod. After locating the pod, the author entered it and began systematic investigation.
1. Run top to view process resource usage. The screenshot shows low overall usage, but the Java process (PID 1) consumed 130% CPU on a multi‑core node, indicating the culprit.
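2. Use top -H -p pid to list the threads of the high‑CPU process:
top -H -p pid
3. Identify the thread ID (tid) of the busiest thread from the table; here the problematic thread had tid 746.
4. Convert the decimal tid to hexadecimal, because jstack reports native thread IDs (the nid field) in hex:
printf "%x\n" 746
This prints 2ea.
5. Capture the stack of that thread with jstack. Plain grep keeps only the matching header line, so -A is used here to keep the stack frames that follow it:
jstack pid | grep -A 30 2ea > gc.stack
6. Transfer the generated gc.stack file to a local machine. The author started a simple HTTP server with Python in the directory containing the file (this is the Python 2 module; on Python 3 the equivalent is python3 -m http.server 8080):
python -m SimpleHTTPServer 8080
Then used curl on the local machine to download the file, where IP is the pod's address:
curl -o gc.stack http://IP:8080/gc.stack
7. Opening the stack locally and searching for the hex thread ID revealed a stack trace pointing to an implementation method involved in asynchronous Excel export.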
8. The root cause was the export loop: it built the Excel file by repeatedly calling a shared list query capped at 200 records per page, while the export needed tens of thousands of records. The nested loops and long-lived large objects triggered frequent young GCs and eventually a full GC, causing the pod to restart.
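The thread lookup in steps 3 to 7 can be sketched in code. This is an illustration only, assuming the classic HotSpot dump format in which each thread header carries a hexadecimal nid field; the class name, the sample dump, and the com.example frame are invented for the example and do not come from the author's service.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper (hypothetical name): maps a tid from top to its block in a jstack dump.
final class ThreadStackSketch {

    // Made-up sample in the classic HotSpot layout; real dumps are much longer.
    static final String SAMPLE_DUMP =
            "\"export-worker-1\" #31 prio=5 os_prio=0 tid=0x00007f3c5c0e1000 nid=0x2ea runnable [0x00007f3c2d5f4000]\n"
            + "   java.lang.Thread.State: RUNNABLE\n"
            + "\tat com.example.ExportServiceImpl.exportExcel(ExportServiceImpl.java:88)\n"
            + "\n"
            + "\"main\" #1 prio=5 os_prio=0 tid=0x00007f3c5c00a800 nid=0x7 waiting on condition [0x00007f3c63d2e000]\n"
            + "   java.lang.Thread.State: TIMED_WAITING (sleeping)\n";

    /** Converts the decimal thread ID shown by top into the hex form used in the dump. */
    static String toNid(int decimalTid) {
        return "0x" + Integer.toHexString(decimalTid);
    }

    /** Returns the header and frames of the thread with the given nid, stopping at the next blank line. */
    static List<String> extractThread(String dump, String nid) {
        List<String> block = new ArrayList<>();
        boolean inBlock = false;
        for (String line : dump.split("\\R")) {
            // The trailing space avoids matching a longer ID such as 0x2eab.
            if (!inBlock && line.startsWith("\"") && line.contains("nid=" + nid + " ")) {
                inBlock = true;
            } else if (inBlock && line.trim().isEmpty()) {
                break;
            }
            if (inBlock) {
                block.add(line);
            }
        }
        return block;
    }

    public static void main(String[] args) {
        String nid = toNid(746);
        System.out.println("looking for nid=" + nid);
        extractThread(SAMPLE_DUMP, nid).forEach(System.out::println);
    }
}
```

Extracting the whole block rather than a single matching line matters because the frames below the header are what point to the offending method.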
After fixing the code and redeploying, the issue was resolved.
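The source does not show the corrected code. One common way to keep memory pressure low in a paged export is to write each page out as soon as it is fetched instead of accumulating every record first. The sketch below illustrates that idea only; PageQuery, exportAll, and inMemoryQuery are hypothetical names, and the row writer stands in for whatever actually appends rows to the Excel file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch (hypothetical names): stream pages to a writer instead of holding all rows.
final class PagedExportSketch {

    /** Stand-in for the shared list query: returns at most pageSize rows for a page index. */
    interface PageQuery<T> {
        List<T> fetch(int pageIndex, int pageSize);
    }

    /** Fetches page after page, hands each row to the writer, and returns the number of rows exported. */
    static <T> int exportAll(PageQuery<T> query, int pageSize, Consumer<T> rowWriter) {
        int exported = 0;
        for (int page = 0; ; page++) {
            List<T> rows = query.fetch(page, pageSize);
            for (T row : rows) {
                rowWriter.accept(row); // the page is not retained after this loop
                exported++;
            }
            if (rows.size() < pageSize) {
                break; // a short or empty page means the data is exhausted
            }
        }
        return exported;
    }

    /** Test double that pages over an in-memory list. */
    static <T> PageQuery<T> inMemoryQuery(List<T> source) {
        return (pageIndex, pageSize) -> {
            int from = Math.min(pageIndex * pageSize, source.size());
            int to = Math.min(from + pageSize, source.size());
            return source.subList(from, to);
        };
    }

    public static void main(String[] args) {
        List<Integer> source = new ArrayList<>();
        for (int i = 0; i < 50_000; i++) {
            source.add(i);
        }
        int[] written = {0};
        int exported = exportAll(inMemoryQuery(source), 200, row -> written[0]++);
        System.out.println("exported rows: " + exported);
    }
}
```

With this shape each page becomes unreachable once its rows are written, so it can be reclaimed by a young collection rather than surviving into the old generation. At 200 records per page, tens of thousands of records still mean hundreds of queries, so raising the page size for export paths is a separate decision worth considering.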
3. Conclusion
The author admits initial fear but emphasizes staying calm, verifying service availability, and checking whether the problem affects all pods before scaling or restarting. Resolving the issue on a single pod was satisfying, and the experience reinforced the importance of proactive debugging and continuous learning.
Java Backend Technology
