How I Traced a Sudden CPU Spike to JVM GC Issues in a Container
After receiving an alarm that a production container's CPU usage had surged past 90%, I investigated the JVM metrics, discovered excessive young and full GCs in a single pod, and here walk through the troubleshooting steps (top, per-thread analysis, jstack, and a code fix) that resolved the issue.
Background
On Friday, while I was writing documentation, an alarm fired: a container's CPU usage had exceeded 90%. Monitoring showed one pod had triggered 61 young GCs and one full GC within two hours, a rare and serious situation for this service.
Normal JVM Monitoring Curve
Typical JVM metrics show minimal GC activity.
Problematic JVM Monitoring Curve
The problematic pod exhibits frequent GC events, including a full GC.
Detailed Analysis
Enter the pod and run top to view Linux process resource usage.
Identify the Java process (PID 1, as is typical inside a container) consuming high CPU (about 130% across multiple cores).
Run top -H -p <pid> to find the thread ID (tid) with the highest CPU usage.
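As a cross-check on top -H, the JVM can report per-thread CPU time itself through the standard java.lang.management API. Note that ThreadMXBean returns Java thread IDs, not the native tids that top shows, so this complements rather than replaces the top -H route; the sketch below is illustrative, not part of the original investigation.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class HotThreads {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) {
            System.out.println("per-thread CPU time not supported on this JVM");
            return;
        }
        for (long id : mx.getAllThreadIds()) {
            ThreadInfo info = mx.getThreadInfo(id);
            long cpuNanos = mx.getThreadCpuTime(id); // -1 if the thread has already died
            if (info != null && cpuNanos > 0) {
                // These are Java thread IDs, not the native nid/tid that top -H reports
                System.out.printf("id=%d cpu=%dms name=%s%n",
                        id, cpuNanos / 1_000_000, info.getThreadName());
            }
        }
    }
}
```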
Convert the tid (e.g., 746) to hexadecimal using printf "%x\n" 746, because jstack prints native thread IDs in hex (as nid=0x2ea).
Execute jstack <pid> | grep <hex_tid> > gc.stack to extract the lines for the offending thread.
Download the gc.stack file to a local machine for easier inspection, e.g., by starting a temporary Python HTTP server in the pod and fetching the file with curl.
In gc.stack, find the thread whose nid matches the hex tid (e.g., nid=0x2ea), read its topmost application frames, and locate the corresponding method in the source code.
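The printf step maps the decimal tid from top -H to the hex form jstack uses; the same conversion can be done in Java (746 is the tid from this incident, and nidPattern is a hypothetical helper name):

```java
public class TidToNid {
    // jstack renders each thread's native ID as nid=0x<hex>,
    // so a decimal tid from `top -H` must be converted before grepping.
    static String nidPattern(int tid) {
        return "nid=0x" + Integer.toHexString(tid);
    }

    public static void main(String[] args) {
        System.out.println(nidPattern(746)); // nid=0x2ea
    }
}
```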
This revealed that the Excel export feature reuses a common list-query API that paginates only 200 records per batch, while an export may request tens of thousands of records, producing nested query loops and a flood of short-lived objects that churned the young generation.
The fix was to stop routing the export through the shared list query and optimize the export logic; after redeploying, the GC spikes disappeared.
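The article does not include the original source, but the pattern it describes (a bulk export driven through a 200-records-per-page shared query) and the direction of the fix can be sketched as follows; queryPage, exportViaSharedApi, and exportStreaming are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ExportSketch {
    static final int PAGE_SIZE = 200; // page size of the shared list-query API (from the article)

    // Hypothetical stand-in for the shared paginated query.
    static List<String> queryPage(int offset, int total) {
        List<String> page = new ArrayList<>();
        for (int i = offset; i < Math.min(offset + PAGE_SIZE, total); i++) {
            page.add("record-" + i);
        }
        return page;
    }

    // Anti-pattern described in the article: exporting N records through the
    // 200-record page API allocates hundreds of short-lived lists, churning
    // the young generation.
    static List<String> exportViaSharedApi(int total) {
        List<String> all = new ArrayList<>();
        for (int offset = 0; offset < total; offset += PAGE_SIZE) {
            all.addAll(queryPage(offset, total)); // one fresh list per page
        }
        return all;
    }

    // One possible fix direction: stream rows to the writer instead of
    // accumulating them, so heap usage stays bounded regardless of export size.
    static void exportStreaming(int total, Consumer<String> writer) {
        for (int offset = 0; offset < total; offset += PAGE_SIZE) {
            queryPage(offset, total).forEach(writer);
        }
    }
}
```

The key design point is that a query API tuned for interactive paging makes a poor backend for bulk exports; giving the export its own query path avoids both the nested loops and the allocation churn.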
Conclusion
The incident was frightening at first, but systematic troubleshooting—checking service availability, isolating the affected pod, and analyzing JVM metrics—allowed a quick resolution. The experience reinforced the importance of staying calm, investigating thoroughly, and promptly fixing performance‑critical code.
Java Backend Technology
Focus on Java-related technologies: SSM, Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading. Occasionally cover DevOps tools like Jenkins, Nexus, Docker, and ELK. Also share technical insights from time to time, committed to Java full-stack development!
