Analyzing and Resolving Unexpected JVM GC Spikes in Production
This article recounts a production incident in which a container's CPU spiked because of abnormal JVM garbage collection, walks through the diagnosis step by step using top, jstack, and thread analysis, and traces the root cause to an Excel export routine.
Preface
On a Friday, while the author was writing documentation, an alert fired for CPU usage above 90%. JVM monitoring showed that one pod had performed 61 Young GCs and one Full GC within two hours, a rare and severe pattern.
Scenario
The author first shows a normal GC curve (illustrated) and then the problematic GC curve, highlighting the abnormal frequency of GC events.
Detailed Analysis
The problem was isolated to a single pod among many. The troubleshooting steps were:
1. Enter the pod and run top to view per-process resource usage.
2. Identify the Java process (PID 1), which was consuming about 130% CPU (above 100% because top sums usage across cores).
3. Run top -H -p <pid> to list that process's threads and note the thread ID (TID) using the most CPU.
4. Convert the TID (e.g., 746) to hexadecimal with printf "%x\n" 746, which prints 2ea.
5. Dump the thread stacks to a file with jstack <pid> > gc.stack. (Piping through grep 2ea at this point would keep only the matching thread header line and discard the stack frames below it.)
6. Download the generated gc.stack file via a temporary Python HTTP server and curl for local analysis.
7. Search the file for the hexadecimal thread ID, which jstack reports as nid=0x2ea, to locate the offending code.
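The TID-to-hex conversion and the search through the dump can also be scripted. The following is a minimal sketch, not the author's tooling: the class name JstackThreadFinder and its methods are hypothetical, and it assumes the usual jstack layout in which each thread block starts with a header line containing nid=0x<hex> and ends at the next blank line.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: locates one thread's block in a saved jstack dump.
class JstackThreadFinder {

    // Same conversion as `printf "%x\n" 746`, with the 0x prefix jstack uses.
    static String toNid(long tid) {
        return "0x" + Long.toHexString(tid);
    }

    // Returns the header line and stack frames of the thread whose native ID
    // matches the given decimal TID, or an empty list if it is not in the dump.
    static List<String> findThreadBlock(List<String> dumpLines, long tid) {
        // \b stops nid=0x2ea from also matching a longer ID such as nid=0x2eab.
        Pattern header = Pattern.compile("nid=" + toNid(tid) + "\\b");
        List<String> block = new ArrayList<>();
        boolean inBlock = false;
        for (String line : dumpLines) {
            if (!inBlock) {
                inBlock = header.matcher(line).find();
            } else if (line.trim().isEmpty()) {
                break; // a blank line ends the thread block
            }
            if (inBlock) {
                block.add(line);
            }
        }
        return block;
    }

    // Usage: <dump file> <decimal TID from top -H>
    public static void main(String[] args) throws Exception {
        List<String> dump = Files.readAllLines(Paths.get(args[0]));
        for (String line : findThreadBlock(dump, Long.parseLong(args[1]))) {
            System.out.println(line);
        }
    }
}
```

On Java 11 or later this can be run directly as java JstackThreadFinder.java gc.stack 746, which prints only the stack of the thread that top -H flagged.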
The investigation traced the issue to an asynchronous Excel export feature. The export reused a shared list-query interface that is capped at 200 items per page, while a single export needed tens of thousands of records, so it had to loop over many pages. The nested loops and repeated List allocations generated heavy GC activity, which eventually affected other pods as well.
The fix involved refactoring the export logic to avoid excessive object creation and redeploying the corrected code, which resolved the CPU and GC spikes.
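The source does not show the refactored code, so the following is only a sketch of one common way to keep allocation down in a paged export: process each 200-row page as it arrives and hand rows straight to a writer, instead of collecting every page into ever-growing lists. PagedExportSketch, PageQuery, and the row sink are all hypothetical names, and the data in main is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: stream a paged query to a sink, one page at a time.
class PagedExportSketch {

    // Assumed stand-in for the shared list-query interface (200 rows per page).
    interface PageQuery<T> {
        List<T> fetch(int pageIndex, int pageSize);
    }

    static final int PAGE_SIZE = 200;

    // Passes every row to rowSink and returns the number of rows written.
    // Only the current page is referenced, so earlier pages can be collected
    // cheaply instead of accumulating in one large list.
    static <T> long export(PageQuery<T> query, Consumer<T> rowSink) {
        long written = 0;
        for (int page = 0; ; page++) {
            List<T> rows = query.fetch(page, PAGE_SIZE);
            for (T row : rows) {
                rowSink.accept(row);
                written++;
            }
            if (rows.size() < PAGE_SIZE) {
                return written; // a short (or empty) page is the last one
            }
        }
    }

    public static void main(String[] args) {
        int total = 45_000; // made-up record count for illustration
        PageQuery<Integer> fakeQuery = (page, size) -> {
            int from = page * size;
            int to = Math.min(total, from + size);
            List<Integer> rows = new ArrayList<>(Math.max(0, to - from));
            for (int i = from; i < to; i++) {
                rows.add(i);
            }
            return rows;
        };
        long written = export(fakeQuery, row -> { /* write the row to the sheet here */ });
        System.out.println("rows written: " + written);
    }
}
```

In a real export the sink would write each row to the spreadsheet writer; whether that matches the actual fix depends on details the article does not give.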
Conclusion
When a production problem occurs, first make sure the service stays available, then work through the limited information you have layer by layer until the root cause is pinned down. Familiarity with tools such as Arthas can simplify this kind of troubleshooting further.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
