
Why Did My Kubernetes Pod Trigger 61 GC Events? A Step‑by‑Step Debugging Guide

The author recounts a sudden CPU spike on a Kubernetes pod caused by excessive JVM garbage collection, walks through step‑by‑step diagnostics using top, thread inspection, jstack, and network file transfer, identifies a flawed Excel export loop, and shares the fix and lessons learned.


1. Scenario

The issue appeared on a Friday when a documentation task was interrupted by an alarm: CPU usage jumped above 90%. Monitoring showed a pod generating 61 young GC events and one full GC within two hours, a rare and severe problem.

Normal JVM monitoring curve

Below is a typical GC curve where GC events are infrequent.

Problematic JVM monitoring curve

The second image shows the abnormal spike with many GC events, including a full GC.
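A curve like this can be confirmed from inside the pod itself. One common way (assuming the container image ships a full JDK, so `jstat` is on the PATH) is to sample the Java process and read the YGC (young GC) and FGC (full GC) count columns:

```shell
# Sketch: confirm GC counts for the Java process (PID 1 in this pod).
# YGC/FGC columns correspond to the young- and full-GC counts on the curve.
if command -v jstat >/dev/null 2>&1; then
  jstat -gcutil 1 2000 3   # sample PID 1 every 2 s, 3 times
else
  echo "jstat not on PATH; the image may only contain a JRE"
fi
```

Rapidly climbing YGC and a nonzero FGC over a short window match the abnormal spike shown above.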

2. Detailed Analysis

The abnormal GC occurred only on a single pod. After locating the pod, the author entered it and began systematic investigation.

1. Run top to view process resource usage:

top

The screenshot shows low overall node usage, but the Java process (PID 1) consumed 130% CPU on a multi-core node, marking it as the culprit.

2. List the threads of the high-CPU process:

top -H -p pid

3. Identify the thread ID (tid) from the table; the problematic thread had tid 746.

4. Convert the decimal tid to hexadecimal, because jstack prints thread IDs (nid) in hex:

printf "%x\n" 746

This yields 2ea.

5. Capture the stack of that thread with jstack, grepping for the hex ID (grep -A keeps the stack frames that follow the matching thread header, not just the header line):

jstack pid | grep -A 20 2ea > gc.stack

6. Transfer the generated gc.stack file to a local machine. The author started a simple HTTP server inside the pod with Python (on Python 3 the equivalent is python3 -m http.server 8080):

python -m SimpleHTTPServer 8080

Then downloaded the file from another machine with curl (-O saves it under its remote name; port 8080 matches the HTTP server above):

curl -O http://<pod-IP>:8080/gc.stack

7. Opening the stack locally and searching for the hex thread ID revealed the stack trace pointing to an implementation method involved in asynchronous Excel export.

8. The root cause was a loop that exported Excel using a shared list query limited to 200 records per page, while the export required tens of thousands of records. The nested loops and large object lifetimes triggered frequent GC and eventually a full GC, causing the pod to restart.
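The flawed pattern can be sketched in Java. This is a hypothetical reconstruction, not the author's actual code: the names and the 200-record page size are taken from the description, everything else is illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical reconstruction of the export flaw: the shared list query
// caps each page at 200 records, and the export needs tens of thousands.
public class ExcelExportSketch {
    static final int PAGE_SIZE = 200;

    // Flawed approach: accumulate every page into one giant in-memory list
    // before writing the workbook. The long-lived, growing list survives
    // young GCs, gets promoted, and eventually forces a full GC.
    static List<int[]> exportAllAtOnce(int totalRows, IntFunction<int[]> fetchPage) {
        List<int[]> all = new ArrayList<>();
        int pages = (totalRows + PAGE_SIZE - 1) / PAGE_SIZE;
        for (int p = 0; p < pages; p++) {
            all.add(fetchPage.apply(p)); // every page stays referenced until the end
        }
        return all;
    }

    // Fix: write each page to the sheet and drop it before fetching the
    // next, so each page dies young and is collected cheaply.
    static int exportPageByPage(int totalRows, IntFunction<int[]> fetchPage) {
        int written = 0;
        int pages = (totalRows + PAGE_SIZE - 1) / PAGE_SIZE;
        for (int p = 0; p < pages; p++) {
            int[] page = fetchPage.apply(p);
            written += page.length; // write this page out, then let it go
        }
        return written;
    }
}
```

The difference is object lifetime, not total work: both versions fetch the same pages, but only the first keeps them all reachable at once, which is what drives the GC pressure described above.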

After fixing the code and redeploying, the issue was resolved.
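The diagnostic sequence above can be condensed into a small script. This is a sketch using the values from this incident (PID 1, thread 746); adjust them for your own pod.

```shell
#!/bin/sh
# Condensed version of the diagnosis walked through above.
PID=1          # Java process found via `top`
TID=746        # hot thread found via `top -H -p $PID`

# jstack reports native thread IDs (nid) in hex, so convert first.
HEX_TID=$(printf "%x" "$TID")
echo "hot thread nid=0x$HEX_TID"

# Capture the matching stack frames (run inside the pod; -A 20 keeps
# the frames that follow the thread header line).
# jstack "$PID" | grep -A 20 "$HEX_TID" > gc.stack
```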

3. Conclusion

The author admits initial fear but emphasizes staying calm, verifying service availability, and checking whether the problem affects all pods before scaling or restarting. Resolving the issue on a single pod was satisfying, and the experience reinforced the importance of proactive debugging and continuous learning.

Tags: JVM, operations, Kubernetes, Java performance, GC
Written by

Java Backend Technology

Focus on Java-related technologies: SSM, Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading. Occasionally cover DevOps tools like Jenkins, Nexus, Docker, and ELK. Also share technical insights from time to time, committed to Java full-stack development!
