How I Fixed a Weekend Network Outage and JVM Memory Leak in Our Monitoring Service

During a weekend on‑call shift I traced repeated timeout alerts to network packet loss and a severe JVM memory leak, used jstat, jmap and Eclipse MAT to pinpoint the culprit, and resolved the issue by fixing a never‑cleared Map in the Java code.

dbaplus Community

Problem

During a weekend on-call shift, the monitoring service started sending a flood of timeout alarm emails. Thread stacks showed the request threads stuck reading the HTTP response, even though the server logs indicated normal processing. The first cause identified was network packet loss, likely aggravated by a large 4 MB upload and 2 MB download on one of the interfaces.

Investigation

Further alerts arrived around 20:00, affecting almost all interfaces. Server metrics looked normal, and manual tests succeeded, so the issue seemed isolated to the monitoring probes. Attempts to pause the probes failed, indicating a deeper problem.

Resolution

Memory leak detection

Logging in to the probe server revealed CPU usage of 900% for the Java process, far above the usual 100-200%. This suggested either an infinite loop or excessive garbage collection. Running jstat -gc pid [interval] showed a Full GC occurring about once per second.

Such frequent Full GCs indicated a memory leak. Thread stacks were saved with jstack pid > jstack.log, and a heap dump was created with jmap -dump:format=b,file=heap.log pid. The service was then restarted, which stopped the alarms.
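As a complement to reading a saved jstack file, a quick count of threads per state can show whether a process is dominated by blocked or busy threads. The following is a minimal sketch that runs inside the JVM being inspected; the class name ThreadStateSummary is hypothetical and is not part of the original service. Note that threads waiting in a blocking socket read are typically reported as RUNNABLE, so the stack frames in the jstack output are still needed to confirm where they are stuck.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical helper: counts live threads in this JVM by thread state. */
class ThreadStateSummary {

    /** Returns how many live threads are currently in each Thread.State. */
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // getAllStackTraces() returns one entry per live thread, including the caller.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(t.getState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        countByState().forEach((state, n) -> System.out.println(state + ": " + n));
    }
}
```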

jstat options

-class – class loading statistics

-compiler – JIT compilation statistics

-gc – garbage-collected heap statistics

-gcXXX – detailed GC information for a specific region (e.g., -gcold)
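The GC counters that jstat prints can also be sampled from inside the process through the standard java.lang.management API, which makes it possible to raise an alert before CPU usage climbs. This is a rough sketch, not the tooling used during the incident: the class name GcPressureSampler and the threshold of 5 collections per window are assumptions. Collector names vary by GC algorithm, so the sketch sums all collectors rather than trying to isolate Full GCs.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: compares cumulative GC counts across a sampling window. */
class GcPressureSampler {

    /** Sums collection counts over all collectors; undefined counts (-1) are skipped. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** True when more collections happened in the window than the allowed maximum. */
    static boolean exceeds(long before, long after, long maxCollectionsPerWindow) {
        return (after - before) > maxCollectionsPerWindow;
    }

    public static void main(String[] args) throws InterruptedException {
        long before = totalCollections();
        Thread.sleep(1000); // one-second window, matching the jstat interval idea
        long after = totalCollections();

        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        // Assumed threshold: more than 5 collections per second is treated as suspicious.
        System.out.println("GC pressure suspected: " + exceeds(before, after, 5));
    }
}
```

In a long-running service the two samples would normally come from a scheduled task rather than a sleep in main.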

Heap dump analysis

The 4 GB heap dump was large, so it was first compressed with gzip -6. The dump (renamed to .hprof) was then opened in Eclipse MAT, and the "Leak Suspects" report was selected.

The analysis showed that a single object consumed the majority of heap memory. Drilling down revealed the offending object.

Code fix

Searching the code for the leaked object identified a Bean containing a Map<String, List<Response>>. Each probe result was appended to an ArrayList inside this Map, but the Bean was never reclaimed and the Map had no cleanup logic. Over ten days of continuous operation the Map grew until it exhausted the JVM heap. The resulting back-to-back Full GCs stalled the probe threads while they were reading responses, which produced the timeout alerts.
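The leaking shape described above can be illustrated with a small sketch. It is not the original code: the class names LeakyProbeResults and Response, the status field, and the interface key "login" are all hypothetical stand-ins. The point is only that a long-lived object which appends to per-key lists and never removes anything retains every element it has ever seen.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a probe result. */
class Response {
    final int status;

    Response(int status) {
        this.status = status;
    }
}

/** Long-lived holder that only ever appends: the leaking pattern. */
class LeakyProbeResults {
    private final Map<String, List<Response>> results = new HashMap<>();

    /** Appends a result for the given interface; nothing is ever evicted. */
    void record(String iface, Response response) {
        results.computeIfAbsent(iface, key -> new ArrayList<>()).add(response);
    }

    /** Total number of Response objects still strongly reachable from the map. */
    int retained() {
        int total = 0;
        for (List<Response> history : results.values()) {
            total += history.size();
        }
        return total;
    }

    public static void main(String[] args) {
        LeakyProbeResults holder = new LeakyProbeResults();
        for (int i = 0; i < 1000; i++) {
            holder.record("login", new Response(200));
        }
        // Every recorded result is still retained, so this grows without bound over time.
        System.out.println("retained = " + holder.retained());
    }
}
```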

A pull request was submitted to clear the Map or redesign the data structure, and the issue was resolved.
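The article does not show the actual change, so the following is only one possible redesign, sketched under assumptions: each interface keeps a fixed number of recent results and drops the oldest when full. The names BoundedProbeResults, Response, maxPerInterface, and the status values are hypothetical. Synchronized methods are used for simplicity; a real service might prefer a different concurrency strategy.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a probe result. */
class Response {
    final int status;

    Response(int status) {
        this.status = status;
    }
}

/** Keeps at most maxPerInterface recent results per interface. */
class BoundedProbeResults {
    private final int maxPerInterface;
    private final Map<String, Deque<Response>> results = new HashMap<>();

    BoundedProbeResults(int maxPerInterface) {
        if (maxPerInterface <= 0) {
            throw new IllegalArgumentException("maxPerInterface must be positive");
        }
        this.maxPerInterface = maxPerInterface;
    }

    /** Adds a result, evicting the oldest one once the per-interface limit is reached. */
    synchronized void record(String iface, Response response) {
        Deque<Response> history = results.computeIfAbsent(iface, key -> new ArrayDeque<>());
        if (history.size() == maxPerInterface) {
            history.removeFirst();
        }
        history.addLast(response);
    }

    /** Returns a copy of the retained results, oldest first. */
    synchronized List<Response> snapshot(String iface) {
        Deque<Response> history = results.get(iface);
        return history == null ? new ArrayList<>() : new ArrayList<>(history);
    }

    /** Explicit cleanup hook, for example when a probe is removed. */
    synchronized void clear(String iface) {
        results.remove(iface);
    }

    public static void main(String[] args) {
        BoundedProbeResults holder = new BoundedProbeResults(3);
        for (int i = 0; i < 10; i++) {
            holder.record("login", new Response(200 + i));
        }
        System.out.println("retained = " + holder.snapshot("login").size());
    }
}
```

Bounding the history trades completeness for a predictable memory ceiling; if full history is required, writing results to external storage instead of the heap is the usual alternative.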

Conclusion

The incident highlights the importance of monitoring JVM metrics, analyzing thread stacks, and regularly reviewing long‑running services for hidden memory leaks. Early detection of abnormal GC activity and proper resource cleanup can prevent network‑related timeouts and service disruptions.
