
Troubleshooting a JVM Memory Leak and Network Timeout Issue in a Monitoring Service

The article recounts a weekend on‑call incident in which a Java monitoring service suffered network packet loss and a severe memory leak, producing massive timeouts, high CPU usage, and frequent GC. It explains how the problem was diagnosed and resolved with tools such as top, jstat, jstack, jmap, and MAT.


During a weekend on‑call shift the author received a flood of alert emails indicating timeout errors in several monitoring service endpoints, especially one that uploaded a 4 MB file and returned a 2 MB response.

The stack trace showed java.io.BufferedReader.readLine calls, suggesting that the HTTP request reached the server and was processed correctly but the response packets were lost in the network, a conclusion confirmed by searching recent logs with the request ID.
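When response packets are lost, a read on the socket simply never returns, which is why the threads were parked inside BufferedReader.readLine. A minimal sketch of that failure mode (the server and port here are stand-ins, not the author's service): setting a read timeout with setSoTimeout makes the hang surface as a SocketTimeoutException instead of blocking forever.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {
    public static void main(String[] args) throws Exception {
        // A server that accepts the connection but never writes a response,
        // mimicking response packets lost in the network.
        try (ServerSocket server = new ServerSocket(0)) {
            Thread silentServer = new Thread(() -> {
                try {
                    Socket s = server.accept();
                    Thread.sleep(5_000); // hold the connection open, send nothing
                    s.close();
                } catch (Exception ignored) { }
            });
            silentServer.setDaemon(true);
            silentServer.start();

            try (Socket client = new Socket("localhost", server.getLocalPort())) {
                // Without a read timeout, readLine() would block here indefinitely,
                // exactly the state seen in the thread dump.
                client.setSoTimeout(200); // fail fast instead of hanging
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(client.getInputStream()));
                try {
                    in.readLine();
                    System.out.println("unexpected: got a response");
                } catch (SocketTimeoutException e) {
                    System.out.println("read timed out as expected");
                }
            }
        }
    }
}
```

Running it prints "read timed out as expected", which is the behavior a client-side timeout buys you when the network silently drops the reply.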

Contacting the network team revealed an aging switch in the data‑center causing intermittent packet loss, which explained the sporadic timeouts.

Later, all endpoints began timing out; attempts to stop the monitoring tasks hung, indicating a deeper issue. Investigation with top showed the Java process consuming ~900 % CPU, and jstat -gc revealed a Full GC occurring once per second.
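The same collection counts and accumulated GC time that jstat -gc reports can also be read in-process through the standard management API, which is handy when you cannot attach external tooling. A small sketch (collector names vary by JVM and chosen GC algorithm):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcWatch {
    public static void main(String[] args) {
        // Each bean corresponds to one collector (young-gen and old-gen).
        // Sampling these counters once per second reproduces what jstat -gc
        // shows: in the incident, the old-gen (Full GC) count climbed by
        // roughly one every second.
        for (GarbageCollectorMXBean gc :
                ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```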

These symptoms pointed to a memory leak. The team captured thread dumps (jstack pid > jstack.log) and heap dumps (jmap -dump:format=b,file=heap.log pid), then restarted the service, which stopped the alerts.

The article explains how to use jstat for JVM monitoring, listing its common options (-class, -compiler, -gc, and the gc-family variants such as -gcutil and -gccapacity).

Further analysis of the thread dump (grep 'java.lang.Thread.State' jstack.log | wc -l) showed about 464 threads, with most in harmless states. Stack analysis commands were provided to identify hot spots.
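The same breakdown the grep pipeline produces from jstack.log can be computed inside the JVM. This sketch groups live threads by state, the programmatic equivalent of grep 'java.lang.Thread.State' jstack.log | sort | uniq -c:

```java
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateCount {
    public static void main(String[] args) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // getAllStackTraces snapshots every live thread, like a jstack dump.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(t.getState(), 1, Integer::sum);
        }
        counts.forEach((state, n) -> System.out.println(state + ": " + n));
        // Pool threads parked in WAITING / TIMED_WAITING are usually harmless;
        // many RUNNABLE threads stuck in the same frame mark the hot spot.
    }
}
```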

Using Eclipse MAT on the .hprof heap file, the author identified that a singleton Bean contained a Map that stored every probe result in an ArrayList. Because the Bean was never reclaimed and the map lacked cleanup logic, its size grew until the JVM ran out of memory, causing the observed Full GC and timeouts.

After locating the leaking object, a pull request was submitted to clear the map, fixing the leak and stabilizing the service.
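A hypothetical reconstruction of the leak pattern and the kind of cleanup the fix adds. All names here (ProbeResultStore, record, taskFinished, the retention cap) are illustrative, not taken from the actual codebase:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// The leak: a singleton bean keeps every probe result forever,
// so the map only ever grows and the old generation fills up.
class ProbeResultStore {
    private final Map<String, List<String>> resultsByTask = new ConcurrentHashMap<>();
    private static final int MAX_RESULTS_PER_TASK = 1_000; // retention cap (assumed)

    void record(String taskId, String result) {
        List<String> results =
                resultsByTask.computeIfAbsent(taskId, k -> new ArrayList<>());
        synchronized (results) {
            results.add(result);
            // Fix, part 1: evict old entries instead of accumulating indefinitely.
            while (results.size() > MAX_RESULTS_PER_TASK) {
                results.remove(0);
            }
        }
    }

    // Fix, part 2: clear a task's entries once the task is done.
    void taskFinished(String taskId) {
        resultsByTask.remove(taskId);
    }

    int size(String taskId) {
        List<String> r = resultsByTask.get(taskId);
        return r == null ? 0 : r.size();
    }
}

public class LeakFixDemo {
    public static void main(String[] args) {
        ProbeResultStore store = new ProbeResultStore();
        for (int i = 0; i < 5_000; i++) {
            store.record("task-1", "result-" + i);
        }
        System.out.println("retained: " + store.size("task-1"));      // retained: 1000
        store.taskFinished("task-1");
        System.out.println("after cleanup: " + store.size("task-1")); // after cleanup: 0
    }
}
```

The essential point is that any long-lived map keyed by unbounded input needs either explicit removal on task completion or a size/time bound; otherwise a singleton holder pins every entry for the life of the JVM.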

The author concludes with a reminder to scrutinize stack traces and consider network reliability, as overlooking such clues can delay problem detection.

Tags: Java, JVM, Monitoring, Memory Leak, Network Timeout, jstack, jstat
Written by

Top Architect

Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architecture, as well as architectural evolution driven by internet technologies. Idea‑driven, sharing‑minded architects are welcome to exchange ideas and learn together.
