How a Hidden Memory Leak Crashed Our Java Monitoring Service (And How We Fixed It)

During a weekend on‑call shift, a Java monitoring service repeatedly timed out. Network packet loss accounted for the first alerts, but the escalation came from a severe memory leak that drove CPU usage to 900% and triggered a full GC every second. We diagnosed it with jstat, jstack, a heap dump, and Eclipse MAT, then eliminated the leak.

Efficient Ops

Background

During a recent on‑call rotation our team handled alert emails, bug investigations, and operational issues. Weekday shifts were manageable, but a weekend shift turned disastrous when network problems caused frequent timeouts in our detection service.

Problem

Network Issue?

At around 7 pm I started receiving alert emails indicating timeouts on several endpoints. The stack traces repeatedly showed reads blocked in java.io.BufferedReader.readLine:

java.io.BufferedReader.readLine(BufferedReader.java:371)
java.io.BufferedReader.readLine(BufferedReader.java:389)
java_io_BufferedReader$readLine.call(Unknown Source)
com.domain.detect.http.HttpClient.getResponse(HttpClient.groovy:122)
com.domain.detect.http.HttpClient.this$2$getResponse(HttpClient.groovy)

Our HTTP client used a DNS timeout of 1 s, a connect timeout of 2 s, and a read timeout of 3 s. A timeout inside readLine means the connection was established and the request was sent, so our first reading was that the server had processed the request but the response packets were lost in the network, leaving the client thread waiting for data until the read timeout fired.
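To make the timeout behavior concrete, here is a minimal sketch of a client with explicit connect and read timeouts. It is not the article's HttpClient.groovy, whose source is not shown; the class and method names are hypothetical, and it uses the JDK's HttpURLConnection, which has no separate DNS timeout setting.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
public class TimeoutSketch {

    /**
     * Opens an HTTP connection with explicit timeouts and returns the first
     * line of the response body (null if the body is empty).
     * A java.net.SocketTimeoutException is thrown when connecting or any
     * single blocking read takes longer than the configured limit.
     */
    public static String readFirstLine(URL url, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(connectTimeoutMs); // limit for establishing the TCP connection
        conn.setReadTimeout(readTimeoutMs);       // limit for each blocking read of response data
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            // Blocks until a full line arrives or the read timeout expires.
            return reader.readLine();
        } finally {
            conn.disconnect();
        }
    }
}
```

With the values from the article this would be called as readFirstLine(url, 2000, 3000). The read timeout only measures how long the client waited for data, so it cannot tell a lost packet apart from a thread that was paused while waiting.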

One endpoint that uploaded a 4 MB file and returned a 2 MB response timed out more often, suggesting larger payloads increased packet‑loss probability. Log searches confirmed network‑level packet loss as the cause.

Problem Escalation

Later, around 8 pm, alerts flooded in for almost every interface, especially the high‑I/O endpoint, raising concerns of a data‑center failure. Server metrics appeared normal, and manual tests succeeded, but attempts to stop the detection tasks timed out, indicating a deeper issue.

Resolution

Memory Leak

Logging into the detection server revealed abnormal CPU usage (up to 900%). The Java process should normally stay between 100–200% CPU, so such a spike suggested either an infinite loop or excessive garbage collection.

Running jstat -gc pid [interval] showed full GC occurring once per second.
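The same signal can be read from inside the JVM. The sketch below uses the standard java.lang.management API to sample cumulative GC counts and times. The class and method names are hypothetical, and the reported time is approximate, so treat the result as a rough indicator rather than an exact pause measurement.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, for illustration only.
public class GcPressureSketch {

    /** Cumulative GC activity across all collectors in this JVM. */
    public static final class GcSnapshot {
        public final long collections;
        public final long timeMs;

        public GcSnapshot(long collections, long timeMs) {
            this.collections = collections;
            this.timeMs = timeMs;
        }
    }

    /** Sums collection counts and approximate collection times over all collectors. */
    public static GcSnapshot snapshot() {
        long count = 0;
        long time = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            count += Math.max(0, gc.getCollectionCount());
            time += Math.max(0, gc.getCollectionTime());
        }
        return new GcSnapshot(count, time);
    }

    /** Rough fraction of the elapsed wall-clock time that was spent in GC. */
    public static double gcTimeFraction(GcSnapshot before, GcSnapshot after, long elapsedMs) {
        if (elapsedMs <= 0) {
            throw new IllegalArgumentException("elapsedMs must be positive");
        }
        return (double) (after.timeMs - before.timeMs) / elapsedMs;
    }
}
```

Sampling once per interval and alerting when the fraction stays high would surface this kind of GC pressure before request timeouts do.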

We captured a thread dump with jstack pid > jstack.log and a heap dump with jmap -dump:format=b,file=heap.log pid, then restarted the service, which stopped the alerts.

jstat

jstat is a powerful JVM monitoring tool. Common options include:

-class Show class‑loading information

-compiler Show JIT compilation statistics

-gc Show garbage‑collection information

-gcXXX Detailed GC info for specific regions (e.g., -gcold)

It is very helpful for locating JVM memory problems.

Investigation

Stack Analysis

We counted the threads in the dump, one java.lang.Thread.State line per Java thread:

grep 'java.lang.Thread.State' jstack.log | wc -l
464

Only about 460 threads were present, with no obvious anomalies. Further analysis of the stack traces showed most threads waiting in native methods or parking.
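The grep above only yields a total. A per-state breakdown makes it easier to see whether threads are mostly parked, blocked, or runnable. The following is a minimal sketch with a hypothetical class name; it assumes the usual jstack line format such as "java.lang.Thread.State: WAITING (parking)".

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
public class ThreadStateTally {

    // Captures the state name and, when present, the detail in parentheses.
    private static final Pattern STATE_LINE =
            Pattern.compile("java\\.lang\\.Thread\\.State: (\\w+)(?: \\(([^)]*)\\))?");

    /** Counts thread-state lines in a jstack dump, grouped by state and detail. */
    public static Map<String, Integer> tally(List<String> jstackLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : jstackLines) {
            Matcher m = STATE_LINE.matcher(line);
            if (m.find()) {
                String key = m.group(2) == null
                        ? m.group(1)
                        : m.group(1) + " (" + m.group(2) + ")";
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

It can be fed the captured dump with ThreadStateTally.tally(Files.readAllLines(Paths.get("jstack.log"))).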

Download Heap Dump

Heap dumps are large binary files; we compressed them with gzip -6 before transferring to a local machine for analysis.

Analyze JVM Heap with MAT

Using Eclipse MAT on the heap dump (an HPROF-format file), we opened the Leak Suspects report. A single object dominated the retained heap, which led us to the culprit.

Code Analysis

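The leak originated from a Bean containing a Map that stored each detection result in an ArrayList. Because the Bean was never reclaimed and the map lacked cleanup logic, its size grew continuously until the heap was nearly exhausted. The resulting back-to-back full GCs repeatedly paused application threads, which is the most likely reason reads stalled in readLine until they timed out.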

We submitted a PR to clear the map, eliminating the leak.
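The article does not show the Bean or the PR, so the following is only a sketch of the pattern and of one way to add the missing cleanup. All names are hypothetical, and results are modeled as plain strings for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the long-lived Bean, for illustration only.
public class DetectionResultStore {

    private final Map<String, List<String>> resultsByTask = new HashMap<>();

    /** Leaky on its own: every call keeps one more result reachable from the map. */
    public synchronized void record(String taskId, String result) {
        resultsByTask.computeIfAbsent(taskId, k -> new ArrayList<>()).add(result);
    }

    /** Cleanup step: hands the results to the caller and removes them from the map. */
    public synchronized List<String> drain(String taskId) {
        List<String> drained = resultsByTask.remove(taskId);
        return drained == null ? Collections.<String>emptyList() : drained;
    }

    /** Number of results still held, useful as a metric to watch for growth. */
    public synchronized int retainedResults() {
        int total = 0;
        for (List<String> results : resultsByTask.values()) {
            total += results.size();
        }
        return total;
    }
}
```

Calling drain once a task's results have been consumed keeps the map from growing without bound. Exporting retainedResults as a metric would make any regression visible long before the heap fills.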

Conclusion

Looking back, the initial alert emails also showed stack traces like:

groovy.json.internal.JsonParserCharArray.decodeValueInternal(JsonParserCharArray.java:166)
groovy.json.internal.JsonParserCharArray.decodeJsonObject(JsonParserCharArray.java:132)
...

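These frames are in JSON parsing code, which works on data that has already been received rather than on the network, so a timeout there points to a problem inside the process, not to packet loss. Reading the thread stacks carefully can reveal such problems early. Comprehensive monitoring and timely heap analysis are essential to prevent similar incidents.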
