How We Traced and Fixed a Netty Off‑Heap Memory Leak in a WebSocket Service
When a WebSocket‑based service built on Netty began returning a large number of 5xx errors, we combined log analysis, CAT monitoring, reflective access to Netty's internal memory counter, and step‑by‑step debugging to trace the cause to an off‑heap memory leak triggered by a null subType field in the encoder, and then fixed it.
Netty is an asynchronous event‑driven network framework built on JDK NIO, simplifying TCP/UDP socket programming.
Our WebSocket‑based long‑connection middleware uses the netty‑socketio library (a Netty implementation of the Socket.IO protocol). In production we observed frequent 5xx errors reported by Nginx.
Using Meituan’s open‑source monitoring platform CAT we discovered two anomalies at the same timestamp: a GC spike and JVM thread blockage.
Problem
An alert reported a large volume of 5xx responses, suggesting the backend service was unavailable.
Investigation Process
Stage 1 – Suspect log4j2
We first checked log4j2 configuration and found a console appender that printed excessive logs, blocking NIO threads. Disabling it did not stop the 5xx alerts.
Stage 2 – Suspicious log entries
Log files showed repeated lines like failed to allocate 64(bytes) of direct memory(...) and an OutOfDirectMemoryError, indicating off‑heap memory exhaustion.
Stage 3 – Locate OOM source
We traced the Netty class PlatformDependent, which updates the static counter DIRECT_MEMORY_COUNTER before each off‑heap allocation and throws a custom OOM error when the limit is exceeded.
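The accounting described above can be modeled in a few lines. This is a simplified stand-in written for illustration, not Netty's actual source: the class name DirectMemoryAccounting, its methods, and the limit parameter are all hypothetical, and a plain OutOfMemoryError stands in for Netty's OutOfDirectMemoryError.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Simplified model of the counter-and-limit check; not Netty's real code. */
final class DirectMemoryAccounting {
    // Plays the role of DIRECT_MEMORY_COUNTER.
    private final AtomicLong counter = new AtomicLong();
    // Hypothetical cap on direct memory, in bytes.
    private final long limit;

    DirectMemoryAccounting(long limit) {
        this.limit = limit;
    }

    /** Reserve bytes before allocating; fail if the cap would be exceeded. */
    void reserve(int capacity) {
        long newUsed = counter.addAndGet(capacity);
        if (newUsed > limit) {
            // Roll back the failed reservation before reporting the error.
            counter.addAndGet(-capacity);
            throw new OutOfMemoryError("failed to allocate " + capacity
                    + " byte(s) of direct memory (used: " + (newUsed - capacity)
                    + ", max: " + limit + ")");
        }
    }

    /** Give bytes back when a buffer is freed. A leaked buffer never gets here. */
    void release(int capacity) {
        counter.addAndGet(-capacity);
    }

    long used() {
        return counter.get();
    }
}
```

The point of the model is that the counter only falls when release is called. If a code path allocates and then never releases, the counter drifts toward the limit until every further allocation fails, which is the behavior the logs showed.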
Stage 4 – Reflective monitoring
Since CAT did not report off‑heap usage accurately, we used Java reflection to access DIRECT_MEMORY_COUNTER and printed its value every second.
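A minimal sketch of this kind of reflective probe follows. It assumes the counter is a private static AtomicLong field named DIRECT_MEMORY_COUNTER on io.netty.util.internal.PlatformDependent, as described above; field layout can differ between Netty versions, so the probe returns an empty result rather than failing. CounterProbe and StandInCounters are hypothetical names, and the stand-in class exists only so the sketch runs without Netty on the classpath.

```java
import java.lang.reflect.Field;
import java.util.OptionalLong;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in so the probe can be tried without Netty present. */
final class StandInCounters {
    private static final AtomicLong DEMO_COUNTER = new AtomicLong(16L * 1024 * 1024);
}

final class CounterProbe {
    /** Reads a static AtomicLong field by name; empty if it cannot be read. */
    static OptionalLong readStaticAtomicLong(String className, String fieldName) {
        try {
            Class<?> type = Class.forName(className);
            Field field = type.getDeclaredField(fieldName);
            field.setAccessible(true);       // the field is private
            Object value = field.get(null);  // null receiver: static field
            if (value instanceof AtomicLong) {
                return OptionalLong.of(((AtomicLong) value).get());
            }
            return OptionalLong.empty();     // field is null or of another type
        } catch (ReflectiveOperationException | RuntimeException e) {
            return OptionalLong.empty();     // class/field missing or access denied
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Sample once per second; a real monitor would run on a daemon thread.
        for (int i = 0; i < 3; i++) {
            OptionalLong used = readStaticAtomicLong(
                    "io.netty.util.internal.PlatformDependent", "DIRECT_MEMORY_COUNTER");
            System.out.println(used.isPresent()
                    ? "netty direct memory: " + used.getAsLong() + " B"
                    : "netty counter unavailable");
            Thread.sleep(1000);
        }
    }
}
```

Looking the class up by name keeps the probe free of a compile-time dependency on Netty internals, at the cost of failing silently if the field is renamed.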
Stage 5 – Growth pattern
After deployment the counter started at 16 MiB (the default chunk size) and then grew slowly without being released, eventually reaching nearly 1 GiB over a weekend.
Stage 6 – Local reproduction
Running the service locally with non‑pooled memory, we observed that each WebSocket disconnect caused an immediate 256 B increase in off‑heap memory that never decreased.
Stage 7 – Source‑level debugging
Stepping through the code we narrowed the leak to the encoder.encodePacket() path, where a null subType caused an NPE and prevented the allocated memory from being released.
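The shape of this bug can be illustrated with a self-contained sketch. It is not netty‑socketio's code: Packet, FakeBuffer, and both encode methods are hypothetical, and an AtomicLong stands in for Netty's counter. The try/finally variant is a general defensive pattern, shown for contrast; it is not the fix described in this article.

```java
import java.util.concurrent.atomic.AtomicLong;

final class EncoderLeakSketch {
    // Stand-in for Netty's off-heap byte counter.
    static final AtomicLong OFF_HEAP_BYTES = new AtomicLong();

    /** Hypothetical packet; subType may be null, as in the bug described above. */
    static final class Packet {
        final String subType;
        Packet(String subType) { this.subType = subType; }
    }

    /** Stand-in for a direct buffer that must be released explicitly. */
    static final class FakeBuffer {
        private final int size;
        FakeBuffer(int size) {
            this.size = size;
            OFF_HEAP_BYTES.addAndGet(size);
        }
        void write(int value) { /* contents are irrelevant to the sketch */ }
        void release() { OFF_HEAP_BYTES.addAndGet(-size); }
    }

    /** Leaky shape: an exception between allocation and release skips release(). */
    static void encodeLeaky(Packet packet) {
        FakeBuffer buf = new FakeBuffer(256);
        buf.write(packet.subType.length()); // throws NPE when subType is null
        buf.release();                      // never reached on the NPE path
    }

    /** Defensive shape: the buffer is released even if encoding throws. */
    static void encodeSafely(Packet packet) {
        FakeBuffer buf = new FakeBuffer(256);
        try {
            buf.write(packet.subType.length());
        } finally {
            buf.release();
        }
    }
}
```

In the sketch, each call to encodeLeaky with a null subType leaves 256 bytes on the counter, mirroring the per‑disconnect growth seen in the local reproduction.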
Stage 8 – Bug fix
We fixed the NPE by ensuring subType is set (e.g., to DISCONNECT), rebuilt the library, and pushed the changes to our internal repository.
Stage 9 – Local verification
After the fix, repeated connect‑disconnect cycles no longer increased off‑heap memory.
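A check like this can be automated by comparing the counter before and after a batch of cycles. The helper below is a hypothetical sketch: LeakCheck and growthAfter are illustrative names, the counter is supplied by the caller (for example, a reflective read of Netty's counter), and the cycle is whatever connect‑disconnect routine the test harness provides.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

final class LeakCheck {
    /** Runs the cycle the given number of times; returns counter growth in bytes. */
    static long growthAfter(int cycles, LongSupplier counter, Runnable cycle) {
        long before = counter.getAsLong();
        for (int i = 0; i < cycles; i++) {
            cycle.run();
        }
        return counter.getAsLong() - before;
    }

    public static void main(String[] args) {
        AtomicLong simulated = new AtomicLong(); // stand-in for the real counter

        // Simulated leaky cycle: allocates 256 bytes and never frees them.
        long leaky = growthAfter(100, simulated::get, () -> simulated.addAndGet(256));

        // Simulated fixed cycle: every allocation is matched by a release.
        long fixed = growthAfter(100, simulated::get, () -> {
            simulated.addAndGet(256);
            simulated.addAndGet(-256);
        });

        System.out.println("leaky growth: " + leaky + " B, fixed growth: " + fixed + " B");
    }
}
```

A comparison like this is only meaningful with non‑pooled memory, as in the local reproduction; with pooling, the counter moves in chunk‑sized steps and small leaks are masked. One way to switch allocators is Netty's io.netty.allocator.type system property, though how the service was configured is not stated here.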
Stage 10 – Production verification
We reported the reflectively read counter to CAT as a custom metric; the metric remained stable in production, confirming the leak was resolved.
Conclusion
Off‑heap memory leaks can be diagnosed by careful log analysis and reflective monitoring.
Netty’s internal off‑heap counter can be read via reflection, without third‑party tools.
Systematic narrowing, thread‑level debugging, and binary search in the code are effective for locating leaks.
IDE debugging features (evaluating expressions ahead of execution, inspecting thread stacks) accelerate the process.
Meituan Technology Team
