How I Traced and Fixed a Netty Off‑Heap Memory Leak in a WebSocket Service
This article details a step‑by‑step investigation of a Netty off‑heap memory leak that caused Nginx 5xx errors, covering background, alert analysis, multiple debugging stages, reflective monitoring, the root‑cause NPE fix, and verification in both local and production environments.
0. Introduction
Netty is an asynchronous event‑driven network framework that simplifies TCP/UDP socket programming and enables rapid development of high‑performance server and client applications.
Why choose Netty over plain JDK NIO?
JDK NIO requires mastering many concepts (Selector, Channel, ByteBuffer) and its API is complex and error‑prone.
Netty’s I/O model can be switched (e.g., from NIO to epoll) with minimal code changes.
Built‑in packet framing, exception detection, etc., let you focus on business logic.
Netty works around well‑known JDK bugs, including the epoll empty‑polling bug that can spin a Selector at 100% CPU.
Optimized thread and Selector handling; its Reactor thread model handles high concurrency efficiently.
Provides protocol stacks so you rarely need to implement them yourself.
Active community with mailing lists and issue tracker.
Proven in major RPC frameworks (Dubbo), messaging middleware (RocketMQ) and big‑data systems (Hadoop).
1. Background
We built a long‑living middleware based on WebSocket using the netty‑socketio library, which implements the Socket.IO protocol on top of Netty. While the framework is well‑regarded, we encountered an off‑heap memory leak.
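For context, a minimal netty‑socketio server looks roughly like this (hostname, port, and listener bodies are illustrative, not our production code):

```java
import com.corundumstudio.socketio.Configuration;
import com.corundumstudio.socketio.SocketIOServer;

public class WebSocketServer {
    public static void main(String[] args) {
        // Illustrative settings; the production service configures these differently
        Configuration config = new Configuration();
        config.setHostname("0.0.0.0");
        config.setPort(9092);

        SocketIOServer server = new SocketIOServer(config);
        server.addConnectListener(client ->
                System.out.println("connected: " + client.getSessionId()));
        server.addDisconnectListener(client ->
                System.out.println("disconnected: " + client.getSessionId()));
        server.start(); // Netty event loops start here
    }
}
```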
2. Alert
One morning Nginx reported a large number of 5xx responses, indicating the backend service was unavailable.
CAT monitoring showed two anomalies on a particular machine: a GC spike and JVM thread blockage at the same timestamp.
The logs revealed massive Log4j2 console output that blocked Netty NIO threads, causing the 5xx errors.
3. Investigation Process
Stage 1 – Suspect Log4j2
Thread blockage was traced to Log4j2 flooding the console. Commenting out the console appender temporarily stopped the 5xx alerts, but the problem reappeared days later.
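The temporary mitigation was simply to disable the synchronous console appender in log4j2.xml and keep only file output (appender names and patterns below are illustrative):

```xml
<Appenders>
  <!-- The synchronous Console appender was the blocking culprit; disabled temporarily.
  <Console name="Console" target="SYSTEM_OUT">
    <PatternLayout pattern="%d %p %c{1.} [%t] %m%n"/>
  </Console>
  -->
  <RollingFile name="File" fileName="logs/app.log"
               filePattern="logs/app-%d{yyyy-MM-dd}.log.gz">
    <PatternLayout pattern="%d %p %c{1.} [%t] %m%n"/>
    <Policies>
      <TimeBasedTriggeringPolicy/>
    </Policies>
  </RollingFile>
</Appenders>
```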
Stage 2 – Suspicious Log Entries
Near the failure point the logs repeatedly contained “failed to allocate 64 byte(s) of direct memory (used: …, max: …)”, followed by Netty’s OutOfDirectMemoryError, indicating off‑heap memory exhaustion.
Stage 3 – Locate OOM Source
Searching Netty’s source revealed the class io.netty.util.internal.PlatformDependent, which tracks direct‑memory usage in a static AtomicLong named DIRECT_MEMORY_COUNTER. When an allocation would push usage past the configured limit, Netty throws its own OutOfDirectMemoryError instead of a JVM‑level OOM.
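The accounting logic looks roughly like this (simplified from Netty 4.1’s PlatformDependent; exact code varies by version):

```java
// Simplified from io.netty.util.internal.PlatformDependent (Netty 4.1.x)
private static void incrementMemoryCounter(int capacity) {
    if (DIRECT_MEMORY_COUNTER != null) {
        long newUsedMemory = DIRECT_MEMORY_COUNTER.addAndGet(capacity);
        if (newUsedMemory > DIRECT_MEMORY_LIMIT) {
            // Roll back the reservation and fail with Netty's own OOM error,
            // producing the "failed to allocate ... byte(s) of direct memory" log line
            DIRECT_MEMORY_COUNTER.addAndGet(-capacity);
            throw new OutOfDirectMemoryError("failed to allocate " + capacity
                    + " byte(s) of direct memory (used: " + newUsedMemory
                    + ", max: " + DIRECT_MEMORY_LIMIT + ')');
        }
    }
}
```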
Stage 4 – Reflective Monitoring
Since CAT did not report off‑heap usage, we used reflection to obtain the DIRECT_MEMORY_COUNTER field and printed its value every second.
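A minimal sketch of the reporter we used (the field name follows Netty’s internals; the scheduler setup is illustrative):

```java
import io.netty.util.internal.PlatformDependent;

import java.lang.reflect.Field;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class DirectMemoryReporter {

    public static void start() throws Exception {
        // DIRECT_MEMORY_COUNTER is a private static AtomicLong inside PlatformDependent
        Field field = PlatformDependent.class.getDeclaredField("DIRECT_MEMORY_COUNTER");
        field.setAccessible(true);
        AtomicLong counter = (AtomicLong) field.get(null);

        // Log the current off-heap usage once per second
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
                () -> System.out.println("netty direct memory used: "
                        + counter.get() / 1024 + " KiB"),
                0, 1, TimeUnit.SECONDS);
    }
}
```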
Two possibilities remained:
Sudden allocation of a large amount of off‑heap memory.
Slow growth that eventually reaches the limit.
Stage 5 – Slow Growth or Spike?
After deployment the initial off‑heap usage was 16 MiB (one PooledByteBufAllocator chunk). Within minutes it began to rise slowly without ever being released, reaching ~1 GiB after a weekend.
Stage 6 – Local Simulation
Running the service locally with the unpooled allocator showed that each client disconnect increased off‑heap usage by 256 B that was never released.
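Disabling pooling makes every allocation and release hit the direct‑memory counter immediately, so a single leaked buffer becomes visible. We used Netty’s standard system property for this (the jar name is illustrative; the leak‑detection flag is an optional extra):

```
java -Dio.netty.allocator.type=unpooled \
     -Dio.netty.leakDetection.level=paranoid \
     -jar ws-service.jar
```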
Stage 7 – Debugging Disconnect
Stepping through onDisconnect showed that memory jumps by 256 B right after the method returns. The increase originated in the encoder, which allocates a buffer and then throws an NPE while processing the packet’s subType field.
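Reduced to its essentials, the leak pattern is allocate‑then‑throw with no release on the error path (a simplified illustration, not the exact netty‑socketio encoder code):

```java
import com.corundumstudio.socketio.protocol.Packet;
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;
import io.netty.channel.Channel;
import io.netty.util.CharsetUtil;

// Simplified illustration of the leak pattern; not the exact netty-socketio code.
final class LeakyEncodeSketch {
    void send(Channel channel, ByteBufAllocator allocator, Packet packet) {
        ByteBuf out = allocator.buffer();   // off-heap buffer reserved here
        encode(packet, out);                // throws NPE when packet.getSubType() is null
        channel.writeAndFlush(out);         // never reached, so out.release() never runs
    }

    private void encode(Packet packet, ByteBuf out) {
        // Hypothetical encoder body: dereferences subType without a null check
        out.writeCharSequence(packet.getSubType().toString(), CharsetUtil.UTF_8);
    }
}
```

Because writeAndFlush is what normally hands ownership of the buffer (and its eventual release) to the pipeline, any exception thrown between allocation and write strands the buffer permanently.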
Stage 8 – Bug Fix
Ensuring the subType field is never null (setting it to DISCONNECT) prevents the NPE and stops the off‑heap leak.
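A sketch of the fix, assuming netty‑socketio’s Packet API (the wrapper class name is hypothetical):

```java
import com.corundumstudio.socketio.protocol.Packet;
import com.corundumstudio.socketio.protocol.PacketType;

final class DisconnectPacketFix {
    // Never hand the encoder a packet whose subType is null.
    static Packet disconnectPacket() {
        Packet packet = new Packet(PacketType.MESSAGE);
        packet.setSubType(PacketType.DISCONNECT); // previously null, triggering the encoder NPE
        return packet;
    }
}
```

A defensive hardening on top of the root‑cause fix is to release the buffer in a catch block around encoding, so any future encoder bug surfaces as an exception rather than a slow off‑heap leak.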
Stage 9 – Local Verification
After rebuilding and redeploying, repeated connect/disconnect cycles no longer increased off‑heap memory.
Stage 10 – Production Verification
Deploying the fix to the cluster and monitoring via CAT showed the off‑heap counter stabilizing below the limit.
Conclusion
Off‑heap memory leaks in Netty can be diagnosed by carefully analysing logs, monitoring the DIRECT_MEMORY_COUNTER, and using reflection to expose internal metrics.
When a leak is found, fixing the root cause (e.g., a null field that triggers an NPE) and adding custom monitoring prevents recurrence.
Systematic debugging—isolating the offending thread, stepping through code, and narrowing the suspect region—remains an effective strategy for locating hard‑to‑detect bugs.
JavaEdge
Front‑line development experience at several leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan, with nearly 300k followers online. Expertise in distributed system design, AIGC application development, and quantitative finance investing.