Diagnosing and Resolving Thread Leak in a Java Backend System
This article describes how the author investigated recurring failures in a Java-based lecture hall system, identified a thread leak caused by improper use of AsyncHttpClient, and resolved it through command-line diagnostics, a code fix, and follow-up monitoring, restoring stability.
1. Problem Background
The lecture hall and points management systems were generally stable but occasionally failed, preventing users from registering for courses. The failures recurred roughly every two weeks, and restarting the service temporarily restored normal operation, which suggested a resource leak.
2. Investigation Approach
Since the original developers were unavailable, the investigation started by gathering system information and logs. Two possible leak sources were considered: object (memory) leak and thread leak.
3. Investigation Process
Step 1: Identify Java process ID
sudo -u tomcat jps -lv | grep qtscore
Step 2: Check for memory leaks
GC logs showed normal frequency and no frequent Full GC. Heap histogram was examined:
sudo -u tomcat jmap -F -histo 11035
The top memory consumers were not business code, so a memory leak was largely ruled out.
Step 3: Check for thread leaks
top -H -p 11035
The process had 4038 threads, far exceeding the default Dubbo (200) and Tomcat (200) thread pool sizes, indicating a thread leak. Stack traces were captured:
sudo -u tomcat jstack -l 11035 > /tmp/qtscore_stack.log
The stack contained many "New I/O boss" Netty threads but no business‑logic code, pointing to a Netty‑based thread‑pool leak. Log analysis revealed frequent errors from a Dubbo service interface.
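Besides top -H and jstack, the same signal can be collected from inside the JVM. The sketch below is not what the author ran; it is a JDK-only illustration that groups live threads by name prefix, the kind of grouping that would expose a family like "New I/O boss" growing without bound:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

public class ThreadAudit {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        System.out.println("live threads: " + mx.getThreadCount());

        // Group threads by name prefix (digits, '#', '-' and spaces stripped
        // from the tail) to spot runaway families such as "New I/O boss #N".
        Map<String, Integer> byPrefix = new HashMap<>();
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info == null) continue; // thread may have exited meanwhile
            String prefix = info.getThreadName().replaceAll("[\\d#\\-\\s]+$", "");
            byPrefix.merge(prefix, 1, Integer::sum);
        }
        byPrefix.forEach((name, n) -> System.out.println(name + " x" + n));
    }
}
```

Exposing a count like this through a metrics endpoint makes the leak visible long before the process hits the OS thread limit.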
Further code review found that a new AsyncHttpClient was being instantiated for every request. AsyncHttpClient creates a fresh Netty thread pool on each instantiation, so the instance is meant to be created once and reused. The code was corrected to share a single AsyncHttpClient instance across requests.
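The buggy and corrected patterns can be modeled with a JDK-only sketch. LeakyClient below is an illustrative stand-in, not the real AsyncHttpClient API: like AsyncHttpClient, each instance owns its own worker pool, so constructing one per request leaks threads, while routing all requests through one shared instance keeps the thread count flat:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Stand-in for AsyncHttpClient: every instance owns its own worker pool,
// the way each AsyncHttpClient owns a Netty thread pool.
class LeakyClient {
    private final ExecutorService workers = Executors.newFixedThreadPool(2);

    void get(String url) {
        workers.submit(() -> { /* pretend to perform the HTTP call */ });
    }

    void close() {
        workers.shutdown();
    }
}

public class ClientReuseDemo {
    // The fix: one client instance shared by all callers.
    private static final LeakyClient SHARED = new LeakyClient();

    public static void main(String[] args) {
        int before = Thread.activeCount();

        // Buggy pattern: a new client (and worker pool) per request,
        // never closed, so each iteration strands at least one thread.
        for (int i = 0; i < 5; i++) {
            new LeakyClient().get("http://example.com");
        }
        int afterLeak = Thread.activeCount();

        // Correct pattern: every request goes through the shared instance;
        // the pool caps out at its fixed size no matter how many calls arrive.
        for (int i = 0; i < 5; i++) {
            SHARED.get("http://example.com");
        }
        int afterShared = Thread.activeCount();

        System.out.println("leaked=" + (afterLeak - before >= 5));
        System.out.println("sharedStable=" + (afterShared - afterLeak <= 2));
    }
}
```

With the real library the same idea applies: build the client once (for example in a static field or a singleton bean), reuse it for every request, and close it only on shutdown.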
After redeploying, the thread count stabilized around 350, confirming the fix. Additional log inspection showed that the errors began once the process exceeded the system's maximum user threads (4096), as reported by ulimit -a.
4. Lessons Learned
Core components like Dubbo are generally reliable; focus on custom code when issues arise.
System logs often point directly to the root cause; thorough log analysis speeds up troubleshooting.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.