Debugging Dubbo Service Hang: Thread‑Pool Exhaustion Caused by HTTP Read Timeout
A sudden Dubbo service hang was traced to thread‑pool exhaustion. Roughly 500 near‑simultaneous HTTP calls were funneled through an HTTP client pool with only 5 active workers, and each call waited out an overly long read timeout. Draining them took about 200 seconds, during which hundreds of thousands of incoming tasks piled up in the Dubbo thread pool.
Incident Overview
One evening, the Youzan distribution team received an urgent alarm: a core distribution service became completely unresponsive. All Dubbo RPC calls timed out, the Dubbo thread pool was fully occupied, and a massive backlog of pending tasks accumulated.
Investigation Steps
2.1 QPS – The QPS chart showed no spike; traffic remained relatively stable, ruling out a sudden traffic surge.
2.2 GC – GC logs revealed no abnormal Stop‑The‑World pauses, so GC was not the cause.
2.3 Slow Queries – No slow database queries were observed during the incident window.
2.4 Timed‑Out Requests – Log analysis uncovered that a normally quiet HTTP endpoint was called over 500 times within one second. More than 400 error logs followed, each reporting a “Read timed out” error, with some logs delayed by several minutes.
The delayed error logs indicated that the HTTP client’s read timeout setting was too long, causing threads to be blocked for an extended period, similar to a slow query.
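As a minimal sketch of what an explicit timeout budget looks like, the snippet below sets connect and read timeouts on the JDK's built‑in HttpURLConnection. This is a stand‑in chosen to keep the example dependency‑free; the service in the incident used a client built with RestTemplateBuilder, which offers comparable connect and read timeout settings (exact method names vary by Spring Boot version). The timeout values, class name, and URL are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

class TimeoutConfigSketch {
    // Hypothetical budgets; tune them to the downstream service's real latency profile.
    static final int CONNECT_TIMEOUT_MS = 1_000;
    static final int READ_TIMEOUT_MS = 2_000;

    /** Prepares (but does not yet connect) an HTTP connection with explicit timeouts. */
    static HttpURLConnection open(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(CONNECT_TIMEOUT_MS); // max wait to establish the TCP connection
            conn.setReadTimeout(READ_TIMEOUT_MS);       // max wait for data once connected
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical endpoint; no network traffic happens until the connection is used.
        HttpURLConnection conn = open("http://localhost:8080/hypothetical-endpoint");
        System.out.println("connect=" + conn.getConnectTimeout()
                + "ms read=" + conn.getReadTimeout() + "ms");
    }
}
```

With a short read timeout, a stalled downstream call fails quickly and releases its thread instead of holding it for minutes.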
Root Cause Analysis
The team suspected the HTTP client built with Spring's RestTemplateBuilder for outbound calls. A local reproduction started 500 concurrent threads, each invoking an HTTP endpoint that deliberately slept for 2 seconds before responding. All responses succeeded, but the round‑trip times rose in steps: requests completed in batches of five.
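The batch pattern can be imitated without any HTTP at all. The sketch below is not the team's reproduction code; it assumes the client behaves like a fixed pool of five workers and replaces the 2‑second endpoint with a short Thread.sleep so it finishes quickly. The class name and the scaled‑down numbers (20 requests, 50 ms) are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class BatchPatternRepro {
    /**
     * Pushes `requests` simulated calls through a fixed pool of `poolSize` workers.
     * Each call sleeps for `latencyMs`. Returns each call's completion time in ms,
     * measured from the moment submission started.
     */
    static long[] run(int requests, int poolSize, long latencyMs) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        long start = System.nanoTime();
        List<Future<Long>> futures = new ArrayList<>();
        for (int i = 0; i < requests; i++) {
            futures.add(pool.submit(() -> {
                Thread.sleep(latencyMs); // stands in for the slow endpoint
                return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            }));
        }
        long[] done = new long[requests];
        try {
            for (int i = 0; i < requests; i++) {
                done[i] = futures.get(i).get();
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return done;
    }

    public static void main(String[] args) {
        long[] done = run(20, 5, 50);
        for (int i = 0; i < done.length; i++) {
            System.out.println("request " + i + " finished after ~" + done[i] + " ms");
        }
    }
}
```

Printed completion times should cluster in groups of five, each group roughly one latency period later than the previous one, which mirrors the stepwise round‑trip times seen in the reproduction.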
Further code tracing revealed that the HTTP client internally uses a queue‑based pool with an active size of only 5. Consequently, the 500 concurrent requests were processed in successive batches of five, taking roughly (500 / 5) × 2 s = 200 s to finish.
During those 200 seconds, the service continued to receive about 3,000 QPS. That is roughly 3,000 × 200 = 600,000 tasks queued in the Dubbo thread pool, which exhausted its capacity and caused the observed "hang". After the 500 queued requests completed, the system gradually recovered.
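The two estimates above are simple enough to capture in a few lines. The sketch below is a back‑of‑the‑envelope model, not a measurement: it assumes every batch takes exactly one latency period and that every request arriving during the drain window ends up queued. The class and method names are made up for illustration.

```java
class BacklogEstimate {
    /** Seconds needed to drain `requests` calls through `poolSize` workers, one batch per latency period. */
    static long drainSeconds(int requests, int poolSize, long latencySeconds) {
        long batches = (requests + poolSize - 1) / poolSize; // ceiling division
        return batches * latencySeconds;
    }

    /** Upper bound on tasks that arrive while the pool is blocked. */
    static long backlog(long qps, long blockedSeconds) {
        return qps * blockedSeconds;
    }

    public static void main(String[] args) {
        long drain = drainSeconds(500, 5, 2);
        System.out.println("drain time: " + drain + " s");                // drain time: 200 s
        System.out.println("backlog: " + backlog(3_000, drain) + " tasks"); // backlog: 600000 tasks
    }
}
```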
Takeaways
1. HTTP request queues – For I/O‑heavy calls, a queue‑based pool can improve throughput, but only if it is sized appropriately. The pool's active size should reflect the expected concurrency and per‑call latency; a pool of 5 serving 2‑second calls can complete at most 2.5 requests per second.
2. Dubbo thread‑pool configuration – Both the queue length and rejection policy are critical. An unbounded queue can lead to massive backlogs; a bounded queue with a sensible rejection strategy helps the system recover faster.
3. Understanding third‑party libraries – Even well‑known tools like Spring’s RestTemplateBuilder may have hidden defaults (e.g., internal queues) that affect performance.
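To illustrate the second takeaway, the sketch below uses a plain JDK ThreadPoolExecutor rather than Dubbo's own thread‑pool configuration, which is set through Dubbo's settings and is not shown here. It pairs a bounded queue with the abort rejection policy, so excess work fails fast instead of accumulating. The class name and pool sizes are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BoundedPoolSketch {
    /** Submits `tasks` blocked tasks to a bounded pool and returns how many were rejected. */
    static int countRejected(int threads, int queueCapacity, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),   // bounded: backlog cannot grow without limit
                new ThreadPoolExecutor.AbortPolicy());     // fail fast once threads and queue are full
        CountDownLatch release = new CountDownLatch(1);    // keeps workers busy, like a stalled HTTP call
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // the caller learns immediately that the pool is saturated
                }
            }
        } finally {
            release.countDown(); // unblock workers so the pool can shut down
            pool.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        // 2 threads + 3 queue slots accept 5 tasks; the remaining 5 of 10 are rejected.
        System.out.println("rejected: " + countRejected(2, 3, 10)); // rejected: 5
    }
}
```

Rejected calls surface as immediate errors that callers can retry or degrade on, and once the blocked workers are released the pool has only a handful of queued tasks to clear rather than hundreds of thousands.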
Overall, the case demonstrates how a seemingly innocuous HTTP read‑timeout setting, combined with a small client‑side request pool, can cascade into thread‑pool exhaustion and a service outage.
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.
