Debugging Dubbo Service Hang: Thread‑Pool Exhaustion Caused by HTTP Read Timeout

A sudden Dubbo service hang was traced to thread‑pool exhaustion: 500 concurrent HTTP calls, each subject to an overly long read timeout, were funneled through the HTTP client's internal queue‑based pool, which had only 5 active slots. Draining them took about 200 seconds, during which hundreds of thousands of tasks piled up in the Dubbo thread pool.

Youzan Coder

Incident Overview

One evening, the Youzan distribution team received an urgent alarm: a core distribution service became completely unresponsive. All Dubbo RPC calls timed out, the Dubbo thread pool was fully occupied, and a massive backlog of pending tasks accumulated.

Investigation Steps

2.1 QPS – The QPS chart showed no spike; traffic remained relatively stable, ruling out a sudden traffic surge.

2.2 GC – GC logs revealed no abnormal Stop‑The‑World pauses, so GC was not the cause.

2.3 Slow Queries – No slow database queries were observed during the incident window.

2.4 Timed‑Out Requests – Log analysis uncovered that a normally quiet HTTP endpoint was called over 500 times within one second. More than 400 error logs followed, each reporting a “Read timed out” error, with some logs delayed by several minutes.

The delayed error logs indicated that the HTTP client’s read timeout setting was too long, causing threads to be blocked for an extended period, similar to a slow query.
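The effect of the read timeout on how long a caller thread stays blocked can be sketched with JDK classes only. This is a minimal illustration, not the service's actual client code: it uses `HttpURLConnection` and the JDK's built-in `HttpServer` rather than RestTemplate, and the class name `ReadTimeoutDemo`, the `/slow` path, and the method `blockedMillis` are hypothetical names chosen for the example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.net.URI;

class ReadTimeoutDemo {
    /**
     * Calls a deliberately slow local endpoint and returns how long (in ms)
     * the calling thread was blocked before it got a response or timed out.
     */
    static long blockedMillis(int serverDelayMs, int readTimeoutMs) {
        try {
            // Local stand-in for the slow downstream HTTP service (assumption).
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/slow", exchange -> {
                try {
                    Thread.sleep(serverDelayMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                exchange.sendResponseHeaders(200, -1); // 200 with an empty body
                exchange.close();
            });
            server.start();
            try {
                long start = System.nanoTime();
                HttpURLConnection conn = (HttpURLConnection) URI.create(
                        "http://127.0.0.1:" + server.getAddress().getPort() + "/slow")
                        .toURL().openConnection();
                conn.setConnectTimeout(1000);
                conn.setReadTimeout(readTimeoutMs); // upper bound on waiting for response data
                try {
                    conn.getResponseCode();
                } catch (SocketTimeoutException e) {
                    // The "Read timed out" case: the caller gives up after readTimeoutMs.
                }
                return (System.nanoTime() - start) / 1_000_000;
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a 2-second server delay, `blockedMillis(2000, 100)` returns after roughly the 100 ms timeout, whereas a generous timeout such as `blockedMillis(300, 3000)` keeps the caller waiting for the full server delay. The longer the timeout, the longer each worker thread can be pinned by a slow dependency.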

Root Cause Analysis

The team suspected the HTTP client built with RestTemplateBuilder for outbound calls. To reproduce the problem locally, they started 500 concurrent threads, each invoking an HTTP endpoint that deliberately slept for 2 seconds before responding. Every request eventually succeeded, but the round‑trip times climbed in a predictable pattern: requests completed in batches of five.

Further code tracing revealed that the HTTP client internally uses a queue‑based pool with an active size of only 5. Consequently, the 500 concurrent requests were processed sequentially in groups of five, taking roughly 500 / 5 * 2 s = 200 s to finish.
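The batching behaviour can be reproduced in miniature without any HTTP at all. The sketch below is an assumption-laden model, not the client's real internals: a fair `Semaphore` stands in for the 5-slot pool, `Thread.sleep` stands in for the slow endpoint, and `BatchedPoolDemo` and `completionTimes` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;

class BatchedPoolDemo {
    /**
     * Starts `requests` concurrent callers that must each hold one of `slots`
     * permits for `latencyMs`. Returns every caller's completion time in ms
     * measured from the moment the first caller was submitted.
     */
    static List<Long> completionTimes(int requests, int slots, long latencyMs) {
        Semaphore gate = new Semaphore(slots, true); // models the small active pool
        ExecutorService callers = Executors.newFixedThreadPool(requests);
        long start = System.nanoTime();
        List<Future<Long>> futures = new ArrayList<>();
        for (int i = 0; i < requests; i++) {
            futures.add(callers.submit(() -> {
                gate.acquire();              // callers queue here when all slots are busy
                try {
                    Thread.sleep(latencyMs); // simulated slow endpoint
                } finally {
                    gate.release();
                }
                return (System.nanoTime() - start) / 1_000_000;
            }));
        }
        try {
            List<Long> times = new ArrayList<>();
            for (Future<Long> f : futures) {
                times.add(f.get());
            }
            return times;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            callers.shutdown();
        }
    }
}
```

Calling `completionTimes(50, 5, 20)` scales the incident down by a factor of ten in request count and a hundred in latency: the last caller cannot finish before about 50 / 5 × 20 ms = 200 ms, even though every caller started at essentially the same time.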

During those 200 seconds, the service continued to receive about 3 000 QPS, generating roughly 600 000 tasks that were queued in the Dubbo thread pool, exhausting its capacity and causing the observed “hang”. After the 500 queued requests completed, the system gradually recovered.
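The two figures in this analysis follow from simple arithmetic, written out here for reference. The symbols are introduced only for this illustration, and the backlog estimate assumes, as the quoted figure implies, that essentially none of the arriving requests could be served while the workers were blocked.

```latex
T_{\text{drain}} = \frac{500\ \text{requests}}{5\ \text{slots}} \times 2\,\text{s} = 200\,\text{s},
\qquad
N_{\text{backlog}} \approx 3000\,\text{requests/s} \times 200\,\text{s} = 600\,000\ \text{tasks}
```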

Takeaways

1. HTTP request queues – When using I/O‑heavy operations, a queue‑based pool can improve throughput but must be sized appropriately. The pool’s active size should reflect expected concurrency and latency.

2. Dubbo thread‑pool configuration – Both the queue length and rejection policy are critical. An unbounded queue can lead to massive backlogs; a bounded queue with a sensible rejection strategy helps the system recover faster.
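The difference a bounded queue and an explicit rejection policy make can be shown with the JDK's `ThreadPoolExecutor`. This is a generic sketch rather than Dubbo's own thread-pool configuration; `BoundedPoolDemo` and `rejectedCount` are hypothetical names, and a `CountDownLatch` stands in for the blocked HTTP calls.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BoundedPoolDemo {
    /**
     * Submits `tasks` tasks that all block until released, to a pool with
     * `threads` workers and a queue holding at most `queueCapacity` tasks.
     * Returns how many submissions were rejected instead of being queued.
     */
    static int rejectedCount(int threads, int queueCapacity, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),  // bounded backlog
                new ThreadPoolExecutor.AbortPolicy());    // fail fast when full
        CountDownLatch release = new CountDownLatch(1);   // keeps workers "stuck"
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // the caller learns immediately that the pool is saturated
                }
            }
        } finally {
            release.countDown(); // unblock the workers so the pool can drain
            pool.shutdown();
        }
        return rejected;
    }
}
```

`rejectedCount(2, 3, 10)` returns 5: two tasks occupy the workers, three wait in the queue, and the remaining five are rejected at once. With an unbounded queue those five, and everything after them, would instead wait silently, which is the kind of backlog described in the incident.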

3. Understanding third‑party libraries – Even well‑known tools like Spring’s RestTemplateBuilder may have hidden defaults (e.g., internal queues) that affect performance.

Overall, the case demonstrates how a seemingly innocuous HTTP read‑timeout setting can cascade into thread‑pool exhaustion and service outage.
