Why Did My Dubbo Thread Pool Deadlock? A Deep Dive into CompletableFuture Blocking
The article analyzes a production incident where a Dubbo thread pool exhausted its threads due to CompletableFuture#join blocking, explains how the custom business thread pool caused mutual waiting, and presents a solution that isolates asynchronous tasks into a separate pool to restore service stability.
1. Problem Background
Online monitoring detected a large number of interface errors. The error log showed that the Dubbo thread pool had reached its maximum of 200 active threads, indicating that the thread‑pool resources were exhausted.
2. Investigation
It was suspected that Dubbo threads were blocked by a time‑consuming method or a sudden traffic spike. Monitoring confirmed normal request traffic, so the focus shifted to thread‑pool blockage.
2.1 Why were Dubbo threads blocked?
jstack showed many Dubbo threads in the WAITING state, parked in CompletableFuture#join. In the simplified code, a custom fixed thread pool of size 8 serves as the "business thread pool". method2 submits several sub‑tasks to this pool and then calls join to wait for them to complete.
When a request arrived, a Dubbo thread invoked method1, which in turn called method2. method2 submitted its sub‑tasks to the business pool and blocked until they finished. The Dubbo thread could only proceed once the business pool had run those sub‑tasks, so when the pool never got to them, the Dubbo thread stayed blocked permanently.
2.2 Why did the sub‑tasks never finish?
Stack traces of the eight business threads also showed them in the WAITING state on CompletableFuture#join. The code diagram indicated that both method3 (asynchronous call) and method2 (synchronous call) shared the same business thread pool. When eight requests arrived simultaneously, all eight business threads were occupied running method3, and the sub‑tasks of method2 were queued behind them in the same pool. Every business thread was waiting in method2 for a queued sub‑task, and no thread was free to run one, so the pool deadlocked.
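The mutual wait described above can be reproduced in a few lines. The following is a minimal sketch, not the production code: the class name, the helper completesWithin, the latch, and the daemon thread factory are assumptions added for illustration, and the bodies of method2 and method3 are stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class SharedPoolDeadlockSketch {

    // Stand-in for method2: submit a sub-task to the pool, then block in join().
    static String method2(ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> "sub-task done", pool).join();
    }

    // Runs poolSize concurrent method3-style calls on ONE shared fixed pool and
    // reports whether they all finished within the timeout.
    static boolean completesWithin(int poolSize, long timeoutMillis) {
        // Daemon threads (an assumption of this sketch), so the deliberately
        // stuck workers cannot keep the JVM alive.
        ExecutorService businessPool = Executors.newFixedThreadPool(poolSize, r -> {
            Thread t = new Thread(r, "business-pool");
            t.setDaemon(true);
            return t;
        });
        // Makes the scenario deterministic: no call submits its sub-task
        // until every pool thread has been taken by a parent call.
        CountDownLatch allThreadsTaken = new CountDownLatch(poolSize);
        List<CompletableFuture<String>> calls = new ArrayList<>();
        for (int i = 0; i < poolSize; i++) {
            calls.add(CompletableFuture.supplyAsync(() -> { // stand-in for method3
                allThreadsTaken.countDown();
                try {
                    allThreadsTaken.await();
                } catch (InterruptedException e) {
                    throw new IllegalStateException(e);
                }
                return method2(businessPool); // sub-task queues behind the parents
            }, businessPool));
        }
        try {
            CompletableFuture.allOf(calls.toArray(new CompletableFuture[0]))
                    .get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // every worker is parked in join(); nothing can run the sub-tasks
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            businessPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("finished: " + completesWithin(8, 500));
    }
}
```

With a pool of eight threads and eight concurrent calls, the sketch should report that the calls did not finish: each worker holds a thread while waiting for a sub‑task that is still sitting in the queue.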
2.3 Summary
All threads in the business thread pool were blocked, which in turn blocked every Dubbo thread that relied on this pool.
3. Solution
The real code path is more complex than a simple A‑calls‑B scenario. Directly converting method2 to sequential execution would increase latency for other interfaces. The recommended fix is to isolate thread pools: use one pool for submitting asynchronous tasks (method3) and another for executing the sub‑tasks (method2). This separation prevents the two sides from contending for the same threads and allows a quick, low‑impact deployment.
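A minimal sketch of the isolation idea follows. The pool names, pool sizes, the daemonPool helper, and the bodies of method2 and method3 are hypothetical; only the split into two pools comes from the fix described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class IsolatedPoolsSketch {

    // Hypothetical helper: fixed pool with named daemon threads.
    static ExecutorService daemonPool(int size, String name) {
        return Executors.newFixedThreadPool(size, r -> {
            Thread t = new Thread(r, name);
            t.setDaemon(true);
            return t;
        });
    }

    // Stand-in for method2: fan out two sub-tasks to the dedicated pool, then join.
    static int method2(ExecutorService subTaskPool) {
        CompletableFuture<Integer> first = CompletableFuture.supplyAsync(() -> 1, subTaskPool);
        CompletableFuture<Integer> second = CompletableFuture.supplyAsync(() -> 2, subTaskPool);
        return first.join() + second.join();
    }

    // Runs `calls` concurrent method3-style calls; returns the summed results,
    // or -1 if they did not finish within the timeout.
    static int runConcurrentCalls(int calls, long timeoutMillis) {
        ExecutorService asyncCallPool = daemonPool(calls, "async-call"); // runs method3
        ExecutorService subTaskPool = daemonPool(calls, "sub-task");     // runs method2's sub-tasks
        // Same worst case as the incident: every caller thread is busy before any sub-task is submitted.
        CountDownLatch allCallersBusy = new CountDownLatch(calls);
        List<CompletableFuture<Integer>> results = new ArrayList<>();
        for (int i = 0; i < calls; i++) {
            results.add(CompletableFuture.supplyAsync(() -> { // stand-in for method3
                allCallersBusy.countDown();
                try {
                    allCallersBusy.await();
                } catch (InterruptedException e) {
                    throw new IllegalStateException(e);
                }
                return method2(subTaskPool);
            }, asyncCallPool));
        }
        try {
            CompletableFuture.allOf(results.toArray(new CompletableFuture[0]))
                    .get(timeoutMillis, TimeUnit.MILLISECONDS);
            int total = 0;
            for (CompletableFuture<Integer> r : results) {
                total += r.join();
            }
            return total;
        } catch (TimeoutException e) {
            return -1;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            asyncCallPool.shutdownNow();
            subTaskPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("sum of results: " + runConcurrentCalls(8, 5_000));
    }
}
```

The sub‑tasks never wait on anything themselves, so the sub‑task pool always drains, and every caller blocked in join is eventually released. The same reasoning holds only as long as tasks in the second pool do not submit work back to the first pool and wait for it.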
4. Deep Dive
Version control history revealed that the business code had recently switched from the default CompletableFuture thread pool to a custom fixed pool. The default pool had been stable for a long time.
4.1 Can the default thread pool work safely?
A test program created 1,000 CompletableFuture tasks. All tasks completed without deadlock, demonstrating that the default pool can handle the workload.
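The original test program is not shown, so the following is only a scaled‑down sketch of what such a test might look like. It assumes the tasks are nested (each parent starts a child on the default executor and joins it), uses 100 parents instead of 1,000, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class DefaultPoolNestedJoinSketch {

    // Starts taskCount parent tasks on the default executor. Each parent starts
    // a child on the same default executor and blocks in join() until it finishes.
    // Returns the sum of all parent results, or -1 on timeout.
    static long runNested(int taskCount, long timeoutSeconds) {
        List<CompletableFuture<Integer>> parents = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            final int n = i;
            parents.add(CompletableFuture.supplyAsync(() -> {
                CompletableFuture<Integer> child = CompletableFuture.supplyAsync(() -> n);
                return child.join() + 1; // parent waits for its child
            }));
        }
        try {
            CompletableFuture.allOf(parents.toArray(new CompletableFuture[0]))
                    .get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            return -1;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
        long sum = 0;
        for (CompletableFuture<Integer> parent : parents) {
            sum += parent.join();
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("sum: " + runNested(100, 30));
    }
}
```

Swapping the default executor for a small fixed pool in both supplyAsync calls turns this into the shared‑pool pattern from the incident.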
4.2 Why does the default pool avoid deadlock?
CompletableFuture internally chooses between two executors:
If ForkJoinPool#getCommonPoolParallelism() > 1, it uses the common ForkJoinPool. Otherwise, it falls back to ThreadPerTaskExecutor, which creates a new thread for each task.
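To see which branch applies on a given machine, a small check like the one below can help. It is a sketch: describeDefaultExecutor is a hypothetical helper that only mirrors the rule stated above and does not read CompletableFuture internals, and the exact selection logic may differ between JDK versions.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ForkJoinPool;

class DefaultExecutorCheck {

    // Hypothetical helper mirroring the rule described in the text.
    static String describeDefaultExecutor(int commonPoolParallelism) {
        return commonPoolParallelism > 1
                ? "ForkJoinPool.commonPool()"
                : "ThreadPerTaskExecutor";
    }

    public static void main(String[] args) {
        int parallelism = ForkJoinPool.getCommonPoolParallelism();
        System.out.println("common pool parallelism: " + parallelism);
        System.out.println("expected default executor: " + describeDefaultExecutor(parallelism));

        // Observe where a default-executor task actually runs.
        String threadName = CompletableFuture
                .supplyAsync(() -> Thread.currentThread().getName())
                .join();
        System.out.println("async task ran on thread: " + threadName);
    }
}
```

The common pool parallelism is typically the number of available processors minus one, which is consistent with a value of 7 on an eight‑core machine.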
On the test machine the common pool parallelism was 7, so the common ForkJoinPool was used. Its workers are ForkJoinWorkerThread instances. When a worker is about to block, as CompletableFuture#join does through ForkJoinPool#managedBlock, the pool may invoke tryCompensate to start a spare worker so that queued tasks can still run.
The compensation logic can create up to 32,767 threads for a regular ForkJoinPool and up to parallelism + 256 threads for the common pool; beyond that limit it throws a RejectedExecutionException.
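The compensation mechanism is also available to application code through the public ForkJoinPool.ManagedBlocker interface. The sketch below uses a pool with parallelism 1, where a parent task waits for a child submitted to the same pool. LatchBlocker and parentCompletes are hypothetical names, and whether a spare worker is actually started is up to the pool, so treat this as an illustration rather than a guarantee.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ManagedBlockSketch {

    // Adapts a CountDownLatch to the ManagedBlocker protocol.
    static final class LatchBlocker implements ForkJoinPool.ManagedBlocker {
        private final CountDownLatch latch;

        LatchBlocker(CountDownLatch latch) {
            this.latch = latch;
        }

        @Override
        public boolean block() throws InterruptedException {
            latch.await();
            return true; // no further blocking is necessary
        }

        @Override
        public boolean isReleasable() {
            return latch.getCount() == 0;
        }
    }

    // Parent and child share a pool with parallelism 1. The parent announces its
    // wait through managedBlock, which lets the pool add a spare worker for the child.
    static boolean parentCompletes(long timeoutMillis) {
        ForkJoinPool pool = new ForkJoinPool(1);
        CountDownLatch childDone = new CountDownLatch(1);
        try {
            ForkJoinTask<Boolean> parent = pool.submit(() -> {
                pool.execute(childDone::countDown);                     // child task
                ForkJoinPool.managedBlock(new LatchBlocker(childDone)); // parent waits
                return Boolean.TRUE;
            });
            return parent.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return false;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("parent completed: " + parentCompletes(5_000));
    }
}
```

A fixed ThreadPoolExecutor has no equivalent hook: a worker that blocks there simply holds its thread, which is why the custom business pool could not recover.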
4.3 Summary
The default pool avoids mutual waiting because ThreadPerTaskExecutor always spawns a fresh thread, while the common ForkJoinPool attempts to compensate blocked tasks by creating additional worker threads, ensuring enough resources to complete the work.
5. Final Thoughts
The root cause was that parent and child tasks shared the same fixed thread pool, leading to a circular wait when the parent waited for children. In future projects, avoid using a single thread pool for both parent and child tasks; isolate asynchronous calls into separate pools to prevent similar deadlocks.
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.
