Why Tomcat Thread‑Pool Saturation Crashed Our Service and How to Avoid It
A detailed post‑mortem explains how a sudden traffic surge, insufficient pod count, and a custom thread‑pool bottleneck caused Tomcat thread‑pool saturation, health‑check failures, and a zone‑wide outage, and offers concrete lessons on capacity planning, monitoring, and safe coding practices.
1 Background
On a sunny Friday morning the author was driving to work when, at 08:50, Feishu sent an urgent alert: a core application's response time (RT) had spiked, making zone‑2 unavailable. Emergency actions (rate‑limiting, fallback, scaling out) restored the service in about ten minutes.
2 Root‑Cause Diagnosis
The failure was traced to a newly deployed service, bfe‑customer‑application‑query‑svc, running with only two pods per zone. The previous day, the confirmEvaluate API's traffic share had been increased from 50% to 100%, doubling its QPS.
Initial clues:
The service had only two pod instances.
The QPS of confirmEvaluate roughly doubled.
Latency of a downstream call, getTagInfo, jumped from ~10 ms to ~500 ms, driving the API's overall RT up to ~500 ms.
These indicated a capacity problem, but the exact resource shortage was unclear.
2.1 Tomcat Thread‑Pool Saturation
Pod monitoring ruled out CPU and memory exhaustion. The app does not use a database, so connection‑pool limits were irrelevant. The focus turned to Tomcat’s thread pool.
Metrics showed that both pods reached the maximum number of threads (maxThreads) at 08:54 and 08:53:30 respectively, meaning the thread pool was fully occupied.
Tomcat’s thread pool extends java.util.concurrent.ThreadPoolExecutor and relies on its execute method. The expansion logic is:
If current workers < corePoolSize, add a new worker.
Otherwise try to queue the task.
If queuing fails, attempt to add a new worker (the expansion step).
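A minimal sketch of that decision flow, paraphrased from java.util.concurrent.ThreadPoolExecutor#execute (helper names are simplified, not the literal JDK internals):

```java
// Paraphrase of ThreadPoolExecutor#execute's decision flow.
public void execute(Runnable task) {
    if (workerCount() < corePoolSize()) {
        addWorker(task);                 // 1. below core size: start a new worker
    } else if (workQueue().offer(task)) {
        // 2. core threads busy: the queue accepted the task
    } else if (!addWorker(task)) {       // 3. queue refused: try to grow toward maxThreads
        reject(task);                    // 4. already at maxThreads and queue full
    }
}
```

Step 3 is the "expansion step": whether it is ever reached depends entirely on what the queue's offer method decides to do, which is exactly where Tomcat hooks in.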
The custom TaskQueue (a subclass of LinkedBlockingQueue) overrides offer:
If the pool is at max threads, put the task into the queue.
If the number of submitted‑but‑unfinished tasks is no greater than the current worker count, queue the task; an idle worker will pick it up immediately.
Otherwise return false to trigger pool expansion.
Once the pool has grown to maxThreads, every new task falls into the first branch and is simply queued behind the busy workers, producing request back‑pressure.
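In code, the override looks roughly like this (condensed from Tomcat's org.apache.tomcat.util.threads.TaskQueue, not a verbatim copy):

```java
// Condensed view of Tomcat's TaskQueue#offer; "parent" is Tomcat's
// ThreadPoolExecutor, which tracks submitted-but-unfinished tasks.
@Override
public boolean offer(Runnable task) {
    if (parent.getPoolSize() == parent.getMaximumPoolSize()) {
        return super.offer(task);   // pool already at maxThreads: queue the task
    }
    if (parent.getSubmittedCount() <= parent.getPoolSize()) {
        return super.offer(task);   // an idle worker exists: it will take the task at once
    }
    if (parent.getPoolSize() < parent.getMaximumPoolSize()) {
        return false;               // make execute() create another worker
    }
    return super.offer(task);       // fallback: queue the task
}
```

The "return false" branch is what lets Tomcat keep creating threads up to maxThreads, even though a plain LinkedBlockingQueue would normally absorb the task first and prevent the pool from ever growing past corePoolSize.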
2.2 Impact on Health‑Check Probes
The service uses Spring Boot's actuator /health endpoint as its liveness probe. When the thread pool is saturated, the probe request has to wait in the queue behind pending business requests and often exceeds the 1 s probe timeout, so the pod is marked unhealthy and restarted.
Pod logs showed health‑check failures on both pods at the moments when the thread pool hit max threads, confirming the link between saturation and pod restarts.
2.3 Deep Dive into the Bottleneck
Further analysis revealed that the confirmEvaluate method fans out two sub‑tasks to a custom thread pool, BizExecutorsUtils.EXECUTOR (max 20 threads, queue capacity over 1,000). The calling code then invokes future.get() on each sub‑task with no timeout, blocking the Tomcat worker thread until both sub‑tasks finish.
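A hypothetical reconstruction of that pattern (only confirmEvaluate, getTagInfo, and BizExecutorsUtils.EXECUTOR come from the post‑mortem; the other names are illustrative placeholders):

```java
// Hypothetical reconstruction of the blocking fan-out described above.
// BizExecutorsUtils.EXECUTOR is the custom pool (max 20 threads, long queue);
// TagInfo/EvalInfo, getEvalInfo and assemble are illustrative placeholders.
public EvaluateResult confirmEvaluate(EvaluateRequest request) throws Exception {
    // Two sub-tasks are handed to the small custom pool.
    Future<TagInfo>  tagFuture  = BizExecutorsUtils.EXECUTOR.submit(() -> getTagInfo(request));
    Future<EvalInfo> evalFuture = BizExecutorsUtils.EXECUTOR.submit(() -> getEvalInfo(request));

    // The Tomcat worker thread blocks here with no timeout. When the custom
    // pool's queue is long, this wait stretches to seconds, and the Tomcat
    // thread contributes nothing while it waits.
    TagInfo tag   = tagFuture.get();
    EvalInfo eval = evalFuture.get();

    return assemble(tag, eval);
}
```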
This custom pool's capacity (20 threads) is only one tenth of the Tomcat pool's (200 threads), yet after the traffic cut‑over it had to absorb roughly twice the incoming request rate, creating a bottleneck.
Trace data showed a ~6 s delay between the upstream call returning and the sub‑tasks actually starting, confirming long queue times in the custom pool.
2.4 Diagnosis Conclusion
The outage was caused by:
Insufficient pod count (only two pods per zone).
A traffic surge after the confirmEvaluate flow was switched from 50% to 100%.
A custom thread pool with too few workers (20) and an oversized queue (over 1,000 entries).
Blocking future.get() calls with no timeout, leaving Tomcat worker threads waiting on the custom pool.
The combination exhausted Tomcat’s worker threads, leading to health‑check failures and zone‑wide downtime.
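For context, a minimal illustration, not taken from the original write‑up, of how the blocking wait in the last point could be bounded so that a slow sub‑task degrades a single request instead of holding a Tomcat worker indefinitely:

```java
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative helper, not from the original service: wait on a sub-task with
// a budget. The budget value below is an arbitrary example.
public final class BoundedWait {
    public static <T> T awaitWithBudget(Future<T> future, long budgetMillis) throws Exception {
        try {
            return future.get(budgetMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);   // also try to free the custom-pool worker
            throw e;               // caller can fall back or return a degraded response
        }
    }
}
```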
3 Recommendations
3.1 Ensure Adequate Redundancy and Capacity
Core services should run with more than the minimal two‑pod redundancy; additional pods provide a safety margin for traffic spikes and resource saturation.
3.2 Perform Realistic Load Tests
Every new service, especially an I/O‑intensive one, must be stress‑tested under expected traffic patterns, including any custom thread‑pool configuration.
3.3 Monitor Finite Resources
Expose metrics for thread‑pool usage, queue length, and connection‑pool saturation. Early alerts on high utilization can prevent cascading failures.
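A minimal sketch of what such instrumentation could look like for the custom pool (class and method names are placeholders; a metrics library such as Micrometer can bind the same gauges automatically):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Placeholder sketch: periodically sample a ThreadPoolExecutor's gauges so a
// metrics/alerting backend can warn before the pool saturates.
public final class ExecutorPoolWatcher {
    public static void watch(String name, ThreadPoolExecutor pool) {
        ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();
        reporter.scheduleAtFixedRate(() -> {
            // Alert, for example, when active threads stay near the maximum
            // or the queue length keeps growing between samples.
            System.out.printf("%s active=%d poolSize=%d max=%d queued=%d%n",
                    name, pool.getActiveCount(), pool.getPoolSize(),
                    pool.getMaximumPoolSize(), pool.getQueue().size());
        }, 0, 15, TimeUnit.SECONDS);
    }
}
```

Alerting on these gauges gives warning well before requests start timing out or health checks start failing.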
4 Conclusion
The post‑mortem demonstrates how a seemingly minor configuration (a custom executor) can amplify traffic spikes into a full outage. By applying rigorous capacity planning, thorough testing, and comprehensive monitoring, similar incidents can be avoided.