Why a Custom Thread Pool Crashed Our Service: Deep Dive into Tomcat Thread Saturation
An October 2023 outage of the bfe‑customer‑application‑query‑svc revealed how traffic spikes, downstream latency, and a mis‑configured custom thread pool saturated Tomcat's worker threads, causing health‑check failures, pod restarts, and a complete zone failure.
Incident Overview
On 27 Oct 2023 the core service bfe‑customer‑application‑query‑svc experienced a rapid increase in response time, causing zone‑2 to become unavailable. Each zone ran only two Pods, and each Pod used a Tomcat thread pool with a maximum of 200 worker threads.
Initial Diagnosis
Key observations:
The service had just been released and each zone contained only two Pods.
The confirmEvaluate flow was switched from 50 % to 100 % traffic the night before, doubling QPS (see Figure 1).
Latency spikes in the downstream getTagInfo interface raised response times from ~10 ms to ~500 ms.
These factors indicated a capacity‑shortage problem.
Tomcat Thread‑Pool Saturation
Monitoring showed that both Pods quickly reached Max Threads = 200 and the number of available threads dropped to zero (Figures 2‑3). When the pool is full, new tasks are queued, creating back‑pressure.
Tomcat’s pool extends java.util.concurrent.ThreadPoolExecutor. Its execution logic can be expressed as:
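    if (workerCount < corePoolSize) {
        addWorker(task, true);     // start a core thread for the task
    } else if (workQueue.offer(task)) {
        // task queued
    } else if (workerCount < maximumPoolSize) {
        addWorker(task, false);    // start a non-core thread for the task
    } else {
        reject(task);              // pool and queue are both full
    }
Tomcat's custom TaskQueue overrides offer() to return false while all threads are busy and the pool is still below its maximum size, which forces the executor to start a new thread instead of queuing. Once the pool has reached its maximum size, offer() queues the task.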
Two expansion stages emerge:
Stage 1: workerCount < maximumPoolSize – a new thread is created for each incoming task that finds no idle thread ("aggressive growth").
Stage 2: workerCount == maximumPoolSize – tasks can only be queued, causing latency and possible time‑outs.
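The two stages above can be reproduced with a small stand-alone sketch. It does not use Tomcat's classes: EagerQueue, EagerPoolSketch and simulate() are hypothetical names, and the queue is a simplified imitation of the "grow first, queue later" behaviour (Tomcat's real TaskQueue also queues when idle threads are available, which this sketch omits).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class EagerPoolSketch {

    /** Hypothetical queue: refuses tasks while the pool can still grow. */
    static final class EagerQueue extends LinkedBlockingQueue<Runnable> {
        private transient volatile ThreadPoolExecutor parent;

        EagerQueue(int capacity) { super(capacity); }

        void setParent(ThreadPoolExecutor parent) { this.parent = parent; }

        @Override
        public boolean offer(Runnable task) {
            ThreadPoolExecutor p = parent;
            // Stage 1: pool below max, so report "queue full" and let the executor add a thread.
            if (p != null && p.getPoolSize() < p.getMaximumPoolSize()) {
                return false;
            }
            // Stage 2: pool at max, so the task can only wait in the queue.
            return super.offer(task);
        }
    }

    /** Submits blocked tasks and returns {threads created, tasks queued}. */
    static int[] simulate(int core, int max, int tasks) {
        EagerQueue queue = new EagerQueue(1000);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(core, max, 60, TimeUnit.SECONDS, queue);
        queue.setParent(pool);
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    release.await(); // imitate a request stuck on a slow downstream call
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        int[] result = { pool.getPoolSize(), queue.size() };
        release.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result;
    }

    public static void main(String[] args) {
        int[] r = simulate(2, 4, 7);
        System.out.println("threads=" + r[0] + " queued=" + r[1]); // threads=4 queued=3
    }
}
```

With 2 core threads, a maximum of 4 and 7 blocked tasks, the pool first grows to 4 threads and only then queues the remaining 3 tasks, which is the Stage 1 to Stage 2 transition at a smaller scale.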
Health‑Check Failure Cascade
When the pool entered Stage 2, health‑check requests (Spring Boot /actuator/health) were forced to wait in the queue. In zone‑2 the queued health checks timed out after 1 s, triggering Kubernetes pod restarts. With only two Pods, both restarted, leaving the zone completely down.
Call‑Stack of the SOA Call
The service uses a custom SOA framework (hermes) that performs HTTP calls via CloseableHttpAsyncClient. The call flow is:
Submit request to asyncClient.execute(..., callback).
The callback implements FutureCallback with completed(), failed(), cancelled().
The invoking thread blocks on future.get() (no timeout), putting the thread into WAITING state.
Because future.get() blocks indefinitely, Tomcat worker threads remain occupied while the async client processes the request in separate IOReactorWorker threads.
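To illustrate why the missing timeout matters, here is a minimal sketch of a bounded wait. It is not the hermes code, which the source does not show: a plain CompletableFuture stands in for the async HTTP call, and BoundedWaitSketch, callWithTimeout() and the "fallback" value are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedWaitSketch {

    /** Waits at most timeoutMs for the result, then frees the calling thread. */
    static String callWithTimeout(Future<String> future, long timeoutMs) {
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // give up on the slow call
            return "fallback";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "fallback";
        } catch (ExecutionException e) {
            return "fallback"; // the call itself failed
        }
    }

    public static void main(String[] args) {
        // A future that never completes, imitating a stalled downstream call.
        CompletableFuture<String> stalled = new CompletableFuture<>();
        System.out.println(callWithTimeout(stalled, 50)); // fallback
        System.out.println(callWithTimeout(CompletableFuture.completedFuture("tags"), 50)); // tags
    }
}
```

With a bounded wait, a stalled downstream call costs the worker thread at most the timeout instead of holding it in WAITING state indefinitely.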
Custom Thread‑Pool Bottleneck
The confirmEvaluate method creates two sub‑tasks and submits them to a custom pool BizExecutorsUtils.EXECUTOR:
Maximum pool size: 20 threads.
Work‑queue capacity: > 1000 tasks.
During the traffic surge each Pod received ~20 requests / s; each request spawns two sub‑tasks that each take ~1 s. The custom pool quickly saturated, leaving many tasks queued (Figure 25). Consequently the main Tomcat thread waited on future.get() for each sub‑task, exhausting the Tomcat pool and causing the health‑check backlog.
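The fan-out pattern described above can be sketched at a smaller scale. This is an illustration under assumptions, not the service's code: bizPool stands in for BizExecutorsUtils.EXECUTOR, requestThreads stands in for Tomcat's workers, and a latch imitates the slow downstream call. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class FanOutBottleneckSketch {

    /** Returns how many sub-tasks sit in the queue while the downstream is stalled. */
    static int queuedSubTasks(int poolThreads, int requests) {
        ThreadPoolExecutor bizPool = new ThreadPoolExecutor(
                poolThreads, poolThreads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(1000));
        ExecutorService requestThreads = Executors.newFixedThreadPool(requests);
        CountDownLatch downstream = new CountDownLatch(1);       // opened when the downstream "answers"
        CountDownLatch submitted = new CountDownLatch(requests); // all requests have fanned out
        Callable<String> subTask = () -> {
            downstream.await();
            return "ok";
        };
        List<Future<String>> responses = new ArrayList<>();
        for (int i = 0; i < requests; i++) {
            responses.add(requestThreads.submit(() -> {
                Future<String> first = bizPool.submit(subTask);
                Future<String> second = bizPool.submit(subTask);
                submitted.countDown();
                return first.get() + second.get(); // the request thread is parked here
            }));
        }
        try {
            submitted.await();
            int queued = bizPool.getQueue().size();
            downstream.countDown(); // let everything drain
            for (Future<String> response : responses) {
                response.get(5, TimeUnit.SECONDS);
            }
            return queued;
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            throw new IllegalStateException(e);
        } finally {
            bizPool.shutdownNow();
            requestThreads.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // 5 requests fan out into 10 sub-tasks; 2 run, 8 wait in the queue.
        System.out.println(queuedSubTasks(2, 5)); // 8
    }
}
```

Every request thread whose sub-tasks are still queued stays blocked in get(), so the small pool's backlog translates directly into occupied request threads.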
Quantitative Estimation
Before the latency spike each Pod handled ~18 requests / s. After the 50 % → 100 % traffic shift the request rate doubled to ~36‑40 requests / s. Each request generated two sub‑tasks, so the custom pool needed to process ~70‑80 tasks per second. With 20 threads and ~1 s per sub‑task, the pool could complete only about 20 tasks per second, so its queue grew continuously: the "bottleneck effect".
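The estimate can be written out as a short calculation. This is a back-of-the-envelope sketch using the figures quoted above; the class and method names are hypothetical, and it assumes every sub-task takes a constant ~1 s.

```java
final class PoolCapacityEstimate {

    /** Tasks per second a pool can finish when each task occupies one thread for taskSeconds. */
    static double capacityPerSecond(int threads, double taskSeconds) {
        return threads / taskSeconds;
    }

    /** Sub-tasks per second generated by the incoming requests. */
    static double demandPerSecond(double requestsPerSecond, int subTasksPerRequest) {
        return requestsPerSecond * subTasksPerRequest;
    }

    /** How fast the queue grows (tasks per second); zero if the pool keeps up. */
    static double backlogGrowthPerSecond(double demand, double capacity) {
        return Math.max(0.0, demand - capacity);
    }

    public static void main(String[] args) {
        double capacity = capacityPerSecond(20, 1.0); // 20 threads, ~1 s per sub-task
        double demand = demandPerSecond(36, 2);       // lower bound of ~36-40 requests/s
        System.out.println("capacity=" + capacity);   // capacity=20.0
        System.out.println("demand=" + demand);       // demand=72.0
        System.out.println("backlog growth=" + backlogGrowthPerSecond(demand, capacity)); // backlog growth=52.0
    }
}
```

Even at the lower bound of the request rate, the queue gains roughly 52 tasks every second, so waiting time for each new sub-task keeps increasing for as long as the downstream stays slow.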
Root Cause and Contributing Factors
Root cause: Misuse of a custom thread pool with insufficient capacity, creating a bottleneck.
Trigger: Downstream getTagInfo latency spike.
Secondary factors:
Traffic increase due to flow‑switch.
Only two Pods per zone.
Health‑check timeout shorter than worst‑case request latency.
User retries amplifying load.
Recommendations
Avoid custom thread pools for latency‑sensitive paths; use the container‑managed pool or a properly sized dynamic pool.
Monitor key finite resources (Tomcat thread pool, custom pool busy threads, connection pools) with alerts before saturation.
Perform load‑testing that includes realistic traffic spikes and downstream latency variations.
Maintain sufficient pod replicas for core services to absorb sudden load and avoid single‑point failures.
Set health‑check timeouts longer than the worst‑case request latency or make them independent of the main request path.
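As one possible reading of the first two recommendations, the sketch below bounds the queue so that overload fails fast instead of accumulating silently, and exposes a queue-fill ratio that an alert could watch. The sizes, names and the choice of AbortPolicy are assumptions for illustration, not the fix the team applied.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedPoolSketch {

    /** Fixed-size pool with a small bounded queue; excess tasks are rejected immediately. */
    static ThreadPoolExecutor newBoundedPool(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
    }

    /** Fraction of the queue in use, from 0.0 to 1.0; a candidate metric for alerting. */
    static double queueFill(ThreadPoolExecutor pool) {
        BlockingQueue<Runnable> queue = pool.getQueue();
        int capacity = queue.size() + queue.remainingCapacity();
        return capacity == 0 ? 1.0 : (double) queue.size() / capacity;
    }

    /** Submits blocked tasks and counts how many the pool refuses. */
    static int rejectedCount(int threads, int queueCapacity, int tasks) {
        ThreadPoolExecutor pool = newBoundedPool(threads, queueCapacity);
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // imitate a slow downstream call
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // the caller learns about overload at once
                }
            }
            return rejected;
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // 2 threads + 3 queue slots absorb 5 of 10 blocked tasks; the other 5 are rejected.
        System.out.println(rejectedCount(2, 3, 10)); // 5
    }
}
```

Failing fast trades some rejected requests for bounded latency: callers get an immediate error they can handle, rather than waiting behind a queue of more than 1000 tasks.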
dbaplus Community
