Why a Custom Thread Pool Crashed Our Service: Deep Dive into Tomcat Thread Saturation

An October 2023 outage of the bfe‑customer‑application‑query‑svc revealed how traffic spikes, downstream latency, and a mis‑configured custom thread pool saturated Tomcat's worker threads, causing health‑check failures, pod restarts, and a complete zone failure.

dbaplus Community

Incident Overview

On 27 Oct 2023 the core service bfe‑customer‑application‑query‑svc experienced a rapid increase in response time, causing zone‑2 to become unavailable. Each zone ran only two Pods, and each Pod used a Tomcat thread pool with a maximum of 200 worker threads.

Initial Diagnosis

Key observations:

The service had just been released and each zone contained only two Pods.

The confirmEvaluate flow was switched from 50 % to 100 % traffic the night before, doubling QPS (see Figure 1).

Latency spikes in the downstream getTagInfo interface raised response times from ~10 ms to ~500 ms.

These factors indicated a capacity‑shortage problem.

Tomcat Thread‑Pool Saturation

Monitoring showed that both Pods quickly reached Max Threads = 200 and the number of available threads dropped to zero (Figures 2‑3). When the pool is full, new tasks are queued, creating back‑pressure.

Tomcat’s pool extends java.util.concurrent.ThreadPoolExecutor. Its execution logic can be expressed as:

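if (workerCount < corePoolSize) {
    addWorker(task, true);       // below core size: start a core thread
} else if (workQueue.offer(task)) {
    // task queued
} else if (workerCount < maximumPoolSize) {
    addWorker(task, false);      // queue refused the task: start a non-core thread
} else {
    reject(task);                // queue full and pool at maximum: rejection handler
}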

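The custom TaskQueue overrides offer() so that, while the pool is below its maximum size and no idle worker is available, it returns false instead of queuing. The executor then falls through to the branch that creates a new thread. Only once the pool has reached its maximum size does offer() actually place the task in the queue.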

Two expansion stages emerge:

Stage 1: workerCount < maximumPoolSize – new threads are created for each incoming task ("aggressive growth").

Stage 2: workerCount == maximumPoolSize – tasks can only be queued, causing latency and possible time‑outs.
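
The two stages can be reproduced with a small, self-contained sketch. EagerQueue and EagerPoolDemo are hypothetical names; the queue is only a simplified stand-in for Tomcat's TaskQueue (the real class also checks for idle workers before refusing a task), and the demo assumes tasks are submitted from a single thread, so no rejection handling is included.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified stand-in for Tomcat's TaskQueue (not the real class).
class EagerQueue extends LinkedBlockingQueue<Runnable> {
    private volatile ThreadPoolExecutor parent;

    void setParent(ThreadPoolExecutor parent) {
        this.parent = parent;
    }

    @Override
    public boolean offer(Runnable task) {
        ThreadPoolExecutor p = parent;
        // Stage 1: below the maximum, refuse to queue so the executor adds a worker.
        if (p != null && p.getPoolSize() < p.getMaximumPoolSize()) {
            return false;
        }
        // Stage 2: the pool is at its maximum, so the task has to wait in the queue.
        return super.offer(task);
    }
}

class EagerPoolDemo {
    // Submits `tasks` blocking tasks and returns {poolSize, queuedTasks}.
    static int[] saturate(int core, int max, int tasks) throws InterruptedException {
        EagerQueue queue = new EagerQueue();
        ThreadPoolExecutor pool =
                new ThreadPoolExecutor(core, max, 60, TimeUnit.SECONDS, queue);
        queue.setParent(pool);
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    release.await(); // keep the worker busy, like a slow request
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        int[] snapshot = { pool.getPoolSize(), queue.size() };
        release.countDown();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return snapshot;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] s = saturate(2, 4, 10);
        System.out.println("threads=" + s[0] + " queued=" + s[1]);
    }
}
```

With core = 2, max = 4 and 10 blocking tasks, the pool grows to 4 threads before anything is queued, and the remaining 6 tasks wait in the queue. That is the same shape as the 200-thread Tomcat pool in the incident, only smaller.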

Health‑Check Failure Cascade

When the pool entered Stage 2, health‑check requests (Spring Boot /actuator/health) were forced to wait in the queue. In zone‑2 the queued health checks timed out after 1 s, triggering Kubernetes pod restarts. With only two Pods, both restarted, leaving the zone completely down.

Call‑Stack of the SOA Call

The service uses a custom SOA framework (hermes) that performs HTTP calls via CloseableHttpAsyncClient. The call flow is:

Submit request to asyncClient.execute(..., callback).

The callback implements FutureCallback with completed(), failed(), cancelled().

The invoking thread blocks on future.get() (no timeout), putting the thread into WAITING state.

Because future.get() blocks indefinitely, Tomcat worker threads remain occupied while the async client processes the request in separate IOReactorWorker threads.
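
One way to keep a slow downstream call from pinning a worker indefinitely is to bound the wait. The following is a minimal sketch, assuming the call exposes a java.util.concurrent.Future; BoundedWait, getOrFallback and the fallback value are hypothetical names for illustration and are not part of hermes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedWait {
    // Waits at most timeoutMillis for the result and returns `fallback` on timeout.
    static <T> T getOrFallback(Future<T> future, long timeoutMillis, T fallback)
            throws InterruptedException, ExecutionException {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // give up on the result so the caller is released
            return fallback;
        }
    }

    public static void main(String[] args) throws Exception {
        // A future that never completes simulates a stalled downstream call.
        CompletableFuture<String> stalled = new CompletableFuture<>();
        System.out.println(getOrFallback(stalled, 100, "fallback"));

        // A completed future returns its value immediately.
        Future<String> done = CompletableFuture.completedFuture("tag-info");
        System.out.println(getOrFallback(done, 100, "fallback"));
    }
}
```

Whether a fallback value or an error response is appropriate depends on the endpoint. The point of the sketch is only that the calling thread is released after a known upper bound.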

Custom Thread‑Pool Bottleneck

The confirmEvaluate method creates two sub‑tasks and submits them to a custom pool BizExecutorsUtils.EXECUTOR:

Maximum pool size: 20 threads.

Work‑queue capacity: > 1000 tasks.

During the traffic surge each Pod received ~20 requests / s, and each request spawned two sub‑tasks that each took ~1 s. The custom pool quickly saturated, leaving many tasks queued (Figure 25). Consequently every Tomcat worker thread handling a request blocked on future.get() until its sub‑tasks finished, which exhausted the Tomcat pool and caused the health‑check backlog.
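
The effect can be reproduced at a reduced scale. The sketch below is an illustration under stated assumptions rather than the service's code: BottleneckDemo and worstWaitMillis are hypothetical names, one fixed pool stands in for BizExecutorsUtils.EXECUTOR, a second pool stands in for the Tomcat workers, and sleeps replace the real sub-tasks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class BottleneckDemo {
    // Simulates `requests` concurrent callers, each fanning out two sub-tasks into
    // a shared pool of `bizThreads` threads and blocking until both have finished.
    // Returns the slowest caller's wait in milliseconds.
    static long worstWaitMillis(int requests, int bizThreads, long taskMillis)
            throws Exception {
        ExecutorService biz = Executors.newFixedThreadPool(bizThreads);   // shared custom pool
        ExecutorService callers = Executors.newFixedThreadPool(requests); // request threads
        try {
            List<Future<Long>> waits = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                Future<Long> wait = callers.submit(() -> {
                    long start = System.nanoTime();
                    Future<?> a = biz.submit(() -> sleep(taskMillis));
                    Future<?> b = biz.submit(() -> sleep(taskMillis));
                    a.get(); // the caller is parked here, like a Tomcat worker
                    b.get();
                    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
                });
                waits.add(wait);
            }
            long worst = 0;
            for (Future<Long> w : waits) {
                worst = Math.max(worst, w.get());
            }
            return worst;
        } finally {
            biz.shutdownNow();
            callers.shutdownNow();
        }
    }

    static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("2 shared threads:  " + worstWaitMillis(8, 2, 50) + " ms");
        System.out.println("16 shared threads: " + worstWaitMillis(8, 16, 50) + " ms");
    }
}
```

With 8 callers and only 2 shared threads, the 16 sub-tasks run in 8 sequential batches, so the slowest caller waits roughly eight times the task duration even though its own work needs just one. Sizing the shared pool to match the fan-out removes that queueing delay.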

Quantitative Estimation

Before the latency spike each Pod handled ~18 requests / s. After the 50 % → 100 % traffic shift the request rate doubled to ~36‑40 requests / s. Each request generated two sub‑tasks, so the custom pool needed to process ~70‑80 tasks per second. With 20 threads and ~1 s per sub‑task, the pool could complete only ~20 tasks per second, so its queue grew continuously, leading to the "bottleneck effect".
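
The arithmetic behind this estimate can be written down as a short sketch. PoolCapacity and its method names are hypothetical; the inputs are the figures quoted above (20 threads, ~1 s per sub-task, two sub-tasks per request, 36 to 40 requests / s).

```java
class PoolCapacity {
    // Maximum tasks per second a pool can finish when every thread is busy.
    static double capacityPerSecond(int threads, double taskSeconds) {
        return threads / taskSeconds;
    }

    // Tasks per second arriving at the pool.
    static double demandPerSecond(double requestsPerSecond, int subTasksPerRequest) {
        return requestsPerSecond * subTasksPerRequest;
    }

    // Little's law: threads needed to keep up = arrival rate x service time.
    static int threadsNeeded(double demandPerSecond, double taskSeconds) {
        return (int) Math.ceil(demandPerSecond * taskSeconds);
    }

    public static void main(String[] args) {
        double capacity = capacityPerSecond(20, 1.0);
        for (double rps : new double[] { 36, 40 }) {
            double demand = demandPerSecond(rps, 2);
            System.out.println(rps + " req/s: demand=" + demand
                    + " tasks/s, capacity=" + capacity
                    + " tasks/s, backlog growth=" + (demand - capacity)
                    + " tasks/s, threads needed=" + threadsNeeded(demand, 1.0));
        }
    }
}
```

At 36 to 40 requests / s the demand is 72 to 80 tasks / s against a capacity of 20, so the backlog grows by 52 to 60 tasks every second, and keeping up would take roughly 72 to 80 threads at the same task duration.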

Root Cause and Contributing Factors

Root cause: Misuse of a custom thread pool with insufficient capacity, creating a bottleneck.

Trigger: Downstream getTagInfo latency spike.

Secondary factors:

Traffic increase due to flow‑switch.

Only two Pods per zone.

Health‑check timeout shorter than worst‑case request latency.

User retries amplifying load.

Recommendations

Avoid custom thread pools for latency‑sensitive paths; use the container‑managed pool or a properly sized dynamic pool.

Monitor key finite resources (Tomcat thread pool, custom pool busy threads, connection pools) with alerts before saturation.

Perform load‑testing that includes realistic traffic spikes and downstream latency variations.

Maintain sufficient pod replicas for core services to absorb sudden load and avoid single‑point failures.

Set health‑check timeouts longer than the worst‑case request latency or make them independent of the main request path.
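
For the monitoring recommendation, java.util.concurrent.ThreadPoolExecutor already exposes the numbers needed for a saturation alert. The following is a minimal sketch; PoolMonitor, the 0.8 utilization threshold and the queue-depth threshold of 100 are assumptions chosen for illustration, not values from the incident.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolMonitor {
    // Fraction of the pool's maximum threads currently running tasks (0.0 to 1.0).
    static double utilization(ThreadPoolExecutor pool) {
        return (double) pool.getActiveCount() / pool.getMaximumPoolSize();
    }

    // True when utilization or queue depth reaches the given thresholds.
    static boolean shouldAlert(ThreadPoolExecutor pool, double maxUtilization, int maxQueued) {
        return utilization(pool) >= maxUtilization || pool.getQueue().size() >= maxQueued;
    }

    // Returns {alert while idle, alert while saturated} for a 2-thread pool.
    static boolean[] demo() throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        boolean idle = shouldAlert(pool, 0.8, 100);

        CountDownLatch started = new CountDownLatch(2);
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < 3; i++) {
            pool.execute(() -> {
                started.countDown();
                try {
                    release.await(); // hold the thread to simulate a slow task
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        started.await(); // both workers are now inside a task
        boolean saturated = shouldAlert(pool, 0.8, 100);

        release.countDown();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return new boolean[] { idle, saturated };
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] r = demo();
        System.out.println("idle alert=" + r[0] + ", saturated alert=" + r[1]);
    }
}
```

In practice these values would be sampled periodically and exported to the existing metrics system, so that an alert fires while there is still headroom rather than after available threads reach zero.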

Figure 1: confirmEvaluate QPS
Figure 2: Tomcat thread pool usage (Pod‑1)
Figure 19: JsonRpcResponse definition