Why Our Core Service Crashed: Tomcat Thread Pool Bottlenecks & Custom Executor Pitfalls
On October 27, 2023, a surge in request volume combined with downstream latency caused bfe‑customer‑application‑query‑svc to exhaust its Tomcat thread pool, which triggered health‑check failures and pod restarts. A deep dive revealed that a poorly designed custom executor and blocking waits with no timeout created a bottleneck that amplified the outage.
1. Rescue
On Friday, October 27, 2023, while commuting to work, I received an urgent Feishu alert at around 08:50. The core service bfe‑customer‑application‑query‑svc was showing a rapid rise in response time (RT), and all of zone‑2 had become unavailable. After intense emergency handling (rate limiting, fallback configuration, and scaling), the equivalent of CPR on a patient, the service came back after several minutes and zone‑2 fully recovered.
2. Diagnosing the Root Cause
The "patient" was temporarily saved, but the cause remained unknown and could recur. Upon arrival at the office, I began tracing the fault.
Useful information gathered during the emergency:
bfe‑customer‑application‑query‑svc was a newly deployed application; each zone had only 2 Pod instances.
The previous night, the confirm‑evaluate interface (a second‑stage pricing API) was switched from 50% to 100% traffic. During the incident, its QPS was about twice the previous day's same‑time level (see Figure 1).
Preliminary guess: a downstream interface getTagInfo experienced timeout jitter, causing RT to rise to 500 ms (normal P95 ≈ 10 ms) for about 1 s.
These clues point to a "capacity shortage", but of which resource: CPU, memory, threads, or network connections? A deeper analysis was required.
2.1 Initial Symptom: Tomcat Thread‑Pool Saturation
Pod monitoring quickly ruled out CPU and memory exhaustion. The application does not use a database or a connection pool for synchronous I/O, so the most likely culprit was the worker threads.
Tomcat thread‑pool metrics (see Figures 2 and 3) show that Pod‑1 reached the maximum available threads at 08:54:00, and Pod‑2 did so earlier at 08:53:30.
Understanding the thread‑pool expansion logic is essential.
Metric | Code definition
Available Threads | getPoolSize()
Max Threads | getMaximumPoolSize()
Tomcat’s thread pool (org.apache.tomcat.util.threads.ThreadPoolExecutor) extends java.util.concurrent.ThreadPoolExecutor and reuses its execute method (see Figure 4; the line numbers below refer to that figure). The expansion steps are:
Line 25‑26: If current workers < corePoolSize, create a new thread.
Line 30: Attempt to enqueue the task into workQueue.
Line 37: If enqueue fails, try to add a new worker thread (addWorker(command, false)).
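The three steps can be observed with the plain JDK executor. The sketch below is an illustration, not Tomcat's code: the pool sizes (core 1, max 2), the one-slot ArrayBlockingQueue, and the class name ExecuteOrderDemo are assumptions chosen so that each step is visible.

```java
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class ExecuteOrderDemo {
    /** Returns {poolSize, queueSize} observed after each of three submissions. */
    static int[][] run() throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);
        // Assumed sizes: core 1, max 2, queue capacity 1.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 2, 60, TimeUnit.SECONDS, new ArrayBlockingQueue<>(1));
        // Every task blocks until released, so no worker ever becomes idle.
        Runnable blocked = () -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        int[][] snapshots = new int[3][];
        for (int i = 0; i < 3; i++) {
            pool.execute(blocked);
            snapshots[i] = new int[] { pool.getPoolSize(), pool.getQueue().size() };
        }
        release.countDown();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return snapshots;
    }

    public static void main(String[] args) throws InterruptedException {
        // Task 1 starts a core thread, task 2 is queued, task 3 finds the
        // queue full and forces a second (non-core) thread.
        System.out.println(Arrays.deepToString(run())); // [[1, 0], [1, 1], [2, 1]]
    }
}
```

With a plain bounded queue the pool only grows after the queue is full, which is the behaviour Tomcat's TaskQueue changes.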
The workQueue implementation is a custom TaskQueue (subclass of LinkedBlockingQueue) with an overridden offer method (Figure 5):
If the pool has reached max threads, the task is directly queued.
If the pool has not reached max threads and submitted task count < current workers, the task is queued (will be taken immediately by an idle thread).
Otherwise, if workers < max, offer returns false, causing execute to expand the pool.
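The three rules above can be written as a small decision function. This is a simplified model that follows the description in this article, not a copy of Tomcat's source; the class TaskQueueModel, the Decision enum, and the parameter names are hypothetical, and the exact comparison Tomcat uses may differ between versions.

```java
final class TaskQueueModel {
    enum Decision { ENQUEUE, GROW_POOL }

    /**
     * Models what offer() decides for one incoming task.
     * ENQUEUE means offer() puts the task on the queue;
     * GROW_POOL means offer() returns false so execute() adds a worker.
     */
    static Decision offer(int poolSize, int maxPoolSize, int submittedCount) {
        if (poolSize == maxPoolSize) {
            return Decision.ENQUEUE;   // rule 1: pool is at max, nothing left to grow
        }
        if (submittedCount < poolSize) {
            return Decision.ENQUEUE;   // rule 2: an idle worker will take the task
        }
        return Decision.GROW_POOL;     // rule 3: all workers busy and poolSize < max
    }

    public static void main(String[] args) {
        System.out.println(offer(200, 200, 500)); // ENQUEUE
        System.out.println(offer(10, 200, 5));    // ENQUEUE
        System.out.println(offer(10, 200, 10));   // GROW_POOL
    }
}
```

The effect is that the pool grows to its maximum before any task has to wait in the queue, the opposite of the default JDK ordering.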
When the pool enters "Stage 2" (workers = max), it can no longer create new threads; incoming tasks queue up, leading to request backlog. Figures 2 and 3 confirm that both Pods entered this saturated state, causing request queuing.
2.2 Consequence of Saturation: Queued Health‑Check Requests Trigger Pod Restarts
The service uses Spring Boot's health‑check endpoint, actuator/health. When the thread pool is saturated, health‑check requests wait in the same queue as business requests (see Figure 7). The health‑check timeout is 1 s; probes that sat in the queue longer than that failed, and Kubernetes restarted the pods (Figures 8 and 9). With only two Pods in zone‑2, their successive restarts made the whole zone unavailable.
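The failure mode can be reproduced in miniature. In the sketch below, a two-thread pool stands in for the Tomcat workers and a trivial task stands in for the health endpoint; the class QueuedProbeDemo, the pool size, and the shortened 200 ms timeout for the saturated case are all assumptions made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class QueuedProbeDemo {
    /** Returns true if the probe is answered within timeoutMillis, false on timeout. */
    static boolean probe(ExecutorService workers, long timeoutMillis) throws Exception {
        Future<String> health = workers.submit(() -> "UP"); // probe shares the worker pool
        try {
            return "UP".equals(health.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            health.cancel(false);
            return false;
        }
    }

    /** Returns {probe result on an idle pool, probe result on a saturated pool}. */
    static boolean[] run() throws Exception {
        ExecutorService workers = Executors.newFixedThreadPool(2);
        boolean idle = probe(workers, 1000);

        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < 2; i++) {
            // Two stuck "business requests" occupy every worker thread.
            workers.submit(() -> { release.await(); return null; });
        }
        boolean saturated = probe(workers, 200); // probe waits in the queue and times out

        release.countDown();
        workers.shutdown();
        return new boolean[] { idle, saturated };
    }

    public static void main(String[] args) throws Exception {
        boolean[] r = run();
        System.out.println("idle pool: " + r[0] + ", saturated pool: " + r[1]);
    }
}
```

The probe itself is cheap; it fails only because it never reaches a worker thread in time.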
2.3 Deep Dive: Tracing the Degradation of Thread‑Pool Capacity
Tomcat thread saturation was identified as the key issue, but why did the pool saturate? The hypothesis was that the downstream getTagInfo latency caused Tomcat threads to wait, exhausting them.
However, thread‑state monitoring (Figures 14 and 15) showed a spike in WAITING threads, not TIMED_WAITING as expected for a simple downstream latency.
Stack traces (Figure 16) revealed threads blocked in Future.get() with no timeout, which puts a thread into WAITING. A typical SOA call does not behave this way: the call flow (Figures 17‑21) shows that the async HTTP client returns immediately and the Tomcat thread then blocks on future.get(Integer.MAX_VALUE, TimeUnit.MILLISECONDS), which is a timed wait and shows up as TIMED_WAITING. The observed WAITING spike therefore pointed to additional blocking code elsewhere.
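The difference between the two thread states is easy to check with the JDK alone. The sketch below uses a CompletableFuture that is never completed while the caller is being observed; the class FutureGetThreadStates and its method name are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

final class FutureGetThreadStates {
    /** Blocks a helper thread on a pending future and reports the state it settles in. */
    static Thread.State stateWhileBlocked(boolean withTimeout) throws InterruptedException {
        CompletableFuture<String> pending = new CompletableFuture<>();
        Thread caller = new Thread(() -> {
            try {
                if (withTimeout) {
                    pending.get(Integer.MAX_VALUE, TimeUnit.MILLISECONDS); // timed wait
                } else {
                    pending.get();                                         // untimed wait
                }
            } catch (Exception ignored) {
                // not expected in this demo
            }
        });
        caller.setDaemon(true);
        caller.start();

        // Poll until the helper thread has parked (give up after 5 s).
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        Thread.State state = caller.getState();
        while (state != Thread.State.WAITING
                && state != Thread.State.TIMED_WAITING
                && System.nanoTime() < deadline) {
            Thread.sleep(10);
            state = caller.getState();
        }
        pending.complete("done"); // release the helper thread
        caller.join(1000);
        return state;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(stateWhileBlocked(false)); // WAITING
        System.out.println(stateWhileBlocked(true));  // TIMED_WAITING
    }
}
```

Even a practically infinite timeout such as Integer.MAX_VALUE milliseconds is reported as TIMED_WAITING, which is why the two kinds of blocking call can be told apart in thread-state monitoring.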
Further investigation uncovered that the business method confirmEvaluate creates two sub‑tasks and submits them to a custom executor, BizExecutorsUtils.EXECUTOR (Figure 23). This executor has a maximum of 20 threads and a large queue (more than 1,000 entries). confirmEvaluate then calls future.get() without a timeout, which is what put the Tomcat threads into WAITING.
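The pattern can be sketched as follows. The real confirmEvaluate code is not shown in this article, so everything here is hypothetical: the class SubTaskWaitSketch, both method names, and the two-thread stand-in for the shared executor. The second method is one common way to bound the wait, included only for contrast; it is not presented as the fix the team applied.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class SubTaskWaitSketch {
    /** The pattern described above: fan out two sub-tasks, then wait with no deadline. */
    static String waitForever(ExecutorService shared, Callable<String> a, Callable<String> b)
            throws InterruptedException, ExecutionException {
        Future<String> fa = shared.submit(a);
        Future<String> fb = shared.submit(b);
        return fa.get() + fb.get(); // caller stays in WAITING until both sub-tasks have run
    }

    /** Contrast: one deadline shared by both sub-tasks, with a fallback value. */
    static String waitWithBudget(ExecutorService shared, Callable<String> a, Callable<String> b,
                                 long budgetMillis, String fallback) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(budgetMillis);
        Future<String> fa = shared.submit(a);
        Future<String> fb = shared.submit(b);
        try {
            String ra = fa.get(deadline - System.nanoTime(), TimeUnit.NANOSECONDS);
            String rb = fb.get(deadline - System.nanoTime(), TimeUnit.NANOSECONDS);
            return ra + rb;
        } catch (TimeoutException | ExecutionException e) {
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } finally {
            // No effect on finished tasks; a cancelled task that is still
            // queued is skipped when a worker eventually reaches it.
            fa.cancel(true);
            fb.cancel(true);
        }
    }

    /** Returns {result while the shared executor is saturated, result once it is free}. */
    static String[] demo() throws Exception {
        ExecutorService shared = Executors.newFixedThreadPool(2);
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < 2; i++) {
            shared.submit(() -> { release.await(); return null; }); // occupy every thread
        }
        String saturated = waitWithBudget(shared, () -> "a", () -> "b", 100, "fallback");
        release.countDown();
        String free = waitWithBudget(shared, () -> "a", () -> "b", 2000, "fallback");
        shared.shutdown();
        return new String[] { saturated, free };
    }

    public static void main(String[] args) throws Exception {
        String[] r = demo();
        System.out.println(r[0] + " / " + r[1]); // fallback / ab
    }
}
```

With waitForever, a caller's latency is bounded only by how long the shared queue takes to drain, so one slow dependency can hold every request thread.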
The custom executor creates a "bottleneck effect" (Figure 25): Tomcat can handle up to 200 concurrent requests, but the executor can only process ~20 tasks per second. When traffic doubled after the 50%→100% switch and getTagInfo latency spiked, the executor’s queue grew, and future.get() blocked Tomcat threads, leading to rapid thread‑pool exhaustion.
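A scaled-down model makes the bottleneck measurable. In the sketch below, a pool of request threads stands in for Tomcat and a smaller shared pool stands in for the custom executor; each request submits two sub-tasks and waits for both. The class BottleneckDemo and all sizes and durations (8 request threads, 2 or 16 shared threads, 50 ms sub-tasks) are assumptions picked to keep the run short, not figures from the incident.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class BottleneckDemo {
    /** Runs one request per request thread and returns the slowest request latency in ms. */
    static long slowestRequestMillis(int requestThreads, int sharedThreads, long subTaskMillis)
            throws Exception {
        ExecutorService requests = Executors.newFixedThreadPool(requestThreads); // "Tomcat"
        ExecutorService shared = Executors.newFixedThreadPool(sharedThreads);    // custom executor
        Callable<Void> subTask = () -> {
            Thread.sleep(subTaskMillis); // simulated downstream call
            return null;
        };
        List<Future<Long>> latencies = new ArrayList<>();
        for (int i = 0; i < requestThreads; i++) {
            latencies.add(requests.submit(() -> {
                long start = System.nanoTime();
                Future<Void> a = shared.submit(subTask);
                Future<Void> b = shared.submit(subTask);
                a.get(); // request thread blocks here while the sub-tasks sit in the queue
                b.get();
                return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            }));
        }
        long slowest = 0;
        for (Future<Long> latency : latencies) {
            slowest = Math.max(slowest, latency.get());
        }
        requests.shutdown();
        shared.shutdown();
        return slowest;
    }

    public static void main(String[] args) throws Exception {
        // 16 sub-tasks of 50 ms on 2 threads need at least 8 rounds; on 16 threads, one round.
        System.out.println("narrow executor, slowest request ms: " + slowestRequestMillis(8, 2, 50));
        System.out.println("wide executor, slowest request ms:   " + slowestRequestMillis(8, 16, 50));
    }
}
```

The sub-task duration is identical in both runs; only the time spent waiting in the shared queue differs, and that wait is paid by a blocked request thread.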
Trace logs (Figure 26) showed a ~6 s delay between the upstream get_user_info call and the start of the sub‑tasks, confirming the queue wait time.
2.4 Diagnosis Conclusion
Root cause: Business code misused a custom thread pool, causing a "bottleneck effect".
Trigger: Downstream getTagInfo latency jitter.
Other contributing factors:
Traffic increase due to the flow switch.
Few Pods (only 2 per zone).
Fragile health‑check mechanism.
User retries amplifying traffic.
Conclusion: Custom thread‑pool misuse limited Tomcat throughput. After the flow switch on Oct 26, request volume doubled, pushing the custom executor near its 20 tasks/s limit. The subsequent getTagInfo latency spike caused task execution time to rise, quickly collapsing the executor, saturating Tomcat, queuing health‑check probes, and triggering pod restarts that made zone‑2 unavailable.
3. Recommendations
3.1 Ensure Stability Budgets and Proper Capacity Planning for I/O‑Intensive Services
Deploying only two Pods for a core service provides only minimal redundancy and is risky for I/O‑bound workloads. Capacity should be evaluated through load testing and by monitoring thread pools, connection pools, and bandwidth, not just CPU usage.
3.2 Validate Performance Assumptions with Load Tests
Even if a feature works in production, load‑testing can reveal hidden bottlenecks such as custom executor limits. Assumptions like "50% traffic ran fine, 100% will be fine" are unsafe without performance verification.
3.3 Monitor Finite Resources
Track thread‑pool saturation, connection‑pool usage, and other limited resources. Early alerts on high busy‑thread ratios could have warned of the impending crisis during the 50% flow switch.
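A busy-thread ratio can be derived from the standard executor getters. The sketch below uses the JDK's ThreadPoolExecutor; the class PoolSaturationCheck, the method names, and the 0.7 threshold are hypothetical, and wiring the value into a real metrics or alerting system is left out.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PoolSaturationCheck {
    /** Fraction of the maximum pool size that is currently executing tasks. */
    static double busyRatio(ThreadPoolExecutor pool) {
        // getActiveCount() is documented as approximate, so treat this as a gauge.
        return (double) pool.getActiveCount() / pool.getMaximumPoolSize();
    }

    static boolean shouldAlert(ThreadPoolExecutor pool, double threshold) {
        return busyRatio(pool) >= threshold;
    }

    /** Occupies 3 of 4 threads and returns the ratio observed at that moment. */
    static double demoRatio() throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4, 0, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        CountDownLatch started = new CountDownLatch(3);
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < 3; i++) {
            pool.execute(() -> {
                started.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        started.await(); // all three tasks are now running
        double ratio = busyRatio(pool);
        release.countDown();
        pool.shutdown();
        return ratio;
    }

    public static void main(String[] args) throws InterruptedException {
        double ratio = demoRatio();
        System.out.println("busy ratio = " + ratio + ", alert at 0.7: " + (ratio >= 0.7));
    }
}
```

Sampling this ratio for both the request pool and any shared business executor would expose the smaller pool as the first to saturate.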
4. Closing Remarks
This article concludes the series, offering a deep technical post‑mortem of a real‑world incident. It emphasizes the importance of thorough capacity planning, proper monitoring, and disciplined performance testing to avoid similar outages.
For earlier articles in the series, see the links below:
Deep Analysis: GC Issues Caused by Large Object Allocation
Thread‑Pool Reject Issue Root Cause Analysis
Resolving Core Service Unavailability After Primary DB Migration
