Debugging Java Outages: HikariCP Thread Pool, CPU Load & Connection Timeouts

After an Alibaba Cloud RDS host failure triggered an HA switch, the bme‑trade‑order‑svc service suffered prolonged unavailability. This article dissects how thread‑pool saturation, HikariCP connection‑pool mechanics, and CPU load interact, and walks through the investigation that pinpointed thread waiting, pod restarts, and CPU throttling as the root causes.

Huolala Tech

1. Background

On August 30 at around 15:50, an Alibaba Cloud RDS host failure triggered an HA switch for the core order database mysql.order_sharding. Several applications that depend on this database became unavailable, repeating a pattern seen over the past six months. The incident timeline provided by the NOC is shown in the table below.

Time                       Operation                            Duration
15:51:03 - 15:51:48        Database completed HA switch         38 seconds (expected)
Database recovery - 15:55  Transaction applications recovered   >3 minutes

The recovery time exceeding three minutes indicated a serious issue. A special investigation team was formed to find the root cause.

2. Information Gathering

Key observations:

After HA, the order‑related application bme‑trade‑order‑svc showed a CPU load spike; this service is heavily used by upstream applications.

The initial hypothesis: during the database failure, users retried requests and created many Tomcat worker threads; once the database recovered, the large number of threads kept CPU usage high and left the service unavailable. The two questions below test that hypothesis.

2.1 Does retry always create many worker threads?

Not necessarily. In a typical full‑link pressure test at 1.5× normal traffic, the number of worker threads only rose slightly, peaking at 38, far below the max of 200.

Because the pool reuses idle workers, the thread count stays low unless workers are blocked for a long time on a frequently called operation, such as obtaining a database connection.
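The effect can be illustrated with a minimal sketch. It uses a plain java.util.concurrent.ThreadPoolExecutor as a stand-in for the Tomcat worker pool (an assumption made to keep the example self-contained; Tomcat's own executor differs in detail), and the class name, request count, and 5 ms gap are hypothetical. Fast requests are served by reused workers, while requests stuck on an unavailable dependency each pin a worker and force the pool to grow.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class WorkerGrowthDemo {
    static final int REQUESTS = 20;

    // Starts a new worker only when no idle worker can take the task,
    // capped at 200 like the worker limit mentioned above.
    static ThreadPoolExecutor newPool() {
        return new ThreadPoolExecutor(0, 200, 60, TimeUnit.SECONDS, new SynchronousQueue<>());
    }

    // Requests that finish quickly: the worker is idle again before the next one arrives.
    static int peakWithFastRequests() {
        ThreadPoolExecutor pool = newPool();
        try {
            for (int i = 0; i < REQUESTS; i++) {
                pool.submit(() -> { }).get(); // wait for the request to complete
                Thread.sleep(5);              // hypothetical gap between requests
            }
            return pool.getLargestPoolSize();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    // Requests that block until a dependency comes back: every request pins a worker.
    static int peakWithBlockedRequests() {
        ThreadPoolExecutor pool = newPool();
        CountDownLatch dependencyBack = new CountDownLatch(1);
        try {
            for (int i = 0; i < REQUESTS; i++) {
                pool.execute(() -> {
                    try {
                        dependencyBack.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return pool.getLargestPoolSize();
        } finally {
            dependencyBack.countDown(); // release the blocked workers
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("peak workers, fast requests:    " + peakWithFastRequests());
        System.out.println("peak workers, blocked requests: " + peakWithBlockedRequests());
    }
}
```

The same number of requests is submitted in both runs; only the time each request holds its worker changes the peak thread count.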

2.2 Does a high thread count always cause CPU overload?

No. Most threads are in waiting states, releasing CPU resources. The configured max of 200 threads is well within the CPU capacity.

3. Finding Clues

3.1 Thread count surge due to waiting?

Yes. Java has six thread states (NEW, RUNNABLE, BLOCKED, WAITING, TIMED_WAITING, TERMINATED); three of them represent waiting: BLOCKED, WAITING, and TIMED_WAITING. Most bme‑trade‑order‑svc threads were in TIMED_WAITING, waiting for a database connection.

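HikariCP uses a SynchronousQueue (a hand‑off queue) to pass connections directly to waiting threads, which avoids busy‑waiting loops. A thread that finds no idle connection parks on this queue with a timeout, which is why the waiting threads show up as TIMED_WAITING rather than consuming CPU.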
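The thread state produced by a timed wait on a hand-off queue can be checked with the JDK alone. The sketch below does not use HikariCP; it only reproduces the waiting pattern with java.util.concurrent.SynchronousQueue, and the class name, thread name, and timeouts are hypothetical.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;

class HandoffWaitDemo {
    // Returns the state of a thread that is waiting, with a timeout, for a hand-off.
    static Thread.State observeWaiterState() {
        SynchronousQueue<String> handoff = new SynchronousQueue<>();
        Thread borrower = new Thread(() -> {
            try {
                handoff.poll(30, TimeUnit.SECONDS); // wait for someone to hand over a "connection"
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "borrower");
        borrower.setDaemon(true);
        borrower.start();
        try {
            // Give the borrower time to park; stop polling after 5 seconds at most.
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            while (borrower.getState() != Thread.State.TIMED_WAITING
                    && System.nanoTime() < deadline) {
                Thread.sleep(10);
            }
            Thread.State observed = borrower.getState();
            handoff.offer("connection", 1, TimeUnit.SECONDS); // hand off and release the borrower
            borrower.join(1000);
            return observed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("borrower state while waiting: " + observeWaiterState());
    }
}
```

A thread dump taken during the incident would show the same state for each request thread parked while waiting for a connection.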

3.2 Is CPU load caused by thread surge?

No. The CPU usage dropped while the thread count peaked, confirming that waiting threads do not consume CPU.

Later, after a period of zero active threads, CPU usage spiked again due to pod restarts and lack of warm‑up.

4. Can We Recover Faster?

4.1 Adjust Connection‑Pool Timeout

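The HikariCP connectionTimeout, the maximum time a caller waits for a connection from the pool, should be short (e.g., 1 s); the default is 30 s. A long timeout keeps request threads parked while the pool is exhausted and hides the underlying problem instead of failing fast.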
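To show how the timeout bounds the time a request thread is held, the sketch below models a pool as a java.util.concurrent.Semaphore with one permit per connection. This is an illustrative model, not HikariCP's implementation; in a real service the knob is HikariCP's connectionTimeout (HikariConfig.setConnectionTimeout, in milliseconds). The class and method names and the 100 ms and 1000 ms values are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

class BorrowTimeoutDemo {
    // Outcome of one borrow attempt.
    static final class Attempt {
        final boolean acquired;
        final long waitedMillis;

        Attempt(boolean acquired, long waitedMillis) {
            this.acquired = acquired;
            this.waitedMillis = waitedMillis;
        }
    }

    // Tries to take one permit (one "connection") and reports how long the caller waited.
    static Attempt borrow(Semaphore pool, long timeoutMillis) {
        long start = System.nanoTime();
        boolean acquired;
        try {
            acquired = pool.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            acquired = false;
        }
        long waited = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        return new Attempt(acquired, waited);
    }

    public static void main(String[] args) {
        Semaphore exhausted = new Semaphore(0); // every connection is checked out
        Attempt shortWait = borrow(exhausted, 100);
        Attempt longWait = borrow(exhausted, 1_000);
        System.out.println("timeout 100 ms:  acquired=" + shortWait.acquired
                + ", waited " + shortWait.waitedMillis + " ms");
        System.out.println("timeout 1000 ms: acquired=" + longWait.acquired
                + ", waited " + longWait.waitedMillis + " ms");
    }
}
```

With the pool exhausted, each caller is held for roughly the configured timeout, so the timeout multiplied by the incoming request rate approximates how many worker threads pile up before failures surface.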

4.2 Improve Liveness Probe Design

Using actuator/health as a liveness probe ties health checks to the same Tomcat thread pool as business requests. When the pool is exhausted, the probe fails and Kubernetes restarts the pod. Switching to a lightweight exec probe (e.g., kill -0 PID) or a TCP socket probe avoids unnecessary restarts.

livenessProbe:
  exec:
    command:
    - "bash"
    - "-c"
    # kill -0 sends no signal; it only checks that the process in the PID file still exists
    - "kill -0 $(cat /var/run/app.pid) > /dev/null 2>&1"
  initialDelaySeconds: 5
  periodSeconds: 5

4.3 Warm‑up Java Applications

Cold start issues can be mitigated by:

Using CRaC (Coordinated Restore at Checkpoint) to snapshot a warmed JVM.

Running synthetic traffic or targeted warm‑up calls to trigger JIT compilation.

Gradually ramping real traffic (e.g., 1 %, 10 %, …) to allow caches and pools to fill.
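The second option, targeted warm-up calls, could look roughly like the sketch below. It is an assumption-laden illustration: WarmUpGate, quoteInCents, and the iteration count are hypothetical, and how the ready flag is exposed (for example to a readiness check) is not described in the article. The idea is only that the hot path is exercised repeatedly before the instance reports itself ready, giving the JIT a chance to compile it.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntUnaryOperator;

class WarmUpGate {
    private final AtomicBoolean ready = new AtomicBoolean(false);

    // Calls the hot path repeatedly, then opens the gate.
    long warmUp(IntUnaryOperator hotPath, int iterations) {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            checksum += hotPath.applyAsInt(i); // use the result so the call is not dead code
        }
        ready.set(true);
        return checksum;
    }

    // A readiness check could consult this flag before real traffic is admitted.
    boolean isReady() {
        return ready.get();
    }

    // Hypothetical stand-in for real request-handling code.
    static int quoteInCents(int distanceKm) {
        return 500 + distanceKm * 120;
    }

    public static void main(String[] args) {
        WarmUpGate gate = new WarmUpGate();
        System.out.println("ready before warm-up: " + gate.isReady());
        gate.warmUp(WarmUpGate::quoteInCents, 20_000);
        System.out.println("ready after warm-up:  " + gate.isReady());
    }
}
```

In practice the warm-up calls would need to hit the same code paths as production requests, including the connection pool, for the effect to carry over.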

5. Conclusion

The outage was caused by a combination of threads waiting for database connections, pod restarts triggered by an aggressive liveness probe, and CPU throttling while the restarted pods ran cold without warm‑up. Properly configuring HikariCP timeouts, redesigning health probes, and improving application warm‑up can significantly reduce recovery time and prevent similar incidents.
