How a GC, Thread Pool, and Slow SQL Combo Crippled a Java Service – Deep Postmortem & Fixes
A real‑world production incident where GC pauses, thread‑pool exhaustion, and slow SQL combined to drop QPS from 3000 to 1400 and inflate response times from 200 ms to over 2 s, with detailed analysis, diagnostic criteria, and step‑by‑step optimizations that restored performance.
Incident Background
Service architecture: Spring Boot with embedded Tomcat
JVM heap: 8 GB
Server: 16 CPU / 32 GB RAM
Deployment: single instance
Symptom Summary
CPU usage: 35%–45%
Load average: 5–7
Memory: sufficient
QPS: 3000 → 1400
Response time (RT): 200 ms → 2 s+
Error rate: essentially zero
System-level metrics look normal, yet the business is clearly degraded.
1️⃣ GC Dimension – Was STW stealing time?
Key metrics to watch (not just heap size) – an in-process probe sketch follows this list
Minor GC count – abnormal frequency?
GC pause duration – any pause > 200 ms?
Total STW time – does it coincide with QPS drop?
Old generation usage – steady increase?
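For reference, the collection counts, cumulative collection time, and old-generation usage can be read in-process through the standard management beans; per-pause durations are better taken from the GC log shown in the next subsection. A minimal sketch (class and method names are illustrative, and the old-gen pool match is a loose heuristic that varies by collector):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class GcProbe {

    // Logs collection counts, cumulative collection time, and old-gen usage.
    // The "old"/"tenured" name match is a heuristic; pool names differ per collector.
    public static void dump() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d totalTimeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName().toLowerCase();
            if (name.contains("old") || name.contains("tenured")) {
                long usedMb = pool.getUsage().getUsed() >> 20;
                long maxMb = pool.getUsage().getMax() >> 20; // -1 if the pool reports no max
                System.out.printf("old gen: used=%dMB max=%dMB%n", usedMb, maxMb);
            }
        }
    }
}
```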
Observed signals in this incident
Minor GC count spiked dramatically
GC pauses of 1–3 seconds
QPS sharply fell during GC peaks
```
[GC pause (Allocation Failure) (young) 2.94s]
```
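For reference, pauses like the one above are read from the GC log. On a JDK 8 setup such as the one described here, that log is typically enabled with flags along these lines (the log path is a placeholder; JDK 9+ replaces these flags with -Xlog:gc*):

```
-Xloggc:/var/log/app/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
```

The last flag also reports total stopped time, which is what you correlate with the QPS dips.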
Diagnosis criteria
STW pause > 1 s and frequent
QPS tightly correlated with GC pause
CPU not high, yet RT clearly inflated
GC Optimizations (directly applicable)
Original JVM options (problematic version):

```
-Xms8g
-Xmx8g
-XX:+UseConcMarkSweepGC
```

Optimized JVM options:

```
-Xms12g
-Xmx12g
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
```

Code-level tweaks (a minimal sketch follows this list):
Reduce temporary object creation
Avoid allocating large objects on hot paths
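A minimal sketch of both tweaks, using hypothetical class and method names rather than the incident's actual code: pre-size builders and reuse a per-thread scratch buffer instead of allocating fresh temporaries and large arrays on every request.

```java
import java.io.IOException;
import java.io.InputStream;

public class ReportRenderer {

    // Reused per-thread scratch buffer instead of a fresh 1 MB array per request,
    // which would otherwise flood the young generation on the hot path.
    private static final ThreadLocal<byte[]> SCRATCH =
            ThreadLocal.withInitial(() -> new byte[1 << 20]);

    public String render(Iterable<String> lines) {
        StringBuilder sb = new StringBuilder(4 * 1024); // pre-sized to avoid repeated array copies
        for (String line : lines) {
            sb.append(line).append('\n'); // no per-iteration String concatenation garbage
        }
        return sb.toString();
    }

    public int checksum(InputStream in) throws IOException {
        byte[] buf = SCRATCH.get();
        int sum = 0, n;
        while ((n = in.read(buf)) > 0) { // stream through the reused buffer
            for (int i = 0; i < n; i++) sum += buf[i];
        }
        return sum;
    }
}
```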
Result after GC tuning
STW pause reduced from 1–3 s to < 200 ms
QPS recovered from 1400 to > 2700
2️⃣ Thread‑Pool Dimension – Are slow requests exhausting threads?
Metrics to monitor (a polling sketch follows this list)
Active thread count – constantly near max?
Queue length – any queuing observed?
Request wait time – does it grow with concurrency?
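A minimal polling sketch for application-owned pools, assuming a hypothetical executor bizPool; Tomcat's connector pool exposes the equivalent numbers (currentThreadsBusy, maxThreads) through JMX or Actuator.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolWatcher {

    // Periodically prints the three numbers from the list above for a given executor.
    public static void watch(ThreadPoolExecutor bizPool) {
        ScheduledExecutorService ticker = Executors.newSingleThreadScheduledExecutor();
        ticker.scheduleAtFixedRate(() -> {
            int active = bizPool.getActiveCount();       // active thread count
            int max = bizPool.getMaximumPoolSize();
            int queued = bizPool.getQueue().size();      // queue length
            long completed = bizPool.getCompletedTaskCount();
            System.out.printf("active=%d/%d queued=%d completed=%d%n",
                    active, max, queued, completed);
            // Alert when the pool sits near max while the queue keeps growing.
        }, 0, 10, TimeUnit.SECONDS);
    }
}
```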
Observed signals
Tomcat currentThreadsBusy stayed near its maximum for long stretches
RT increased linearly with concurrency
New requests clearly queued
Thread stack (jstack) showed many threads blocked in DB calls:
```
java.sql.PreparedStatement.execute
```

Threads blocked by the DB.
Diagnosis criteria
CPU low while RT high
Thread pool saturated
RT grows with load
jstack shows many threads waiting on I/O/DB
Thread‑pool optimization steps
SQL optimization (key). The offending query:

```sql
SELECT * FROM `order` WHERE user_id = ?;
```

After adding an index:

```sql
CREATE INDEX idx_user_id_status ON `order`(user_id, status);
```

Thread-pool isolation configuration (a Spring sketch follows this list):

```yaml
# Core interface thread pool
core-pool-size: 150
# Non-core interface thread pool
async-pool-size: 50
```

SQL timeout set to 1 s
Fast‑fail for slow requests
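A sketch of how the isolation and timeout could be wired in Spring, assuming the pool sizes above map to two dedicated executors and data access goes through JdbcTemplate; bean names, queue capacities, and thread-name prefixes are illustrative.

```java
import javax.sql.DataSource;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class IsolationConfig {

    // Core-interface pool: matches core-pool-size: 150 above.
    @Bean("corePool")
    public ThreadPoolTaskExecutor corePool() {
        ThreadPoolTaskExecutor ex = new ThreadPoolTaskExecutor();
        ex.setThreadNamePrefix("core-");
        ex.setCorePoolSize(150);
        ex.setMaxPoolSize(150);
        ex.setQueueCapacity(200); // bounded queue so overload fails fast instead of piling up
        return ex;
    }

    // Non-core pool: slow or best-effort work cannot starve core traffic.
    @Bean("asyncPool")
    public ThreadPoolTaskExecutor asyncPool() {
        ThreadPoolTaskExecutor ex = new ThreadPoolTaskExecutor();
        ex.setThreadNamePrefix("async-");
        ex.setCorePoolSize(50);
        ex.setMaxPoolSize(50);
        ex.setQueueCapacity(100);
        return ex;
    }

    // 1-second statement timeout: a slow SQL throws instead of holding a worker thread.
    @Bean
    public JdbcTemplate jdbcTemplate(DataSource dataSource) {
        JdbcTemplate jt = new JdbcTemplate(dataSource);
        jt.setQueryTimeout(1); // seconds
        return jt;
    }
}
```

The bounded queues plus the 1 s statement timeout are what turn a slow SQL from a pool-wide stall into a fast, local failure.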
Result after thread‑pool tuning
Thread pool changed from constantly full to stable
RT improved from 1–2 s to ~200 ms
3️⃣ DB Dimension – Are hidden slow SQLs the bottleneck?
Metrics to watch
Number of slow SQLs – sudden increase?
Query latency – > 1 s?
Active connection count – steady rise?
Observed DB behavior
Slow‑query log snippets:
```
Query_time: 4.3s
Query_time: 5.1s
```

Few slow queries, but they appear on high-concurrency paths
Frequently invoked, causing thread blockage
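For reference, a minimal way to surface such entries, assuming MySQL; the 1-second threshold mirrors the SQL timeout chosen earlier.

```sql
-- Enable the slow query log and set the latency threshold (illustrative values)
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;  -- log anything slower than 1 second
-- Summarize the resulting log with mysqldumpslow or pt-query-digest
```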
Diagnosis criteria
QPS drops while CPU stays normal
Threads waiting on DB
Prefer suspecting DB issues over application logic when these signs appear.
4️⃣ Integrated Troubleshooting Flow
```
QPS ↓
  ↓
Is GC causing STW pauses?
  ↓
Is the thread pool exhausted?
  ↓
Is the DB presenting slow SQL?
```

Never reverse the order.
5️⃣ Why CPU can be misleading
CPU only answers: “Is anyone using me for computation?”
It cannot tell you if the thread is paused by GC
It cannot tell you if the thread is waiting on DB
It cannot tell you if the thread is blocked on a lock
6️⃣ Takeaway
Java performance problems: 80% are not about raw compute power, but about pauses, waiting, and blocking.