How I Turned a 3‑Day Latency Nightmare into a 30‑Second Debugging Tool
After a late‑night PagerDuty alert revealed p95 latency above 5 seconds despite normal CPU, memory, and database metrics, the author spent three days tracing the issue to a thread pool capped at 10 threads, then built an open‑source CLI that automates the entire diagnosis in seconds.
Background
At 02:47 on a Tuesday the author received a PagerDuty alert: p95 latency > 5000 ms. Grafana showed normal CPU (40 %), memory (6 GB/8 GB), database CPU (30 %), and a zero error rate, yet users experienced five‑second delays. After 72 hours of log digging, metric inspection, and consulting senior engineers, the root cause was identified as an exhausted Tomcat thread pool that had been mis‑configured two years earlier.
Manual Debugging Process
Hours 1‑4: stared at Grafana dashboards, zoomed, changed time ranges, compared with the previous week – no obvious anomalies.
Hours 5‑8: downloaded logs, grepped for errors and slow queries – found nothing.
Hours 9‑16: hypothesised database, Redis, load‑balancer or network issues – all appeared healthy.
Hours 17‑24: searched Stack Overflow for “high latency low CPU”, “API slow but DB fast”, “random latency spikes”, etc. – most answers were irrelevant.
Hours 25‑48: asked senior engineers; suggestions included GC pauses, connection‑pool exhaustion, thread‑pool saturation, network loss, DNS problems.
Hours 49‑72: finally inspected the Tomcat thread‑pool configuration and discovered a maximum of 10 threads for a service handling 500 requests/minute.
Increasing the max threads to 200 instantly reduced latency.
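The article does not show the offending configuration, so the snippet below is only a sketch of what the fix could look like, assuming the pool in question is the embedded Tomcat connector pool of a Spring Boot service (which the Spring Boot findings later in the article suggest). On recent Spring Boot versions the one‑line equivalent is the server.tomcat.threads.max property.

```java
// Sketch only: raising the embedded Tomcat worker-thread cap in a Spring Boot app.
// Roughly equivalent to setting server.tomcat.threads.max=200 in application.properties;
// class and bean names here are illustrative, not taken from the article.
import org.apache.coyote.AbstractProtocol;
import org.apache.coyote.ProtocolHandler;
import org.springframework.boot.web.embedded.tomcat.TomcatServletWebServerFactory;
import org.springframework.boot.web.server.WebServerFactoryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class TomcatThreadPoolConfig {

    @Bean
    public WebServerFactoryCustomizer<TomcatServletWebServerFactory> tomcatMaxThreads() {
        return factory -> factory.addConnectorCustomizers(connector -> {
            ProtocolHandler handler = connector.getProtocolHandler();
            if (handler instanceof AbstractProtocol) {
                // The incident's pool was capped at 10; 200 matches Tomcat's usual default.
                ((AbstractProtocol<?>) handler).setMaxThreads(200);
            }
        });
    }
}
```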
Recurring Pattern
Random latency spikes appear.
Hours are spent watching dashboards.
Manual checks of common suspects are performed.
The root cause is usually one of ~20 known problems.
Fix takes about five minutes.
The issue is forgotten until it recurs.
Tool Development
Frustrated by repeated manual work, the author built a free CLI called Production Latency Debug Starter Kit. The tool connects to a Prometheus endpoint and automatically checks the most common latency culprits:
Thread‑pool saturation
Connection‑pool exhaustion
Connection leaks
Long‑tail latency (p99 vs p95)
Database slow queries
Cache issues (misses, stampedes)
GC pressure
Network timeouts
How It Works
The tool runs a series of checks against the supplied Prometheus metrics and reports any abnormal values.
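The article does not publish the tool's source, but each check plausibly reduces to a query against the Prometheus HTTP API plus a threshold. Below is a minimal sketch of a thread‑pool saturation check; the Micrometer‑style metric names and the printed‑only result handling are assumptions, not the tool's actual code.

```java
// Sketch only: one "thread-pool saturation" style check against the Prometheus
// HTTP API. Metric names and thresholds are assumptions, not the published tool's code.
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class ThreadPoolCheck {

    public static void main(String[] args) throws Exception {
        String prometheusUrl = "http://localhost:9090";
        // Ratio of busy Tomcat worker threads to the configured maximum.
        String promql = "tomcat_threads_busy_threads / tomcat_threads_config_max_threads";

        String url = prometheusUrl + "/api/v1/query?query="
                + URLEncoder.encode(promql, StandardCharsets.UTF_8);

        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        // A real implementation would parse the JSON result and flag samples above
        // a threshold (e.g. warn above 0.8); here we simply print the raw response.
        System.out.println(response.body());
    }
}
```

The published CLI wraps checks of this kind behind a single command: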
latency-debug --prometheus-url http://localhost:9090 --service api-gateway
Sample output:
✓ Thread usage: 45/200 (22%) – OK
⚠ DB connections: 98/100 (98%) – WARN
⚠ Connections >30 s: 15 – possible leak
⚠ p95: 450 ms | p99: 2300 ms – tail latency high
✓ Cache hit rate: 94% – OK
Real‑World Usage
Two weeks after releasing the tool, another latency spike (p95 = 4200 ms) occurred. Running the CLI produced:
⚠ DB connections: 195/200 (97%) – SEVERE
⚠ Connections >60 s: 47 – confirmed leak
⚠ Slow query: SELECT * FROM users … (avg = 1200 ms)
The author fixed the missing index and the connection leak, and latency dropped back to normal within ten minutes.
Lessons Learned
Most production latency issues belong to a small set of patterns (thread‑pool saturation, connection‑pool exhaustion, slow queries, connection leaks, cache problems, GC pauses, network timeouts, resource contention).
Systematically checking these areas resolves roughly 95 % of incidents.
Automation eliminates fatigue, forgetfulness, and inconsistent manual thresholds.
Most Frequent Issues (by occurrence)
Connection‑pool exhaustion (≈ 40 %)
Thread‑pool saturation (≈ 25 %)
Slow queries lacking proper indexes (≈ 15 %)
Connection leaks (≈ 10 %)
Cache invalidation problems (≈ 5 %)
Other miscellaneous issues (≈ 5 %)
Spring Boot Specific Findings
HikariCP default max connections = 10 – usually insufficient.
Tomcat default max threads = 200 – can be exhausted quickly under I/O load.
Missing @Transactional leads to connection leaks.
N+1 query problems from lazy loading multiply database requests and cause latency spikes.
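The N+1 point is the easiest to see in code. The repository below is purely illustrative (the User entity and its lazy orders collection are hypothetical, not from the article): iterating users and touching each one's orders fires one extra query per user, while a fetch join collapses the load into a single query.

```java
// Hypothetical repository illustrating the N+1 pattern and a fetch-join fix.
// Assumes a User entity with a lazy @OneToMany "orders" collection; names are invented.
import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;

public interface UserRepository extends JpaRepository<User, Long> {

    // N+1 prone: the inherited findAll() loads users only; calling user.getOrders()
    // later triggers one extra SELECT per user (500 users => 501 queries).

    // Single round-trip: users and their orders are fetched together in one query.
    @Query("select distinct u from User u left join fetch u.orders")
    List<User> findAllWithOrders();
}
```

The HikariCP cap mentioned above is, by contrast, a one‑line change via the spring.datasource.hikari.maximum-pool-size property.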