How I Turned a 3‑Day Latency Nightmare into a 30‑Second Debugging Tool
After a late‑night PagerDuty alert revealed p95 latency above 5 seconds despite normal CPU, memory, and database metrics, the author spent three days tracing the issue to a thread pool capped at 10 threads, then built an open‑source CLI that automates the entire diagnosis in seconds.
Background
At 02:47 on a Tuesday the author received a PagerDuty alert: p95 latency > 5000 ms. Grafana showed normal CPU (40 %), memory (6 GB/8 GB), database CPU (30 %), and a zero error rate, yet users experienced five‑second delays. After 72 hours of log digging, metric inspection, and consulting senior engineers, the root cause was identified as an exhausted Tomcat thread pool that had been mis‑configured two years earlier.
Manual Debugging Process
Hours 1‑4: stared at Grafana dashboards, zoomed, changed time ranges, compared with the previous week – no obvious anomalies.
Hours 5‑8: downloaded logs, grepped for errors and slow queries – found nothing.
Hours 9‑16: hypothesised database, Redis, load‑balancer or network issues – all appeared healthy.
Hours 17‑24: searched Stack Overflow for “high latency low CPU”, “API slow but DB fast”, “random latency spikes”, etc. – most answers were irrelevant.
Hours 25‑48: asked senior engineers; suggestions included GC pauses, connection‑pool exhaustion, thread‑pool saturation, network loss, DNS problems.
Hours 49‑72: finally inspected the Tomcat thread‑pool configuration and discovered a maximum of 10 threads for a service handling 500 requests/minute.
Increasing the max threads to 200 instantly reduced latency.
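The article does not show the offending configuration, so the snippet below is only a sketch of what the fix could look like, assuming the pool in question is the embedded Tomcat connector pool of a Spring Boot service (which the Spring Boot findings later in the article suggest). On recent Spring Boot versions the one‑line equivalent is the server.tomcat.threads.max property.

```java
// Sketch only: raising the embedded Tomcat worker-thread cap in a Spring Boot app.
// Roughly equivalent to setting server.tomcat.threads.max=200 in application.properties;
// class and bean names here are illustrative, not taken from the article.
import org.apache.coyote.AbstractProtocol;
import org.apache.coyote.ProtocolHandler;
import org.springframework.boot.web.embedded.tomcat.TomcatServletWebServerFactory;
import org.springframework.boot.web.server.WebServerFactoryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class TomcatThreadPoolConfig {

    @Bean
    public WebServerFactoryCustomizer<TomcatServletWebServerFactory> tomcatMaxThreads() {
        return factory -> factory.addConnectorCustomizers(connector -> {
            ProtocolHandler handler = connector.getProtocolHandler();
            if (handler instanceof AbstractProtocol) {
                // The incident's pool was capped at 10; 200 matches Tomcat's usual default.
                ((AbstractProtocol<?>) handler).setMaxThreads(200);
            }
        });
    }
}
```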
Recurring Pattern
Random latency spikes appear.
Hours are spent watching dashboards.
Manual checks of common suspects are performed.
The root cause is usually one of ~20 known problems.
Fix takes about five minutes.
The issue is forgotten until it recurs.
Tool Development
Frustrated by repeated manual work, the author built a free CLI called Production Latency Debug Starter Kit. The tool connects to a Prometheus endpoint and automatically checks the most common latency culprits:
Thread‑pool saturation
Connection‑pool exhaustion
Connection leaks
Long‑tail latency (p99 vs p95)
Database slow queries
Cache issues (misses, stampedes)
GC pressure
Network timeouts
How It Works
The tool runs a series of checks against the supplied Prometheus metrics and reports any abnormal values.
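The article does not publish the tool's source, but each check plausibly reduces to a query against the Prometheus HTTP API plus a threshold. Below is a minimal sketch of a thread‑pool saturation check; the Micrometer‑style metric names and the printed‑only result handling are assumptions, not the tool's actual code.

```java
// Sketch only: one "thread-pool saturation" style check against the Prometheus
// HTTP API. Metric names and thresholds are assumptions, not the published tool's code.
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class ThreadPoolCheck {

    public static void main(String[] args) throws Exception {
        String prometheusUrl = "http://localhost:9090";
        // Ratio of busy Tomcat worker threads to the configured maximum.
        String promql = "tomcat_threads_busy_threads / tomcat_threads_config_max_threads";

        String url = prometheusUrl + "/api/v1/query?query="
                + URLEncoder.encode(promql, StandardCharsets.UTF_8);

        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        // A real implementation would parse the JSON result and flag samples above
        // a threshold (e.g. warn above 0.8); here we simply print the raw response.
        System.out.println(response.body());
    }
}
```

The published CLI wraps checks of this kind behind a single command: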
latency-debug --prometheus-url http://localhost:9090 --service api-gateway
Sample output:
✓ Thread usage: 45/200 (22%) – OK
⚠ DB connections: 98/100 (98%) – WARN
⚠ Connections >30 s: 15 – possible leak
⚠ p95: 450 ms | p99: 2300 ms – tail latency high
✓ Cache hit rate: 94% – OK
Real‑World Usage
Two weeks after releasing the tool, another latency spike (p95 = 4200 ms) occurred. Running the CLI produced:
⚠ DB connections: 195/200 (97%) – SEVERE
⚠ Connections >60 s: 47 – confirmed leak
⚠ Slow query: SELECT * FROM users … (avg = 1200 ms)
The author fixed the missing index and the connection leak, and latency dropped back to normal within ten minutes.
Lessons Learned
Most production latency issues belong to a small set of patterns (thread‑pool saturation, connection‑pool exhaustion, slow queries, connection leaks, cache problems, GC pauses, network timeouts, resource contention).
Systematically checking these areas resolves roughly 95 % of incidents.
Automation eliminates fatigue, forgetfulness, and inconsistent manual thresholds.
Most Frequent Issues (by occurrence)
Connection‑pool exhaustion (≈ 40 %)
Thread‑pool saturation (≈ 25 %)
Slow queries lacking proper indexes (≈ 15 %)
Connection leaks (≈ 10 %)
Cache invalidation problems (≈ 5 %)
Other miscellaneous issues (≈ 5 %)
Spring Boot Specific Findings
HikariCP default max connections = 10 – usually insufficient.
Tomcat default max threads = 200 – can be exhausted quickly under I/O load.
Missing @Transactional leads to connection leaks.
N+1 query problems from lazy loading multiply database requests and cause latency spikes.
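The N+1 point is the easiest to see in code. The repository below is purely illustrative (the User entity and its lazy orders collection are hypothetical, not from the article): iterating users and touching each one's orders fires one extra query per user, while a fetch join collapses the load into a single query.

```java
// Hypothetical repository illustrating the N+1 pattern and a fetch-join fix.
// Assumes a User entity with a lazy @OneToMany "orders" collection; names are invented.
import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;

public interface UserRepository extends JpaRepository<User, Long> {

    // N+1 prone: the inherited findAll() loads users only; calling user.getOrders()
    // later triggers one extra SELECT per user (500 users => 501 queries).

    // Single round-trip: users and their orders are fetched together in one query.
    @Query("select distinct u from User u left join fetch u.orders")
    List<User> findAllWithOrders();
}
```

The HikariCP cap mentioned above is, by contrast, a one‑line change via the spring.datasource.hikari.maximum-pool-size property.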