
Root Cause Analysis of Repeated Backend Service Outages: Connection Pool Exhaustion and Slow SQL

During a Monday incident in which the backend service became unavailable three times, the author walks through the investigation step by step: checking frontend responses, container status, JVM and thread-pool metrics, and database connection-pool usage. The culprit turned out to be a slow, unindexed SQL query that exhausted the connection pool; the post closes with the remediation and lessons learned.

IT Services Circle

In this post the author, a developer named 鱼皮, recounts a recent incident where the backend service was unavailable three times in a single day, with the longest outage lasting forty minutes.

First Investigation

Locating the Problem

Initial checks showed the frontend assets loaded quickly while the backend requests remained pending. Monitoring the container platform revealed an average response time of 21 seconds, but QPS, CPU, and memory metrics were normal.

Further inspection of the interface monitoring platform showed a per‑minute response time of 16.2 s, prompting a deeper dive into JVM metrics.

JVM Monitoring

The JVM console showed a full GC occurring roughly every five minutes, causing frequent stop-the-world pauses.

Thread‑Pool Monitoring

Thread‑pool metrics indicated that every worker thread was busy and tasks were piling up in the queue, suggesting the workers were blocked on something.
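This symptom can be reproduced in a small sketch (all names here are hypothetical, not from the incident): a fixed pool whose workers all block on a slow resource, so every additional task queues behind them.

```java
import java.util.concurrent.*;

public class PoolSaturationDemo {
    // Saturates a 4-thread pool with 10 tasks that all block on a latch,
    // then reports how many threads are active and how many tasks queued.
    static String saturate() throws Exception {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < 10; i++) {
            pool.submit(() -> {
                try { release.await(); } catch (InterruptedException ignored) {}
            });
        }
        Thread.sleep(200); // let the 4 workers pick up tasks
        String snapshot = "active=" + pool.getActiveCount()
                + " queued=" + pool.getQueue().size();
        release.countDown(); // unblock the workers so the pool can drain
        pool.shutdown();
        return snapshot;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(saturate()); // active=4 queued=6
    }
}
```

With 4 workers all parked on the latch, the remaining 6 tasks sit in the queue, which is exactly the "all active, many queued" picture the monitoring showed.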

Database Connection‑Pool Monitoring

Database connection‑pool graphs showed the pool was completely exhausted, hinting that requests were waiting for connections.
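A saturated pool can be modeled as a semaphore (a hypothetical simplification, not HikariCP's actual implementation): 10 permits, matching HikariCP's default maximum-pool-size, all held by slow queries, so the next request waits until its timeout expires.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class ConnectionWaitDemo {
    // Models pool exhaustion: all 10 "connections" are busy running slow SQL,
    // so a new caller waits for its timeout and then gives up.
    static boolean tryGetConnection() throws InterruptedException {
        Semaphore pool = new Semaphore(10);
        pool.acquire(10); // every connection held by a slow query
        return pool.tryAcquire(100, TimeUnit.MILLISECONDS); // waits, then fails
    }

    public static void main(String[] args) throws Exception {
        System.out.println(tryGetConnection()
                ? "got connection" : "timed out waiting for connection");
    }
}
```

In the real system the caller blocks for HikariCP's connectionTimeout (30 s by default) before failing, which is why request latency ballooned rather than erroring out immediately.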

Temporary Fix

To quickly restore service, the author raised HikariCP's maximum pool size to 20 in application.yml (note that Spring Boot reads Hikari settings under spring.datasource, not directly under spring):

spring:
  datasource:
    hikari:
      maximum-pool-size: 20

After redeploying, the service became responsive.

Second Investigation

Shortly after, the service stalled again. The connection‑pool again filled up, and thread dumps showed many threads in TIMED_WAITING state, offering no clear root cause.
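The TIMED_WAITING state in the dumps is consistent with threads waiting on a resource with a timeout. A minimal illustration (hypothetical, not the incident's code) shows how a thread parked in a timed wait reports that state:

```java
public class TimedWaitingDemo {
    // A thread sleeping (or otherwise parked with a timeout) shows up as
    // TIMED_WAITING in jstack output, just like threads waiting on a
    // connection acquire that has a timeout attached.
    static Thread.State sampleState() throws InterruptedException {
        Thread t = new Thread(() -> {
            try { Thread.sleep(10_000); } catch (InterruptedException ignored) {}
        });
        t.start();
        Thread.sleep(200); // give it time to enter the timed sleep
        Thread.State s = t.getState();
        t.interrupt();     // don't leave the thread hanging
        return s;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sampleState()); // TIMED_WAITING
    }
}
```

The state alone does not say what the thread is waiting for, which matches the article's point that the dumps offered no clear root cause by themselves.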

Third Investigation

Consulting a senior teammate revealed a slow SQL query executed over 7,000 times with an average latency of 1.4 s. The query lacked an index on the scene column, causing full‑table scans during each poll for WeChat login status.

Solution

An index was added to the scene column, which immediately reduced connection‑pool usage and restored normal performance. EXPLAIN plans confirmed the query now uses the index.
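On MySQL, the fix takes roughly this form (the table name and literal value below are illustrative; the article only names the scene column):

```sql
-- Add an index so lookups by scene no longer require a full-table scan.
ALTER TABLE login_ticket ADD INDEX idx_scene (scene);

-- Re-check the plan; the access type should change from ALL (full scan)
-- to ref, with idx_scene listed as the chosen key.
EXPLAIN SELECT * FROM login_ticket WHERE scene = 'example-scene';
```

With 7,000+ executions per day at 1.4 s each, each saved full scan also frees a pooled connection sooner, which is why pool usage dropped immediately.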

The author reflects on the importance of monitoring slow SQL, proper indexing, and having sufficient connection‑pool capacity, as well as the need for automated scaling based on resource utilization.

Summary and Lessons Learned

When a service hangs, quickly add capacity (e.g., a new instance) before deep investigation.

Increasing the DB connection pool can be a stop‑gap, but understanding the underlying cause is essential.

Never stop investigating after a temporary fix; continue monitoring for hidden issues.

Familiarize yourself with troubleshooting patterns for thread‑pool exhaustion and connection‑pool saturation.

Overall, the incident highlights the need for proactive performance monitoring, proper indexing of frequently queried fields, and readiness to scale resources automatically.

Tags: Backend, Debugging, JVM, Performance, Database, Connection Pool, Slow SQL
Written by

IT Services Circle

Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.
