Why Did Our Backend Freeze? A Deep Dive into Connection‑Pool Exhaustion and Slow SQL
A detailed post‑mortem of three successive service outages reveals how hidden bugs, frequent Full GC, a saturated connection pool, and an unindexed slow SQL query crippled a Spring Boot backend, and walks through the step‑by‑step troubleshooting, temporary fixes, and lasting improvements.
First Investigation
Problem Identification
1. Log into the website to confirm the outage. Front‑end resources responded quickly, but back‑end requests remained pending.
2. Open the container platform to check service status; average response time was about 21 seconds.
3. QPS, memory, and CPU looked normal, so the issue was not load‑related.
4. The monitoring platform reported a per‑minute average response time of 16.2 seconds.
5. JVM monitoring revealed a Full GC every five minutes, which paused the application.
6. Thread‑pool monitoring showed all threads busy and many queued.
7. Database‑connection‑pool monitoring indicated the pool was full.
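The Full GC cadence noted in step 5 can be estimated directly from GC‑log timestamps. A minimal Python sketch with illustrative numbers (the timestamps are invented, not taken from the incident logs):

```python
# Hypothetical Full GC timestamps in seconds since JVM start,
# as they would appear in a GC log. Values are illustrative only.
full_gc_times = [12.4, 312.9, 611.7, 910.2]

# Interval between consecutive Full GCs.
intervals = [b - a for a, b in zip(full_gc_times, full_gc_times[1:])]
avg = sum(intervals) / len(intervals)

print(round(avg))  # ~299 s, i.e. a Full GC roughly every five minutes
```

An average interval near 300 seconds matches the "Full GC every five minutes" pattern seen in JVM monitoring.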
Temporary Fix
Increasing the HikariCP maximum pool size to 20 in application.yml restored service quickly.
spring:
  datasource:
    hikari:
      maximum-pool-size: 20
Second Investigation
After the first fix, the service stalled again. The connection pool again reached its limit, and jstack showed many threads in TIMED_WAITING, mirroring the first incident.
The temporary fix was to redeploy the service, which cleared the pool.
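A quick way to confirm that most worker threads are parked, as jstack showed in both incidents, is to tally thread states across the dump. A minimal Python sketch over a hypothetical dump excerpt (thread names and counts are illustrative):

```python
from collections import Counter
import re

# Hypothetical excerpt of a jstack dump; a real one would come from
# `jstack <pid> > dump.txt` on the stalled JVM.
dump = """
"http-nio-8080-exec-1" #32 daemon prio=5
   java.lang.Thread.State: TIMED_WAITING (parking)
"http-nio-8080-exec-2" #33 daemon prio=5
   java.lang.Thread.State: TIMED_WAITING (parking)
"main" #1 prio=5
   java.lang.Thread.State: RUNNABLE
"""

# Count how many threads are in each state.
states = Counter(re.findall(r"java\.lang\.Thread\.State: (\w+)", dump))
print(states.most_common())  # [('TIMED_WAITING', 2), ('RUNNABLE', 1)]
```

A dump dominated by TIMED_WAITING request threads usually means they are all blocked waiting on a shared resource, such as a connection pool.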
Third Investigation
A senior engineer suggested checking for slow SQL. A query that had executed more than 7,000 times, averaging 1.4 seconds per execution, was identified.
The query lacked an index on the scene column, causing a full‑table scan during each WeChat QR‑code login poll.
Adding an index on scene immediately restored normal response times and reduced connection‑pool usage.
Explain plans confirmed the query now used the index.
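The before‑and‑after effect of the index can be reproduced in miniature. The sketch below uses SQLite as a stand‑in for the production database (which was likely MySQL); the table name login_ticket is hypothetical, and only the scene column comes from the article:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE login_ticket (id INTEGER PRIMARY KEY, scene TEXT, status INTEGER)"
)

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail);
    # the detail column describes the access path.
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM login_ticket WHERE scene = 'abc'"

before = plan(query)  # without an index: a full-table SCAN
conn.execute("CREATE INDEX idx_login_ticket_scene ON login_ticket (scene)")
after = plan(query)   # with the index: SEARCH ... USING INDEX

print(before)
print(after)
```

The same check on MySQL would use `EXPLAIN`, where the fix shows up as the access type changing from ALL (full scan) to ref (index lookup).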
Key Takeaways
When a service stalls, restore availability first (for example, by adding an instance) before digging into root causes.
Increasing the DB connection pool can be a quick fix; HikariCP's default maximum of 10 connections may be too conservative for some workloads.
Temporary fixes do not solve root causes; continuous monitoring is essential.
Understanding how to locate thread‑pool saturation and slow SQL is critical for efficient troubleshooting.
Ultimately, the incident highlighted insufficient experience among developers and the importance of indexing frequently queried columns, proper connection‑pool sizing, and proactive performance testing.
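On pool sizing specifically, the HikariCP "About Pool Sizing" wiki suggests starting from the formula connections = (core_count × 2) + effective_spindle_count rather than picking an arbitrary large number. A one‑function sketch:

```python
# Starting-point formula from the HikariCP "About Pool Sizing" wiki:
#   pool_size = (core_count * 2) + effective_spindle_count
# Treat the result as a baseline to tune under load, not a hard rule.
def suggested_pool_size(core_count, effective_spindle_count=1):
    return core_count * 2 + effective_spindle_count

print(suggested_pool_size(4))  # 9 for a 4-core host with one disk
```

The point of the formula is that a pool much larger than this often hurts throughput through contention, so "bigger pool" is a stopgap, not a tuning strategy.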
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.