How a Misconfigured HttpClient Connection Pool Triggered a Snowballing Thread Crash
The article recounts a real‑world incident where a Java HttpClient connection‑pool misconfiguration limited each host to only two connections, causing massive thread queuing, CPU spikes, and ultimately a cascade of instance crashes during traffic scaling.
1. Event Background
I was maintaining an online real‑time API service that started showing "Address already in use (Bind failed)" errors due to a large number of TIME_WAIT connections occupying ports (up to 60,000). To reduce port exhaustion, I introduced a connection pool for HttpClient.
However, the pool introduced a new problem.
2. Problem Process
Peak traffic was 12,000 requests per minute, or about 200 requests per second. With an average response time of 1.3 s, that works out to roughly 260 requests in flight at any moment (200 × 1.3). Logs showed that establishing each connection took roughly 1.1 s, so we set the maximum pool size to around 500 to leave headroom.
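The sizing arithmetic above is an application of Little's Law (average concurrency = arrival rate × average time per request). A minimal sketch, using only the traffic figures quoted in the text; the class and method names are made up for illustration:

```java
public class PoolSizing {
    // Little's Law: requests in flight = arrival rate (per second) * average response time (seconds).
    static double concurrentRequests(double requestsPerMinute, double avgResponseSeconds) {
        double requestsPerSecond = requestsPerMinute / 60.0;
        return requestsPerSecond * avgResponseSeconds;
    }

    public static void main(String[] args) {
        double requestsPerSecond = 12_000 / 60.0;             // peak traffic from the text
        double inFlight = concurrentRequests(12_000, 1.3);    // 1.3 s average response time
        System.out.printf("requests/s=%.0f, in flight=%.0f%n", requestsPerSecond, inFlight);
    }
}
```

The result is an average, so a pool sized exactly to it would still queue during bursts, which is one reason to leave headroom above it.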
After deploying the MultiThreadedHttpConnectionManager in a small‑scale test, performance improved, but once the change was rolled out to the whole Beijing data center, the system began to fail.
3. Incident Review
Following the traffic switch, users reported that the live page could not be opened. A first look at monitoring and logs showed:
Business traffic was normal, with a slight surge in network card traffic on some machines.
Response time increased sharply.
Business logs showed no clear errors and no obvious service‑side timeouts.
9 out of 30 instances (6 in Beijing, 3 in Nanjing) were dead.
4. Deep Investigation
CPU usage of Java processes was nearly ten times higher than usual, and thread counts spiked beyond the container limit of 2000, causing the virtualization platform to kill the instances.
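A thread count creeping toward a hard limit is something the JVM can report on itself. The following is a minimal sketch using the standard ThreadMXBean; the 2000 limit comes from the incident, while the 80% warning threshold and the class name are assumptions chosen for illustration. Note that the platform may count threads differently from the JVM, so the number is an approximation.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadWatch {
    // Number of live threads (daemon and non-daemon) in this JVM.
    static int liveThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        return bean.getThreadCount();
    }

    // True once the count has reached the given fraction of the hard limit.
    static boolean nearLimit(int count, int limit, double fraction) {
        return count >= limit * fraction;
    }

    public static void main(String[] args) {
        int limit = 2000;      // container thread limit mentioned in the incident
        double warnAt = 0.8;   // hypothetical alert threshold
        int count = liveThreads();
        if (nearLimit(count, limit, warnAt)) {
            System.err.println("WARN: " + count + " live threads, limit is " + limit);
        } else {
            System.out.println(count + " live threads, limit is " + limit);
        }
    }
}
```

In a real service this check would run periodically and feed an alert, so that the buildup is visible before the platform starts killing instances.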
Rolling back a subset of instances reduced TCP connection concurrency dramatically, confirming the connection‑pool settings as the root cause.
JStack logs revealed many threads waiting for a connection from the pool, leading to thread accumulation, higher response times, and a vicious cycle that eventually exhausted thread limits.
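The queuing effect can be reproduced with a toy model. This sketch does not use HttpClient at all: a fair Semaphore stands in for the connection pool, and Thread.sleep stands in for the remote call. The pool sizes, thread count, and hold time are arbitrary values chosen for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;

public class PoolQueueDemo {
    /**
     * Starts `threads` simulated requests that each need one of `permits`
     * pooled connections for `holdMillis`, and returns the wall-clock time
     * in milliseconds until all of them have finished.
     */
    static long runRequests(int permits, int threads, long holdMillis) {
        Semaphore pool = new Semaphore(permits, true); // stands in for the connection pool
        CountDownLatch done = new CountDownLatch(threads);
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            new Thread(() -> {
                try {
                    pool.acquire();                   // request thread blocks here when the pool is empty
                    try {
                        Thread.sleep(holdMillis);     // simulated remote call
                    } finally {
                        pool.release();               // hand the connection back to the pool
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }).start();
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for requests", e);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("pool of 2:  " + runRequests(2, 20, 50) + " ms");
        System.out.println("pool of 20: " + runRequests(20, 20, 50) + " ms");
    }
}
```

With 20 requests sharing 2 connections, the requests have to run in about ten rounds, so the last caller waits roughly ten times longer than the call itself takes. In a live service new requests keep arriving during that wait, which is how blocked threads accumulate.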
Further source‑code analysis showed that the MultiThreadedHttpConnectionManager checks both maxTotalConnections and maxHostConnections. The default maxHostConnections is 2 unless setDefaultMaxConnectionsPerHost is used.
Because the per‑host limit remained at its default of 2, all request threads together could hold only two concurrent connections to any one host, no matter how large the total pool was. That was the bottleneck we observed.
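A configuration that sets both limits explicitly might look like the sketch below. It assumes the Commons HttpClient 3.x API, which is where MultiThreadedHttpConnectionManager and setDefaultMaxConnectionsPerHost come from; the factory class, the timeout values, and the idea of bounding the wait for a pooled connection are illustrative assumptions, not the settings used in the incident.

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.commons.httpclient.params.HttpConnectionManagerParams;

public class PooledClientFactory {
    // Builds a pooled client with both the total and the per-host limit set explicitly.
    static HttpClient newClient(int maxTotal, int maxPerHost) {
        HttpConnectionManagerParams params = new HttpConnectionManagerParams();
        params.setMaxTotalConnections(maxTotal);             // upper bound across all hosts
        params.setDefaultMaxConnectionsPerHost(maxPerHost);  // left unset, this stays at 2
        params.setConnectionTimeout(2000);                   // hypothetical connect timeout, ms
        params.setSoTimeout(3000);                           // hypothetical read timeout, ms

        MultiThreadedHttpConnectionManager manager = new MultiThreadedHttpConnectionManager();
        manager.setParams(params);

        HttpClient client = new HttpClient(manager);
        // Hypothetical: fail after 500 ms instead of blocking indefinitely
        // when no pooled connection is available.
        client.getParams().setConnectionManagerTimeout(500);
        return client;
    }
}
```

If nearly all traffic goes to a single upstream host, the per‑host limit is effectively the real pool size, so a call such as newClient(500, 500) is one reasonable starting point to validate under load. Bounding the wait for a pooled connection turns silent thread buildup into visible errors.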
5. Incident Summary
The failure unfolded as a chain:
Connection‑pool parameter misconfiguration limited each host to 2 connections.
Numerous request threads queued for a pool connection, causing thread buildup.
Thread buildup increased response time and resource consumption, further aggravating the queue.
Thread count exceeded limits, and the virtualization platform killed the instances.
Failed instances forced traffic to surviving instances, creating a snowball effect.
To avoid similar issues, the author suggests three preventive measures:
Thoroughly read official documentation before upgrading any technology.
Reference high‑quality open‑source projects for best‑practice implementations.
Conduct offline load testing with controlled variables to expose problems early.
Additionally, a detailed load‑testing plan is proposed to compare performance with and without connection pooling, with and without setDefaultMaxConnectionsPerHost, and by adjusting total and per‑host connection limits, while monitoring thread count, CPU usage, TCP connections, port usage, and memory.
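One way to keep such a plan controlled is to enumerate the scenarios up front so that each run changes a single variable. The sketch below only generates the scenario list; the class name, the candidate limits, and the label format are hypothetical, and the actual load generation and metric collection are out of scope.

```java
import java.util.ArrayList;
import java.util.List;

public class LoadTestMatrix {
    // Builds the list of configurations to test: a no-pool baseline, then for each
    // total limit one run with the default per-host limit and one per explicit value.
    static List<String> scenarios(int[] totals, int[] perHosts) {
        List<String> out = new ArrayList<>();
        out.add("no pool (baseline)");
        for (int total : totals) {
            out.add("pool total=" + total + ", per-host=default");
            for (int perHost : perHosts) {
                if (perHost <= total) { // a per-host limit above the total adds nothing
                    out.add("pool total=" + total + ", per-host=" + perHost);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String metrics = "threads, CPU, TCP connections, ports, memory";
        for (String s : scenarios(new int[] {100, 500}, new int[] {50, 500})) {
            System.out.println(s + " -> record: " + metrics);
        }
    }
}
```

Running every scenario under the same traffic profile makes it possible to attribute a change in thread count or response time to one setting rather than to a combination.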
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
