
Root Cause Analysis of OOM Caused by Misused HttpClient evictExpiredConnections and Keep‑Alive Issues

The article recounts a production OOM incident triggered by a misconfigured HttpClient evictExpiredConnections setting that caused uncontrolled thread growth, explains the underlying keep‑alive TCP behavior and NoHttpResponseException, and presents the corrective measures of using a singleton HttpClient and proper connection‑pool monitoring.

Full-Stack Internet Architecture

Last night an internal APM system raised a flood of alerts: four production machines had run out of memory (OOM) and the service had become unavailable. Operations restarted the machines, and the logs confirmed that OOM was what had taken them down.

The investigation showed the thread count climbing rapidly to around 30,000, against a normal level of roughly 600. The spike coincided with a recent deployment that added the evictExpiredConnections configuration to HttpClient initialization.

The added configuration was intended to address frequent NoHttpResponseException errors by proactively evicting idle connections. However, each request created a new HttpClient instance, and each instance started its own background thread for eviction, leading to a massive number of threads and eventual OOM.
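The thread growth described above can be reproduced in miniature without the real library. The sketch below is only an illustration: EvictingClient is a hypothetical stand-in for a client that owns one background eviction thread, and ThreadLeakDemo, the request count of 50, and the sleep interval are all assumptions made for the demo.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a client that owns one background eviction thread. */
final class EvictingClient implements AutoCloseable {
    static final AtomicInteger LIVE_EVICTORS = new AtomicInteger();
    private final Thread evictor;

    EvictingClient() {
        LIVE_EVICTORS.incrementAndGet();
        evictor = new Thread(() -> {
            try {
                while (true) {
                    Thread.sleep(1_000); // a real evictor would close expired connections here
                }
            } catch (InterruptedException stopped) {
                // close() interrupts the thread so that it can exit
            } finally {
                LIVE_EVICTORS.decrementAndGet();
            }
        }, "evictor");
        evictor.setDaemon(true);
        evictor.start();
    }

    String execute(String request) {
        return "response to " + request;
    }

    @Override
    public void close() {
        evictor.interrupt();
        try {
            evictor.join(); // wait until the evictor thread has really gone
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

final class ThreadLeakDemo {
    /** Returns how many evictor threads are alive once all requests have been served. */
    static int evictorsAfter(int requests, boolean shareClient) {
        int before = EvictingClient.LIVE_EVICTORS.get();
        List<EvictingClient> clients = new ArrayList<>();
        EvictingClient shared = shareClient ? new EvictingClient() : null;
        if (shared != null) {
            clients.add(shared);
        }
        for (int i = 0; i < requests; i++) {
            EvictingClient client = shared;
            if (client == null) {
                client = new EvictingClient(); // anti-pattern: one client per request
                clients.add(client);
            }
            client.execute("req-" + i);
        }
        int alive = EvictingClient.LIVE_EVICTORS.get() - before;
        clients.forEach(EvictingClient::close); // cleanup so the demo itself does not leak
        return alive;
    }

    public static void main(String[] args) {
        System.out.println("client per request: " + evictorsAfter(50, false) + " evictor threads");
        System.out.println("shared client:      " + evictorsAfter(50, true) + " evictor thread(s)");
    }
}
```

In this model the per-request variant keeps one evictor thread alive for every request served, while the shared variant keeps exactly one, which mirrors the shape of the incident even though the real numbers were far larger.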

To understand the problem, it helps to recall the TCP three‑way handshake and four‑way termination, and why HTTP keep‑alive (connection reuse) matters. Keep‑alive avoids the overhead of establishing a new TCP connection for every request, but idle connections continue to occupy resources on both ends unless they are timed out properly.
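A rough back-of-the-envelope model makes the saving concrete. The sketch below counts only the connection-control segments (three for the handshake, four for the termination) and ignores data segments, retransmissions, and the common case where a FIN and an ACK share one segment. The class and method names are hypothetical.

```java
/** Simplified cost model: counts only TCP control segments, not data. */
final class HandshakeCost {
    static final int HANDSHAKE_SEGMENTS = 3; // SYN, SYN-ACK, ACK
    static final int TEARDOWN_SEGMENTS = 4;  // FIN, ACK, FIN, ACK

    /** Every request opens and closes its own connection. */
    static long withoutKeepAlive(long requests) {
        return requests * (HANDSHAKE_SEGMENTS + TEARDOWN_SEGMENTS);
    }

    /** All requests share a single kept-alive connection. */
    static long withKeepAlive(long requests) {
        return requests == 0 ? 0 : HANDSHAKE_SEGMENTS + TEARDOWN_SEGMENTS;
    }

    public static void main(String[] args) {
        long requests = 1_000;
        System.out.println("without keep-alive: " + withoutKeepAlive(requests) + " control segments");
        System.out.println("with keep-alive:    " + withKeepAlive(requests) + " control segments");
    }
}
```

Under these assumptions, 1,000 requests cost 7,000 control segments without keep‑alive and 7 with it.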

When a server times out an idle connection and closes it with a FIN, a client that is unaware of the close and reuses the same TCP socket may receive an RST, which surfaces as NoHttpResponseException. Two mitigation strategies are suggested: retrying the request a few times, or periodically cleaning up idle connections.
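The first strategy, retrying, can be sketched with the standard library alone. The helper below is a hypothetical illustration rather than the service's actual code: it retries on any IOException, on the assumption that the stale-connection error is reported as one (in Apache HttpClient, NoHttpResponseException is an IOException subclass). Blind retries are only safe for idempotent requests.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper: retries a call a bounded number of times on I/O failure. */
final class StaleConnectionRetry {
    static <T> T callWithRetry(Callable<T> call, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                // e.g. the server had already closed the pooled connection
                last = e;
            }
        }
        throw last; // every attempt failed; surface the most recent error
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String body = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated stale connection");
            }
            return "200 OK";
        }, 3);
        System.out.println(body + " after " + calls[0] + " attempts");
    }
}
```

Non-I/O exceptions propagate immediately, so programming errors are not masked by the retry loop.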

The evictExpiredConnections option implements the second strategy by launching a background cleanup thread for the HttpClient instance it is configured on. The library documentation describes this kind of eviction as follows:

"Makes this instance of HttpClient proactively evict idle connections from the connection pool using a background thread."
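A minimal sketch of what such usage might look like follows, assuming Apache HttpClient 4.5.x (the article does not state the version). The class name SharedHttpClient, the pool sizes, and the 30-second idle limit are illustrative assumptions, not values from the incident.

```java
import java.util.concurrent.TimeUnit;

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

/** Hypothetical holder that builds the client once and shares it across requests. */
public final class SharedHttpClient {
    private static final CloseableHttpClient INSTANCE = build();

    private SharedHttpClient() {
    }

    private static CloseableHttpClient build() {
        PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();
        pool.setMaxTotal(200);          // illustrative pool size
        pool.setDefaultMaxPerRoute(50); // illustrative per-host limit

        return HttpClients.custom()
                .setConnectionManager(pool)
                .evictExpiredConnections()                   // starts the background evictor
                .evictIdleConnections(30L, TimeUnit.SECONDS) // illustrative idle limit
                .build();
    }

    /** Every caller reuses the same client, so the evictor is started only once. */
    public static CloseableHttpClient get() {
        return INSTANCE;
    }
}
```

Because the client is built once, the eviction thread is started once for the lifetime of the process. The client should still be closed at application shutdown so that the thread is released.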

Because the application created a new HttpClient per request, each request also started a new cleanup thread, causing the thread explosion and OOM across all four machines (which had identical load balancer weights).

The resolution involved refactoring the code to use a singleton HttpClient, ensuring only one cleanup thread runs, and adding monitoring to alert when thread counts exceed a safe threshold. This prevented further OOM incidents and highlighted the critical role of comprehensive monitoring.
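The thread-count alert can be approximated in-process with the JDK's management API. This is a sketch only: the class name ThreadCountMonitor and the threshold of 1,500 are assumptions chosen to sit between the normal level of about 600 and the 30,000 seen during the incident, and a real setup would more likely export the value to the APM system than print it.

```java
import java.lang.management.ManagementFactory;

/** Hypothetical check that flags a JVM whose live thread count exceeds a threshold. */
final class ThreadCountMonitor {
    private final int threshold;

    ThreadCountMonitor(int threshold) {
        if (threshold <= 0) {
            throw new IllegalArgumentException("threshold must be positive");
        }
        this.threshold = threshold;
    }

    /** Live threads in this JVM, daemon and non-daemon. */
    static int liveThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    boolean isOverThreshold(int threadCount) {
        return threadCount > threshold;
    }

    public static void main(String[] args) {
        ThreadCountMonitor monitor = new ThreadCountMonitor(1_500); // assumed threshold
        int count = liveThreads();
        if (monitor.isOverThreshold(count)) {
            System.err.println("ALERT: " + count + " live threads");
        } else {
            System.out.println("OK: " + count + " live threads");
        }
    }
}
```

Running the check on a schedule, for example once a minute, would have surfaced the climb well before the count reached 30,000.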

In summary, the case demonstrates the need for deep understanding of third‑party libraries, network fundamentals, and proactive monitoring to effectively troubleshoot performance and reliability issues in backend services.

backend, Java, performance, Keep-Alive, OOM, HttpClient