
Analysis of TCP Connection Failures Caused by ARP Queue Length (unres_qlen) in Linux Kernels

The article investigates intermittent TCP connection failures during application server startup caused by the Linux kernel ARP queue length parameter unres_qlen, reproduces the issue with a concurrent connection test, analyzes kernel internals, and recommends increasing unres_qlen for kernels prior to 3.3.

Aikesheng Open Source Community

Background: In a production environment we observed that when an application server starts and creates a connection pool to a backend database, some connections occasionally fail to establish. Investigation revealed that the issue is related to the kernel ARP parameter unres_qlen.

Reproduction environment: OS RHEL 6.6, kernel 2.6.32-504.el6.x86_64. A test program runs on a client machine (10.0.0.102) that concurrently initiates 16 TCP connections with a 500 ms timeout to a server (10.0.0.101).
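The original test program is not shown, so the following is only a minimal sketch of what such a test could look like. The function names, the use of a thread pool, and the port in the usage note are assumptions; the connection count and the 500 ms timeout come from the description above.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def try_connect(host, port, timeout):
    """Attempt one TCP connection; True on success, False on timeout or error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # includes timeouts and refused connections
        return False


def connect_concurrently(host, port, count=16, timeout=0.5):
    """Start `count` connection attempts at roughly the same time."""
    with ThreadPoolExecutor(max_workers=count) as pool:
        futures = [pool.submit(try_connect, host, port, timeout)
                   for _ in range(count)]
        return [f.result() for f in futures]
```

Calling connect_concurrently("10.0.0.101", 3306) from the client right after clearing its ARP cache would exercise the scenario; the port 3306 is a placeholder for whatever the backend database listens on.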

Phenomenon: After clearing the ARP cache on the client, only three of the sixteen connections succeed; the remaining thirteen time out. Clearing the ARP cache again and re-running the test reproduces the timeouts, while runs made without clearing the cache first succeed.

Problem analysis: Packet capture on the server shows that only three SYN packets are received; the other thirteen never appear. Dropwatch logs record 13 packet drops in the kernel function __neigh_event_send, matching the failed connections. That function discards a packet when the ARP queue length reaches neigh->parms->queue_len, which is set by the sysctl net.ipv4.neigh.*.unres_qlen.
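As a complement to dropwatch, kernels that expose an unres_discards column in /proc/net/stat/arp_cache count these drops per CPU. The sketch below sums that column. It assumes the usual layout of that file (one header row of column names, then one row of hexadecimal values per CPU) and returns None when the column is missing; the function name is illustrative.

```python
def total_unres_discards(stat_text):
    """Sum the unres_discards column over all CPU rows.

    Returns None if the text is empty or has no unres_discards column.
    """
    rows = [line.split() for line in stat_text.splitlines() if line.strip()]
    if not rows or "unres_discards" not in rows[0]:
        return None
    col = rows[0].index("unres_discards")
    # Assumption: values are hexadecimal, one row per CPU.
    return sum(int(row[col], 16) for row in rows[1:])
```

Reading the file before and after a test run and comparing the two totals gives the number of packets discarded from unresolved-neighbor queues during that run.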

Kernel parameter details: neigh/default/unres_qlen defines the maximum number of packets that may be queued for each unresolved address. Before Linux 3.3 the default was 3; in Linux 3.3 the parameter was deprecated in favor of unres_qlen_bytes, and the default became 31. When the queue is full, the excess SYN packets are dropped, and the affected connections stall until TCP retransmits.
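To see which value a given host actually uses, the per-interface and default settings can be read from /proc/sys. The following is a small sketch; the function name and the choice to return a dictionary are assumptions, and the directory is a parameter so the code can be pointed elsewhere.

```python
from pathlib import Path


def read_unres_qlen(neigh_dir="/proc/sys/net/ipv4/neigh"):
    """Return {scope: unres_qlen} for 'default' and each interface found."""
    values = {}
    for entry in sorted(Path(neigh_dir).glob("*/unres_qlen")):
        values[entry.parent.name] = int(entry.read_text().strip())
    return values
```

On an affected pre-3.3 kernel, the returned mapping would be expected to show 3 for the default scope.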

TCP connection establishment process:

1) Application sends SYN.

2) IP layer performs routing.

3) ARP layer queries the next‑hop MAC address; if no ARP entry exists, the SYN is placed in the ARP queue (limited by unres_qlen) and an ARP request is sent.

4) Upon ARP reply, the queued SYN is transmitted.

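With unres_qlen set to 3, at most three SYN packets can wait for the ARP reply; SYNs from any further concurrent connections are lost. TCP does retransmit a lost SYN, but only after its initial retransmission timeout, which is on the order of seconds and therefore far longer than the 500 ms connection timeout, so those connections fail.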
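The steps above can be modelled with a bounded queue. This is a simplified sketch, not kernel code: it assumes that all SYNs arrive before the ARP reply and that a full queue makes room by discarding its oldest packet. The count of lost packets is the same whichever packet is discarded.

```python
from collections import deque


def simulate_unresolved_queue(packets, queue_len):
    """Queue packets for an unresolved neighbor (queue_len >= 1).

    Returns (sent, dropped): what is transmitted once the ARP reply
    arrives, and what was discarded while waiting.
    """
    queue = deque()
    dropped = []
    for pkt in packets:
        if len(queue) >= queue_len:
            # Assumption: the oldest queued packet is discarded to make room.
            dropped.append(queue.popleft())
        queue.append(pkt)
    # ARP reply arrives: everything still queued is transmitted.
    return list(queue), dropped


syns = [f"SYN-{i}" for i in range(1, 17)]
sent, dropped = simulate_unresolved_queue(syns, queue_len=3)
print(len(sent), len(dropped))  # 3 13
```

With sixteen SYNs and a queue length of 3, the model reproduces the observed split of three successful and thirteen failed connections; with a queue length of 16 or more, nothing is dropped.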

Conclusion: In scenarios where applications open many simultaneous TCP connections (e.g., database connection pools) and use short connection‑timeout settings, the default unres_qlen value can cause sporadic connection failures. For kernels earlier than 3.3, increasing unres_qlen (e.g., to 64) resolves the issue.
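One way to make the change persistent is through sysctl configuration lines. The helper below only builds those lines as text; the function name is illustrative, and eth0 in the example is a hypothetical interface name. Whether the default scope alone is enough or the interface scope must also be set depends on the system, so treat the scopes as an assumption to verify.

```python
def unres_qlen_sysctl_lines(value=64, scopes=("default",)):
    """Build sysctl.conf-style lines that set unres_qlen for each scope."""
    if value < 1:
        raise ValueError("unres_qlen must be positive")
    return [f"net.ipv4.neigh.{scope}.unres_qlen = {value}" for scope in scopes]


# eth0 is a hypothetical interface name used for illustration.
for line in unres_qlen_sysctl_lines(64, ("default", "eth0")):
    print(line)
```

The value should be at least as large as the number of connections the application opens at once to a single next hop, which for the test above means 16 or more.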


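Additional note: The problem can also be reproduced with a single large ping. A ping larger than the MTU is split into several IP fragments, and each fragment occupies its own slot in the ARP queue, so a ping with more fragments than unres_qlen allows overflows the queue in the same way.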
