Analysis of TCP Connection Failures Caused by ARP Queue Length (unres_qlen) in Linux Kernels
The article investigates intermittent TCP connection failures during application server startup caused by the Linux kernel ARP queue length parameter unres_qlen, reproduces the issue with a concurrent connection test, analyzes kernel internals, and recommends increasing unres_qlen for kernels prior to 3.3.
Background: In a production environment we observed that when an application server starts and creates a connection pool to a backend database, some connections occasionally fail to establish. Investigation revealed the issue is related to the kernel ARP parameter unres_qlen.
Reproduction environment: OS RHEL 6.6, kernel 2.6.32-504.el6.x86_64. A test program runs on a client machine (10.0.0.102) that concurrently initiates 16 TCP connections with a 500 ms timeout to a server (10.0.0.101).
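The original test program is not shown, so the following is only a minimal sketch of what such a client might look like. The function name run_test and the use of a thread barrier are assumptions made for illustration; only the 16 connections and the 500 ms timeout come from the setup described above.

```python
import socket
import threading


def run_test(host, port, connections=16, timeout=0.5):
    """Open `connections` TCP connections at the same moment.

    Returns a tuple (succeeded, failed). A connection counts as failed
    if it is not established within `timeout` seconds.
    """
    results = [False] * connections
    barrier = threading.Barrier(connections)

    def worker(index):
        # Wait until every thread is ready, so that all SYNs are sent
        # as close together as possible.
        barrier.wait()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[index] = True
        except OSError:  # covers timeouts and refused connections
            results[index] = False

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(connections)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    succeeded = sum(results)
    return succeeded, connections - succeeded
```

On the client this could be called as run_test("10.0.0.101", 3306), where port 3306 is an assumed value because the article does not name the server port. The ARP entry for the server has to be removed beforehand (for example with arp -d 10.0.0.101) for the queue limit to come into play.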
Phenomenon: After clearing the ARP cache on the client, only three of the sixteen connections succeed; the remaining thirteen time out. Re‑running the test right after the failure, while the ARP entry is still cached, succeeds for all connections. Clearing the ARP cache again reproduces the timeouts.
Problem analysis: Packet capture on the server shows that only three SYN packets are received; the other thirteen never appear. Dropwatch records 13 packet drops in the kernel function __neigh_set_probe_once, matching the number of failed connections. The function discards packets when the ARP queue would grow beyond neigh->parms->queue_len, which is set from the sysctl net.ipv4.neigh.*.unres_qlen.
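Kernel parameter details: neigh/default/unres_qlen defines the maximum number of packets that may be queued for each unresolved address. Before Linux 3.3 the default was 3. From Linux 3.3 on, the parameter is deprecated in favor of unres_qlen_bytes, and its default is derived from that byte limit (31 in the kernel documentation cited here). When the queue is full, packets beyond the limit are dropped; for a connection attempt this means the SYN is lost and can only be recovered by a TCP retransmission.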
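To see which limit is in effect on a given machine, the sysctl values can be read from procfs. The sketch below assumes the standard location /proc/sys/net/ipv4/neigh; the function name read_unres_qlen and its root parameter are hypothetical and exist only so the example can be pointed at another directory.

```python
from pathlib import Path


def read_unres_qlen(root="/proc/sys/net/ipv4/neigh"):
    """Return a dict mapping each entry under `root` to its unres_qlen.

    Keys are directory names such as "default" or an interface name.
    """
    values = {}
    for entry in sorted(Path(root).glob("*/unres_qlen")):
        values[entry.parent.name] = int(entry.read_text().strip())
    return values


if __name__ == "__main__":
    for scope, qlen in read_unres_qlen().items():
        print(f"{scope}: unres_qlen={qlen}")
```

Each interface has its own entry next to default, so the value that matters is the one for the interface that carries the traffic.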
TCP connection establishment process:
1) Application sends SYN.
2) IP layer performs routing.
3) ARP layer queries the next‑hop MAC address; if no ARP entry exists, the SYN is placed in the ARP queue (limited by unres_qlen) and an ARP request is sent.
4) Upon ARP reply, the queued SYN is transmitted.
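With unres_qlen set to 3, only three of the SYN packets sent while the address is unresolved can be held in the queue; the others are lost. TCP retransmits a lost SYN only after its initial retransmission timeout, which is at least one second, so with a 500 ms connect timeout the attempt fails before any retransmission is sent.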
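The effect of the queue limit in steps 3 and 4 can be illustrated with a small model. This is a simplified sketch, not kernel code: it assumes all SYNs arrive before the ARP reply and that the oldest queued packet is evicted when the queue is full. The eviction policy is a modeling assumption and does not change the counts. The function name simulate_arp_queue is hypothetical.

```python
from collections import deque


def simulate_arp_queue(syn_count, unres_qlen):
    """Model SYN packets waiting for ARP resolution of one next hop.

    Returns (delivered, dropped): packets still queued when the ARP
    reply arrives are delivered, evicted packets are dropped.
    """
    if unres_qlen < 1:
        raise ValueError("unres_qlen must be at least 1 in this model")

    queue = deque()
    dropped = 0
    for syn_id in range(syn_count):
        if len(queue) >= unres_qlen:
            queue.popleft()  # queue full: evict one packet (assumed: oldest)
            dropped += 1
        queue.append(syn_id)

    # The ARP reply arrives: everything still queued is transmitted.
    delivered = len(queue)
    return delivered, dropped


if __name__ == "__main__":
    print(simulate_arp_queue(16, 3))   # limit of 3, as before Linux 3.3
    print(simulate_arp_queue(16, 64))  # raised limit
```

With 16 SYNs and a limit of 3 the model yields 3 delivered and 13 dropped, the same split seen in the test; with a limit of 64 nothing is dropped.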
Conclusion: In scenarios where applications open many simultaneous TCP connections (e.g., database connection pools) and use short connection‑timeout settings, the default unres_qlen value can cause sporadic connection failures. For kernels earlier than 3.3, increasing unres_qlen (e.g., to 64) resolves the issue.
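One way to apply the larger value at runtime is to write it to the corresponding procfs file. The following is a hedged sketch: the function name set_unres_qlen and its scope and root parameters are hypothetical, the value 64 is the example from the conclusion, and the write requires root privileges.

```python
from pathlib import Path


def set_unres_qlen(value, scope="default", root="/proc/sys/net/ipv4/neigh"):
    """Write `value` to <root>/<scope>/unres_qlen and return the value read back.

    `scope` is "default" or an interface name (assumed to exist).
    """
    path = Path(root) / scope / "unres_qlen"
    if not path.exists():
        raise FileNotFoundError(f"no such sysctl entry: {path}")
    path.write_text(f"{value}\n")
    return int(path.read_text().strip())


if __name__ == "__main__":
    print(set_unres_qlen(64))  # needs root; changes the running kernel
```

A write to procfs does not survive a reboot. The usual way to make it permanent is a line such as net.ipv4.neigh.default.unres_qlen = 64 in /etc/sysctl.conf. The default entry is a template for interfaces created later, so an interface that already exists may need its own entry changed as well.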
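Additional note: The problem can also be reproduced with a single large ping. A ping larger than the MTU is split into several IP fragments, each fragment is queued separately while the address is unresolved, and once the number of fragments exceeds unres_qlen the same ARP queue overflow occurs.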
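How large the ping has to be can be estimated with some arithmetic. The sketch below assumes IPv4 over Ethernet with an MTU of 1500 bytes, a 20-byte IP header without options, and an 8-byte ICMP header; the function name icmp_fragment_count is hypothetical.

```python
import math


def icmp_fragment_count(payload_bytes, mtu=1500, ip_header=20, icmp_header=8):
    """Return the number of IP fragments needed for one ICMP echo request.

    `payload_bytes` is the ICMP data size, as given to `ping -s`.
    """
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - ip_header) // 8 * 8
    return math.ceil((payload_bytes + icmp_header) / per_fragment)


if __name__ == "__main__":
    for size in (1472, 1473, 8000):
        print(size, icmp_fragment_count(size))
```

Under these assumptions ping -s 8000 produces 6 fragments, which is already more than a queue limit of 3 can hold.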
Aikesheng Open Source Community
