
Analysis of TCP Connection Failures Caused by ARP Queue Length (unres_qlen) in Linux Kernels

The article investigates intermittent TCP connection failures during application server startup caused by the Linux kernel ARP queue length parameter unres_qlen, reproduces the issue with a concurrent connection test, analyzes kernel internals, and recommends increasing unres_qlen for kernels prior to 3.3.

Aikesheng Open Source Community

Background: In a production environment we observed that when an application server starts and creates a connection pool to a backend database, some connections occasionally fail to establish. Investigation revealed the issue is related to the kernel ARP parameter unres_qlen.

Reproduction environment: OS RHEL 6.6, kernel 2.6.32-504.el6.x86_64. A test program runs on a client machine (10.0.0.102) that concurrently initiates 16 TCP connections with a 500 ms timeout to a server (10.0.0.101).
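The original test program is not shown, so the following is only a minimal sketch of a comparable test: it opens a number of TCP connections concurrently, each with a 500 ms timeout, and records which ones succeed. The function names `try_connect` and `run_test` are hypothetical, and the port is left to the caller because the source does not state it.

```python
import socket
import threading


def try_connect(host, port, timeout, results, index):
    """Attempt one TCP connection and record the outcome in results[index]."""
    try:
        # create_connection applies the timeout to the connect() call.
        with socket.create_connection((host, port), timeout=timeout):
            results[index] = "ok"
    except OSError as exc:  # socket.timeout is a subclass of OSError
        results[index] = f"failed: {exc}"


def run_test(host, port, connections=16, timeout=0.5):
    """Start all connection attempts at roughly the same time and wait for them."""
    results = [None] * connections
    threads = [
        threading.Thread(target=try_connect, args=(host, port, timeout, results, i))
        for i in range(connections)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

To approximate the scenario, one would clear the client's ARP entry for the server first (for example with `arp -d 10.0.0.101`, run as root) and then call `run_test("10.0.0.101", port)` with the port of the listening service.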

Phenomenon: After clearing the ARP cache on the client, only three of the sixteen connections succeed; the remaining thirteen time out. Re-running the test right after a failed run succeeds, because the ARP entry is now cached. Clearing the ARP cache again reproduces the timeouts.

Problem analysis: A packet capture on the server shows only three SYN packets arriving; the other thirteen never appear. On the client, dropwatch reports 13 drops in the kernel function __neigh_event_send, matching the number of failed connections. This function discards a packet once the number of packets waiting for ARP resolution reaches neigh->parms->queue_len, which is set through the sysctl net.ipv4.neigh.*.unres_qlen.
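When dropwatch is not available, the neighbour cache statistics may offer a rough cross-check. The sketch below assumes the kernel exposes /proc/net/stat/arp_cache with a header row containing an `unresolved_discards` column and one row of hexadecimal counters per CPU; that layout is an assumption and should be verified on the target system. The function name `unres_discards` is hypothetical.

```python
def unres_discards(path="/proc/net/stat/arp_cache"):
    """Sum the unresolved_discards column across all per-CPU rows.

    Assumes the first line is a header naming the columns and that every
    following line holds hexadecimal counters in the same column order.
    """
    with open(path) as fh:
        header = fh.readline().split()
        col = header.index("unresolved_discards")
        return sum(int(line.split()[col], 16) for line in fh if line.strip())
```

Reading the counter before and after a test run and comparing the two values would show how many packets were discarded from unresolved-neighbour queues in between.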

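Kernel parameter details: neigh/default/unres_qlen defines the maximum number of packets queued for each unresolved address. Before Linux 3.3 the default was 3. Since Linux 3.3 the parameter is deprecated in favor of unres_qlen_bytes, and the kernel documentation of that era lists an effective default of 31. When the queue is full, additional SYN packets are dropped and are resent only when the TCP retransmission timer fires, which is too late for a connection attempt with a 500 ms timeout.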

TCP connection establishment process:

1) Application sends SYN.

2) IP layer performs routing.

3) ARP layer queries the next‑hop MAC address; if no ARP entry exists, the SYN is placed in the ARP queue (limited by unres_qlen) and an ARP request is sent.

4) Upon ARP reply, the queued SYN is transmitted.

With unres_qlen set to 3, only three SYN packets can wait for the ARP reply. In the test, the other 13 of the 16 concurrent SYNs are dropped, so those connections time out.
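The arithmetic can be illustrated with a toy model of the queue. This is not kernel code: it is a sketch that assumes a full queue discards its oldest packet to make room for the new one, and the function name `simulate_unresolved_queue` is hypothetical.

```python
from collections import deque


def simulate_unresolved_queue(packets, unres_qlen):
    """Model a per-neighbour queue holding at most unres_qlen packets.

    Assumption: when the queue is full, the oldest queued packet is
    discarded before the new one is appended.
    Returns (packets still queued when ARP resolves, packets dropped).
    """
    if unres_qlen < 1:
        raise ValueError("unres_qlen must be at least 1 in this model")
    queue = deque()
    dropped = []
    for pkt in packets:
        if len(queue) >= unres_qlen:
            dropped.append(queue.popleft())
        queue.append(pkt)
    return list(queue), dropped


syns = [f"SYN-{i}" for i in range(1, 17)]
sent, lost = simulate_unresolved_queue(syns, unres_qlen=3)
print(len(sent), len(lost))  # prints: 3 13
```

With `unres_qlen=64` the same call keeps all 16 packets and drops none, which matches the effect of the larger setting.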

Conclusion: In scenarios where applications open many simultaneous TCP connections (e.g., database connection pools) and use short connection‑timeout settings, the default unres_qlen value can cause sporadic connection failures. For kernels earlier than 3.3, increasing unres_qlen (e.g., to 64) resolves the issue.
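One possible way to apply the setting at runtime is to write to the sysctl file under /proc/sys. The sketch below reads the current value and raises it only if it is lower than the requested minimum. The function name `ensure_unres_qlen` and its parameters are hypothetical; the `proc_root` argument exists only so the function can be tried against a scratch directory.

```python
from pathlib import Path


def ensure_unres_qlen(minimum=64, scope="default", proc_root="/proc/sys"):
    """Raise net.ipv4.neigh.<scope>.unres_qlen to at least `minimum`.

    Returns (old_value, new_value). Writing to the real /proc/sys
    requires root privileges.
    """
    path = Path(proc_root) / "net/ipv4/neigh" / scope / "unres_qlen"
    current = int(path.read_text().strip())
    if current >= minimum:
        return current, current  # already large enough, leave untouched
    path.write_text(f"{minimum}\n")
    return current, minimum
```

A value written this way does not survive a reboot; the usual way to persist it is a line such as `net.ipv4.neigh.default.unres_qlen = 64` in /etc/sysctl.conf. The `default` scope typically serves as the template for interfaces created later, so an interface that already exists may need its own entry, passed as `scope` (for example `scope="eth0"`, an illustrative interface name).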

References:

Understanding RTT impact on TCP retransmissions

Linux kernel IP sysctl documentation

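Additional note: The problem can also be reproduced with a single large ping packet. A payload large enough to be split into more than three IP fragments overflows the same ARP queue, because each fragment is queued as a separate packet.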
