
In‑Depth Analysis of TCP Connection Timeout, TIME_WAIT, Nagle Algorithm and Kernel Parameters

This article examines three common TCP issues—missing client‑side socket timeouts leading to monitor‑killed processes, excessive TIME_WAIT sockets after service failures and their kernel tunables, and 40 ms keep‑alive latency caused by Nagle and delayed ACK—explaining kernel behavior and offering practical configuration fixes.

Tencent Music Tech Team

In network programming, abnormal situations often require a deep understanding of the TCP/IP protocol stack. This article presents three common problems encountered in production, analyzes the kernel implementation, and proposes practical solutions.

Problem 1: Server process killed by monitor due to heartbeat timeout

The backend framework uses a proxy‑worker model. The monitor kills a worker if it does not reply to a heartbeat within 60 seconds. The root cause was a missing socket timeout on the client side, causing the connect() call to block until the kernel’s own timeout expires.

The default timeout is determined by the kernel's TCP connect implementation. The relevant function, __inet_stream_connect() in net/ipv4/af_inet.c, obtains the send timeout via sock_sndtimeo(sk, flags & O_NONBLOCK). If no send timeout has been set on the socket, the kernel falls back to the SYN retransmission timer.

During the three‑way handshake, after the SYN is sent, the kernel starts a retransmission timer with exponential back‑off. The number of SYN retransmissions is controlled by the sysctl net.ipv4.tcp_syn_retries (default 5 on older kernels; 6 since Linux 3.7). With 5 retries, the total timeout is the sum of the back‑off intervals: 1 s + 2 s + 4 s + 8 s + 16 s + 32 s = 63 s.

Verification with telnet and tcpdump shows a total elapsed time of ~63 seconds when the destination is unreachable, matching the back‑off schedule.

To set a custom timeout, either use a non‑blocking socket with select()/poll(), or set the send timeout (SO_SNDTIMEO) before calling connect() in blocking mode:

int connect_with_timeout(const char *ip, uint16_t port)
{
    struct timeval timeo = { .tv_sec = 1, .tv_usec = 0 };   /* 1-second timeout */
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(port) };
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    inet_pton(AF_INET, ip, &addr.sin_addr);
    /* On Linux, SO_SNDTIMEO also bounds a blocking connect() */
    setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &timeo, sizeof(timeo));
    return connect(fd, (struct sockaddr *)&addr, sizeof(addr));
}
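The non‑blocking alternative mentioned above can be sketched as follows (connect_nonblock() and its arguments are illustrative names, not from the original). The key subtlety is that poll() reporting writability does not mean the connect succeeded; SO_ERROR must be checked:

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns a connected fd, or -1 on error or timeout. */
static int connect_nonblock(const char *ip, uint16_t port, int timeout_ms)
{
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(port) };
    struct pollfd pfd;
    int err = 0;
    socklen_t len = sizeof(err);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    inet_pton(AF_INET, ip, &addr.sin_addr);
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
        return fd;                      /* connected immediately (e.g. loopback) */
    if (errno != EINPROGRESS)
        goto fail;

    pfd.fd = fd;
    pfd.events = POLLOUT;
    if (poll(&pfd, 1, timeout_ms) != 1)
        goto fail;                      /* timed out or poll() failed */

    /* Writability alone does not mean success: check SO_ERROR. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0)
        goto fail;
    return fd;

fail:
    close(fd);
    return -1;
}
```

Unlike the SO_SNDTIMEO approach, this leaves the socket usable for further non‑blocking I/O and gives millisecond‑granularity control over the timeout.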

Problem 2: Excessive TIME_WAIT sockets after backend service failures

Short‑lived connections to failing backend services generate many sockets in the TIME_WAIT state, eventually exhausting file descriptors. The /proc/net/sockstat output and commands like netstat -ant or ss -ant reveal the count.

Linux provides three tunables to mitigate TIME_WAIT buildup:

tcp_tw_recycle – fast recycling of TIME_WAIT sockets (requires monotonically increasing timestamps; unsafe behind NAT).

tcp_tw_reuse – allows reuse of TIME_WAIT sockets for new connections when sequence numbers are safe.

tcp_max_tw_buckets – caps the total number of TIME_WAIT sockets; excess connections are closed immediately.
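On a machine where these tunables apply, a sysctl fragment might look like the following (values are illustrative, not recommendations from the original):

```shell
# /etc/sysctl.conf — illustrative values, tune per environment
net.ipv4.tcp_tw_reuse = 1            # client-side outgoing connections only
net.ipv4.tcp_max_tw_buckets = 180000 # excess TIME_WAIT sockets are closed
# net.ipv4.tcp_tw_recycle = 1        # unsafe behind NAT; removed in Linux 4.12
```

Apply with `sysctl -p`.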

With tcp_tw_recycle enabled, the kernel checks that an incoming SYN carries a timestamp newer than the last one seen from the same IP within the MSL window. In NAT environments this can cause legitimate connections to be rejected, because multiple hosts share one source IP and their timestamps are not guaranteed to increase monotonically.

The kernel code (net/ipv4/tcp_ipv4.c) illustrates the check:

int tcp_conn_request(struct request_sock_ops *rsk_ops,
                     const struct tcp_request_sock_ops *af_ops,
                     struct sock *sk, struct sk_buff *skb)
{
    if (!want_cookie && !isn) {
        if (tcp_death_row.sysctl_tw_recycle) {
            bool strict;
            dst = af_ops->route_req(sk, &fl, req, &strict);
            if (dst && strict &&
                !tcp_peer_is_proven(req, dst, true, tmp_opt.saw_tstamp)) {
                NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
                goto drop_and_release;
            }
        }
    }
    ...
}

For tcp_tw_reuse, the kernel ensures the new SYN’s sequence number and timestamp are greater than those of the existing TIME_WAIT socket.

For tcp_max_tw_buckets, allocation fails when the count exceeds the configured limit:

struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
                                         struct inet_timewait_death_row *dr,
                                         const int state)
{
    if (atomic_read(&dr->tw_count) >= dr->sysctl_max_tw_buckets)
        return NULL;
    ...
}

Best practice: reduce the number of TIME_WAIT sockets at the source by using connection pooling (long‑lived connections), or widen net.ipv4.ip_local_port_range to raise the ceiling. Enable tcp_tw_reuse only on client‑only machines; avoid tcp_tw_recycle in NAT environments.

Problem 3: 40 ms latency on keep‑alive connections caused by Nagle’s algorithm and delayed ACK

During a keep‑alive HTTP request, the server sends two small packets (header and body). The second packet is delayed ~40 ms because the first packet’s ACK is postponed by the delayed‑ACK timer. The Nagle algorithm holds back small packets until the previous data is acknowledged.

Key kernel snippets (net/ipv4/tcp_output.c) show the Nagle check:

static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
                            int nonagle)
{
    return partial &&
           ((nonagle & TCP_NAGLE_CORK) ||
            (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
}

The delayed‑ACK logic (net/ipv4/tcp_input.c) decides whether to send an immediate ACK or schedule a delayed one:

static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
{
    struct tcp_sock *tp = tcp_sk(sk);
    if (/* large packet or quick‑ack mode or out‑of‑order */)
        tcp_send_ack(sk);
    else
        tcp_send_delayed_ack(sk);
}

When a connection starts, the kernel is in “quick‑ack” mode to accelerate slow start, so short connections do not suffer from Nagle or delayed ACK. Long‑lived connections, however, soon enter “ping‑pong” (interactive) mode, where the delayed‑ACK timer is clamped between TCP_DELACK_MIN (40 ms) and TCP_DELACK_MAX (200 ms); the 40 ms minimum is exactly the delay observed.

Solutions include merging the HTTP header and body into a single write, or disabling Nagle’s algorithm with setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, …). Popular servers such as Nginx already disable Nagle for keep‑alive connections.
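Both fixes can be sketched in a few lines (disable_nagle() and send_response() are illustrative names, not from the original). The writev() variant avoids the problem at the source by handing the kernel header and body as one write, so Nagle never sees two back‑to‑back small segments:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Option 1: disable Nagle so small writes are sent immediately. */
static int disable_nagle(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}

/* Option 2: merge header and body into a single write with writev(),
 * so the kernel can coalesce them into one segment. */
static ssize_t send_response(int fd, const char *hdr, const char *body)
{
    struct iovec iov[2] = {
        { .iov_base = (void *)hdr,  .iov_len = strlen(hdr)  },
        { .iov_base = (void *)body, .iov_len = strlen(body) },
    };
    return writev(fd, iov, 2);
}
```

Merging the writes is generally preferable when feasible, since TCP_NODELAY trades the 40 ms stall for more small packets on the wire.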

All images referenced are from “TCP/IP Illustrated” and the code excerpts are from the Linux kernel source.

Tags: TCP · Linux Kernel · TIME_WAIT · Connection Timeout · Nagle algorithm · Socket Programming