Why Does TCP Send‑Q Exceed SO_SNDBUF? Inside Linux Kernel Buffer Mechanics
A Linux client sets SO_SNDBUF to 4096 bytes, sends 1 KB packets while the server never reads, yet the TCP Send‑Q grows to 14480 bytes because the kernel doubles the buffer size, adds overhead, and expands memory via GSO and sk_wmem scheduling.
Problem Overview
A client creates a TCP socket, sets SO_SNDBUF to 4096 bytes, and sends 1024 bytes every second. The server never calls recv(). The expected behavior falls into three phases: (1) the server's kernel still ACKs incoming data while its receive buffer has room, (2) the receive buffer fills and the server advertises a zero window, and (3) the client's send buffer fills and send() blocks.
Observed Anomaly
Monitoring the connection with ss -nt shows the Send‑Q value rising from 0 to 14480, far exceeding the configured SO_SNDBUF of 4096 bytes.
Kernel Doubling of SO_SNDBUF
When the user sets SO_SNDBUF, the kernel stores val*2 in sk->sk_sndbuf:
@sock.c: sock_setsockopt
    case SO_SNDBUF:
        sk->sk_sndbuf = max(val * 2, SOCK_MIN_SNDBUF);
Thus a 4096‑byte request becomes an internal limit of 8192 bytes. The kernel does this to reserve space for sk_buff structures, skb_shared_info, and L2/L3/L4 headers.
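The doubling can be observed from user space by setting SO_SNDBUF and reading it back. The following is a minimal sketch, not the author's test program; the helper name effective_sndbuf is hypothetical, and the value reported depends on the platform and kernel.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: request `requested` bytes of send buffer on a fresh
 * TCP socket and return the value the kernel reports back, or -1 on error. */
static int effective_sndbuf(int requested)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int reported = -1;
    socklen_t len = sizeof(reported);

    /* Set the option, then read back what the kernel actually stored. */
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &requested, sizeof(requested)) != 0 ||
        getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &reported, &len) != 0)
        reported = -1;

    close(fd);
    return reported;
}
```

On Linux, socket(7) documents that getsockopt() returns the doubled value, so effective_sndbuf(4096) would be expected to report 8192 there; other systems may simply echo the requested size.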
Why Send‑Q Still Grows
The kernel tracks the actual memory used by the send queue in sk->sk_wmem_queued, which includes both user data and the overhead of each sk_buff. Consequently sk->sk_wmem_queued can be larger than the visible Send‑Q value.
@sock.h
bool sk_stream_memory_free(const struct sock *sk) {
    if (sk->sk_wmem_queued >= sk->sk_sndbuf)
        return false; // not enough memory
    ...
}
Each time a packet is queued, sk_wmem_queued increases by skb->truesize (the packet size plus overhead) and decreases when the packet is ACKed.
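The accounting described above can be mimicked with a small user-space model. This is a sketch under stated assumptions, not kernel code: struct sock_model and the model_* functions are hypothetical names, and the truesize passed in is whatever illustrative value the caller picks.

```c
#include <stdbool.h>

/* Toy stand-in for the two struct sock fields discussed above. */
struct sock_model {
    int sndbuf;       /* models sk->sk_sndbuf */
    int wmem_queued;  /* models sk->sk_wmem_queued */
};

/* Mirrors the comparison in sk_stream_memory_free(). */
static bool model_memory_free(const struct sock_model *sk)
{
    return sk->wmem_queued < sk->sndbuf;
}

/* Queue one skb: charge its full truesize, not just the payload. */
static void model_queue_skb(struct sock_model *sk, int truesize)
{
    sk->wmem_queued += truesize;
}

/* ACK one skb: release the same amount. */
static void model_ack_skb(struct sock_model *sk, int truesize)
{
    sk->wmem_queued -= truesize;
}

/* Count how many skbs of a given (positive) truesize are queued
 * before the free-memory check starts failing. */
static int skbs_until_full(int sndbuf, int truesize)
{
    struct sock_model sk = { .sndbuf = sndbuf, .wmem_queued = 0 };
    int count = 0;

    while (model_memory_free(&sk)) {
        model_queue_skb(&sk, truesize);
        count++;
    }
    return count;
}
```

With a limit of 8192 and an assumed truesize of 2304 (1024 bytes of payload plus 1280 bytes of illustrative overhead), the model admits four skbs. Because the check runs before each skb is charged, the accounted memory ends at 9216, above the limit, while only 4096 bytes of payload are queued.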
tcp_sendmsg Logic
The function tcp_sendmsg decides whether to allocate a new sk_buff or to append data to the last one in the write queue.
int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) {
    while (msg_data_left(msg)) {
        int copy = 0;
        int max = size_goal;
        skb = tcp_write_queue_tail(sk);
        if (tcp_send_head(sk)) {
            copy = max - skb->len;
        }
        if (copy <= 0) {
            // Case 1: allocate new skb
            if (!sk_stream_memory_free(sk))
                goto wait_for_sndbuf;
            skb = sk_stream_alloc_skb(...);
        } else {
            // Case 2: append to existing skb
            if (!sk_wmem_schedule(sk, copy))
                goto wait_for_memory;
            skb_copy_to_page_nocache(..., copy);
        }
    }
}
Case 1 – Allocate New sk_buff
During Phase 1 the client creates a fresh sk_buff for each 1024‑byte segment, and the kernel’s sk_stream_memory_free check passes because the internal buffer (8192 bytes) is not yet exhausted.
Case 2 – Append to Existing sk_buff
In Phase 2 the client’s send buffer already contains queued sk_buffs, so the kernel tries to append new data to the last sk_buff. The amount that can be appended is copy = size_goal - skb->len.
How size_goal Is Computed
size_goal is derived from the maximum segment size (MSS) and the GSO (Generic Segmentation Offload) setting of the NIC:
static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now, int large_allowed) {
    struct tcp_sock *tp = tcp_sk(sk);
    u32 size_goal;

    // Simplified excerpt: without GSO the goal is a single MSS
    if (!large_allowed || !sk_can_gso(sk))
        return mss_now;
    size_goal = tp->gso_segs * mss_now;
    return max(size_goal, mss_now);
}
GSO enabled: size_goal = tp->gso_segs * mss_now
GSO disabled: size_goal = mss_now
In the author's test environment the effective MSS is 1448 bytes, and size_goal becomes 14480 bytes (10 × MSS). Therefore, when Phase 2 starts, tcp_sendmsg calculates copy = 14480 - 1024 = 13456 bytes.
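The arithmetic above can be checked with a short user-space sketch. The function names model_size_goal and model_append_room are hypothetical, and the model keeps only the two branches quoted in the excerpt; it leaves out the other bounds the real kernel function applies.

```c
#include <stdbool.h>

/* Simplified model of the size_goal rule: without GSO the goal is one MSS,
 * with GSO it is gso_segs * MSS, and never less than one MSS. */
static unsigned int model_size_goal(unsigned int mss_now,
                                    unsigned int gso_segs,
                                    bool gso_enabled)
{
    if (!gso_enabled)
        return mss_now;

    unsigned int goal = gso_segs * mss_now;
    return goal > mss_now ? goal : mss_now;
}

/* Room left in the tail skb; a result <= 0 means a new skb is needed. */
static int model_append_room(unsigned int size_goal, unsigned int tail_len)
{
    return (int)size_goal - (int)tail_len;
}
```

With the values from the text, model_size_goal(1448, 10, true) yields 14480 and model_append_room(14480, 1024) yields 13456, whereas with GSO disabled the goal falls back to 1448 and only 424 bytes of room remain after the first 1024-byte write.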
Memory Expansion via sk_wmem_schedule
Before copying data, tcp_sendmsg calls sk_wmem_schedule, which can increase the socket’s allocated memory beyond sk_sndbuf:
    if (!sk_wmem_schedule(sk, copy))
        goto wait_for_memory;
The helper __sk_mem_schedule adds memory in quantum units and performs additional checks, allowing sk_wmem_queued to exceed the original sk_sndbuf limit when system memory permits.
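A rough model of this reservation step makes the key point visible: the per-socket send buffer limit never enters the decision. This is a simplified sketch, not the kernel implementation. MODEL_QUANTUM is an assumed quantum size, the pool_used and pool_limit fields are hypothetical stand-ins for the protocol-wide memory checks, and memory-pressure handling is left out entirely.

```c
#include <stdbool.h>

#define MODEL_QUANTUM 4096  /* assumed quantum size, for illustration only */

struct fwd_model {
    int  forward_alloc;  /* bytes already reserved for this socket */
    long pool_used;      /* hypothetical protocol-wide usage, in quanta */
    long pool_limit;     /* hypothetical protocol-wide limit, in quanta */
};

/* Round a byte count up to whole quanta. */
static int model_quanta(int bytes)
{
    return (bytes + MODEL_QUANTUM - 1) / MODEL_QUANTUM;
}

/* Reserve memory for `size` bytes. Succeeds immediately if the existing
 * reservation covers the request; otherwise reserves whole quanta as long
 * as the shared pool allows it. No send-buffer limit is consulted here. */
static bool model_wmem_schedule(struct fwd_model *sk, int size)
{
    if (size <= sk->forward_alloc)
        return true;

    int amt = model_quanta(size);
    if (sk->pool_used + amt > sk->pool_limit)
        return false;

    sk->pool_used += amt;
    sk->forward_alloc += amt * MODEL_QUANTUM;
    return true;
}
```

In this model a request for 13456 bytes reserves four quanta (16384 bytes) in one step, and a later 1024-byte request is satisfied from that reservation without any further check. Unlike the kernel, the model does not draw the reservation down as data is charged against it.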
Implications and Possible Fixes
Because the kernel can grow the send queue past the user‑specified SO_SNDBUF, the setting behaves as a soft limit rather than a hard cap on queued bytes. Two practical mitigations are suggested:
Disable the NIC’s GSO feature, which reduces size_goal to the MSS.
Patch the kernel to move the sk_stream_memory_free check to the beginning of the while loop in tcp_sendmsg, preventing allocation when the buffer is already full.
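The effect of both mitigations can be explored with a toy simulation of the send loop against a peer that never ACKs. This is a sketch under stated assumptions, not kernel code: all names are hypothetical, SKB_OVERHEAD is an assumed per-skb cost, and the check_first flag stands in for the proposed patch that tests the buffer limit at the top of every loop iteration.

```c
#include <stdbool.h>

#define SKB_OVERHEAD 1280  /* assumed per-skb overhead, illustrative only */

struct sendq_model {
    int sndbuf;       /* models sk->sk_sndbuf */
    int wmem_queued;  /* models sk->sk_wmem_queued */
    int size_goal;    /* models size_goal */
    int tail_len;     /* payload bytes in the tail skb, -1 if none exists */
    int queued;       /* total payload bytes queued (what Send-Q reports) */
};

/* One send() call of `size` bytes. Returns the bytes accepted; fewer than
 * `size` means the caller would block. */
static int model_send(struct sendq_model *sk, int size, bool check_first)
{
    int left = size;

    while (left > 0) {
        bool has_room = sk->wmem_queued < sk->sndbuf;

        if (check_first && !has_room)
            break;                    /* patched variant: stop early */

        int copy = sk->tail_len >= 0 ? sk->size_goal - sk->tail_len : 0;
        if (copy <= 0) {
            if (!has_room)
                break;                /* Case 1 fails: wait_for_sndbuf */
            sk->tail_len = 0;         /* Case 1: start a new skb */
            sk->wmem_queued += SKB_OVERHEAD;
            copy = sk->size_goal;
        }
        if (copy > left)
            copy = left;

        /* Case 2 (and the first fill of a new skb): no sndbuf check. */
        sk->tail_len += copy;
        sk->wmem_queued += copy;
        sk->queued += copy;
        left -= copy;
    }
    return size - left;
}

/* Keep sending 1024-byte chunks until a call is cut short. */
static int model_fill(int sndbuf, int size_goal, bool check_first)
{
    struct sendq_model sk = {
        .sndbuf = sndbuf, .wmem_queued = 0,
        .size_goal = size_goal, .tail_len = -1, .queued = 0,
    };

    while (model_send(&sk, 1024, check_first) == 1024)
        ;
    return sk.queued;
}
```

With sndbuf = 8192 and size_goal = 14480, the unpatched model stops at 14480 queued bytes, matching the observed Send‑Q. With check_first enabled it stops at 7168 bytes, a figure that depends on the assumed overhead. Lowering size_goal to 1448, as disabling GSO would, also keeps the queue well below 14480 in this model.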
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)
