Unveiling Linux’s Network Send Path: From send() to the Wire
This article provides a deep, step‑by‑step analysis of how Linux 3.10 processes a send() call—from user‑space socket handling through TCP, IP routing, queueing, driver DMA mapping, and both hard and soft interrupt handling—answering why CPU time appears in sy/si, why NET_RX softirqs dominate, and what memory copies occur during transmission.
Overview of Linux Network Sending Process
The article walks through the complete path a packet takes after a user program calls send(), using a minimal server example and the Intel igb driver on Linux 3.10 as a concrete case study.
1. High‑Level Flow
Data moves from user space to kernel space, through the socket layer, TCP stack, IP layer, routing, neighbour resolution, queueing disciplines, and finally the network device driver, which hands the packet to the NIC.
2. NIC Initialization and Ring Buffers
Modern NICs support multiple queues, each represented by a ring buffer. During driver open (__igb_open), the driver allocates transmit and receive descriptor arrays and starts all queues.
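static int __igb_open(struct net_device *netdev, bool resuming)
{
    struct igb_adapter *adapter = netdev_priv(netdev);
    int err;
    /* allocate the transmit rings, then the receive rings */
    err = igb_setup_all_tx_resources(adapter);
    err = igb_setup_all_rx_resources(adapter);
    /* error handling and interrupt setup omitted */
    /* enable every transmit queue */
    netif_tx_start_all_queues(netdev);
    return 0;
}
Each transmit ring is backed by two arrays: an igb_tx_buffer array that the kernel uses for bookkeeping (allocated with vzalloc) and an e1000_adv_tx_desc array of DMA descriptors that the NIC reads (allocated with dma_alloc_coherent).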
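The two-array layout can be sketched in userspace. The following is a toy model, not driver code: the struct names, RING_SIZE, the done flag, and nic_complete_all are hypothetical stand-ins, and only the next_to_use / next_to_clean indices follow the naming igb uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8  /* hypothetical; real rings hold hundreds of entries */

/* Software-side bookkeeping, loosely modelled on igb_tx_buffer. */
struct tx_buffer {
    const void *pkt;   /* packet kept by the "kernel" until completion */
    size_t len;
};

/* Hardware-side descriptor, loosely modelled on e1000_adv_tx_desc. */
struct tx_desc {
    uint64_t addr;     /* stands in for the DMA address */
    uint32_t len;
    uint32_t done;     /* set by the "NIC" once the packet has been sent */
};

struct tx_ring {
    struct tx_buffer buf[RING_SIZE];
    struct tx_desc desc[RING_SIZE];
    unsigned next_to_use;    /* producer index (transmit path) */
    unsigned next_to_clean;  /* consumer index (completion path) */
};

/* Free slots; one slot stays empty so that full and empty differ. */
static unsigned ring_unused(const struct tx_ring *r)
{
    return (r->next_to_clean + RING_SIZE - r->next_to_use - 1) % RING_SIZE;
}

/* Queue one packet by filling both arrays at the same index. */
static int ring_xmit(struct tx_ring *r, const void *pkt, size_t len)
{
    if (ring_unused(r) == 0)
        return -1;                        /* ring full, caller must retry */
    unsigned i = r->next_to_use;
    r->buf[i].pkt = pkt;
    r->buf[i].len = len;
    r->desc[i].addr = (uint64_t)(uintptr_t)pkt;
    r->desc[i].len = (uint32_t)len;
    r->desc[i].done = 0;
    r->next_to_use = (i + 1) % RING_SIZE;
    return 0;
}

/* Pretend the NIC has sent everything queued so far. */
static void nic_complete_all(struct tx_ring *r)
{
    for (unsigned i = r->next_to_clean; i != r->next_to_use;
         i = (i + 1) % RING_SIZE)
        r->desc[i].done = 1;
}

/* Completion path: release finished entries, return how many. */
static unsigned ring_clean(struct tx_ring *r)
{
    unsigned cleaned = 0;
    while (r->next_to_clean != r->next_to_use &&
           r->desc[r->next_to_clean].done) {
        unsigned i = r->next_to_clean;
        r->buf[i].pkt = NULL;
        r->buf[i].len = 0;
        r->next_to_clean = (i + 1) % RING_SIZE;
        cleaned++;
    }
    assert(ring_unused(r) <= RING_SIZE - 1);
    return cleaned;
}
```

Calling ring_xmit twice, then nic_complete_all and ring_clean, returns the ring to RING_SIZE - 1 free slots. The same index addresses both arrays, which is why the completion path can find the software entry for a descriptor the hardware has finished with.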
3. accept() Creates a New Socket
After accept(), the kernel creates a new socket object linked to the process’s file descriptor table. The article references a separate deep‑dive on epoll for the full source walk.
4. The Sending Journey
4.1 send() System Call
send() is a thin wrapper around sendto(). sendto() looks up the socket from the file descriptor, builds a struct msghdr, and calls sock_sendmsg, which for TCP sockets eventually reaches inet_sendmsg.
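The equivalence is visible from userspace too. The sketch below is an illustration under assumptions: it uses a connected AF_UNIX socketpair instead of TCP so that it needs no network, and the function name send_matches_sendto is made up for this example.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Send "ab" with send() and "cd" with sendto(..., NULL, 0) over a connected
 * stream socket pair, then read the bytes back. Returns 0 if the peer sees
 * "abcd", -1 on any failure.
 */
static int send_matches_sendto(void)
{
    int sv[2];
    char got[4];
    size_t off = 0;
    int rc = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    if (send(sv[0], "ab", 2, 0) != 2)
        goto out;
    /* The same operation, spelled with an explicit NULL destination. */
    if (sendto(sv[0], "cd", 2, 0, NULL, 0) != 2)
        goto out;

    /* A stream socket may return the bytes in several pieces. */
    while (off < sizeof got) {
        ssize_t n = recv(sv[1], got + off, sizeof got - off, 0);
        if (n <= 0)
            goto out;
        off += (size_t)n;
    }
    assert(off == sizeof got);
    rc = memcmp(got, "abcd", 4) == 0 ? 0 : -1;
out:
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

On a connected socket both calls take the same path into the kernel, so the receiver cannot tell them apart.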
SYSCALL_DEFINE4(send, int, fd, void __user *, buff, size_t, len,
        unsigned int, flags)
{
    /* send() is sendto() with no destination address */
    return sys_sendto(fd, buff, len, flags, NULL, 0);
}
4.2 Transport Layer – TCP
inet_sendmsg dispatches to the protocol‑specific tcp_sendmsg. That function allocates an skb when the tail of the write queue has no room left, appends it to the socket’s write queue with skb_entail, and copies user data into it.
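int tcp_sendmsg(struct kiocb *iocb, struct sock *sk,
        struct msghdr *msg, size_t size)
{
    /* declarations and setup omitted */
    while (--iovlen >= 0) {
        /* user-space address of this chunk */
        unsigned char __user *from = iov->iov_base;
        /* inner per-chunk loop omitted */
        if (copy <= 0) {
            /* no room in the tail skb: allocate a new one */
            skb = sk_stream_alloc_skb(sk, select_size(sk, sg),
                    sk->sk_allocation);
            /* append it to the socket's write queue */
            skb_entail(sk, skb);
        }
        /* copy user data into the skb */
        if (skb_availroom(skb) > 0)
            skb_add_data_nocache(sk, skb, from, copy);
    }
    /* push logic omitted for brevity */
}
The one full data copy on this path happens here: skb_add_data_nocache copies the user buffer into the skb. A second, shallow copy follows later, when tcp_transmit_skb clones the skb before passing it down. The clone duplicates only the sk_buff descriptor and shares the payload with the original.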
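How user data ends up in a chain of buffers on the write queue can be modelled as follows. This is a rough userspace sketch, not the kernel algorithm: struct seg, struct write_queue, SEG_CAP, and the function names are hypothetical, and memcpy stands in for the user-to-kernel copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

#define SEG_CAP 1460   /* hypothetical per-buffer capacity, standing in for the MSS */

struct seg {               /* toy stand-in for an skb */
    struct seg *next;
    size_t len;
    unsigned char data[SEG_CAP];
};

struct write_queue {       /* toy stand-in for the socket's write queue */
    struct seg *head, *tail;
    size_t nsegs;
};

/* Copy every iovec element into the queue; returns bytes queued or -1. */
static long queue_user_data(struct write_queue *q,
                            const struct iovec *iov, int iovlen)
{
    long total = 0;

    for (int i = 0; i < iovlen; i++) {
        const unsigned char *from = iov[i].iov_base;
        size_t seglen = iov[i].iov_len;

        while (seglen > 0) {
            struct seg *s = q->tail;
            size_t room = s ? SEG_CAP - s->len : 0;

            if (room == 0) {            /* tail full or missing: allocate */
                s = calloc(1, sizeof *s);
                if (!s)
                    return -1;
                if (q->tail)
                    q->tail->next = s;  /* append to the queue */
                else
                    q->head = s;
                q->tail = s;
                q->nsegs++;
                room = SEG_CAP;
            }

            size_t copy = seglen < room ? seglen : room;
            memcpy(s->data + s->len, from, copy);  /* the data copy */
            s->len += copy;
            assert(s->len <= SEG_CAP);
            from += copy;
            seglen -= copy;
            total += (long)copy;
        }
    }
    return total;
}

/* Release every buffer on the queue. */
static void queue_free(struct write_queue *q)
{
    struct seg *s = q->head;
    while (s) {
        struct seg *next = s->next;
        free(s);
        s = next;
    }
    q->head = q->tail = NULL;
    q->nsegs = 0;
}
```

Queuing a 1000-byte and a 2000-byte buffer this way produces three segments of 1460, 1460, and 80 bytes: the second buffer first tops up the partly filled tail before a new segment is allocated.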
4.3 TCP Transmission – tcp_write_xmit
When the send conditions are met (for example, the Nagle test passes or a push is forced), tcp_write_xmit takes skbs from the head of the write queue, checks the congestion window and the receiver's window, and then calls tcp_transmit_skb for each skb that passes.
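The window checks in that loop can be modelled in a few lines. This is a simplified sketch under assumptions: struct tx_state and the function names are invented for illustration, segments are represented only by their lengths, and Nagle, TSO, and the FIN special case are left out.

```c
#include <assert.h>
#include <stddef.h>

struct tx_state {
    unsigned cwnd;        /* congestion window, in segments */
    unsigned in_flight;   /* segments sent but not yet acknowledged */
    size_t   wnd_bytes;   /* bytes the receiver's window still allows */
};

/* Segments the congestion window still permits (0 means stop). */
static unsigned cwnd_quota(const struct tx_state *st)
{
    return st->in_flight < st->cwnd ? st->cwnd - st->in_flight : 0;
}

/*
 * Walk the queued segment lengths from the head and "transmit" while both
 * windows allow it. Returns how many segments were sent.
 */
static size_t write_xmit(struct tx_state *st,
                         const size_t *seg_len, size_t nsegs)
{
    size_t sent = 0;

    while (sent < nsegs) {
        if (cwnd_quota(st) == 0)
            break;                    /* congestion window is full */
        if (seg_len[sent] > st->wnd_bytes)
            break;                    /* receiver window too small */
        st->wnd_bytes -= seg_len[sent];
        st->in_flight++;
        sent++;
    }
    assert(sent == 0 || st->in_flight <= st->cwnd);
    return sent;
}
```

With cwnd = 3 and one segment already in flight, only two of four queued segments go out. With a large cwnd but a 3000-byte receive window, two 1460-byte segments fit and the third stays queued.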
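static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now,
        int nonagle, int push_one, gfp_t gfp)
{
    /* declarations omitted */
    while ((skb = tcp_send_head(sk))) {
        /* stop when the congestion window is full */
        cwnd_quota = tcp_cwnd_test(tp, skb);
        if (!cwnd_quota)
            break;
        /* stop when the receiver's window is too small */
        if (!tcp_snd_wnd_test(tp, skb, mss_now))
            break;
        /* TSO, segmentation, etc. */
        /* third argument 1: pass a clone of the skb to the lower layers */
        if (tcp_transmit_skb(sk, skb, 1, gfp))
            break;
    }
    return true;
}
4.4 Network Layer – ip_queue_xmit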
The skb reaches the IP layer where routing lookup, IP header construction, and netfilter processing happen before handing the packet to dst_output.
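int ip_queue_xmit(struct sk_buff *skb, struct flowi *fl)
{
    struct sock *sk = skb->sk;
    struct inet_sock *inet = inet_sk(sk);
    struct iphdr *iph;
    /* reuse the route cached on the socket if it is still valid */
    struct rtable *rt = (struct rtable *)__sk_dst_check(sk, 0);
    if (!rt)
        rt = ip_route_output_ports(...); /* full lookup, arguments elided */
    skb_dst_set_noref(skb, &rt->dst);
    /* fill in the IP header (most fields omitted) */
    iph = ip_hdr(skb);
    iph->protocol = sk->sk_protocol;
    iph->ttl = ip_select_ttl(inet, &rt->dst);
    /* netfilter LOCAL_OUT hook, then dst_output */
    return ip_local_out(skb);
}
4.5 Neighbour Subsystem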
Before the packet can be placed on the wire, the neighbour (ARP) cache is consulted to resolve the next‑hop MAC address. If missing, an ARP request is emitted.
static inline int dst_neigh_output(struct dst_entry *dst,
        struct neighbour *n,
        struct sk_buff *skb)
{
    /* fast path through the cached hardware header omitted */
    return n->output(n, skb);
}
4.6 Device Queueing and qdisc
The packet is enqueued on the device’s transmit queue. The queueing discipline (qdisc) may bypass the scheduler or enqueue the skb for later transmission.
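The kernel bounds how much queued work a sending process performs before handing the rest to a softirq. The sketch below models that idea only: struct toy_qdisc and toy_qdisc_run are hypothetical names, TX_QUOTA = 64 is an assumption that mirrors the kernel's default weight_p, and the kernel's additional need_resched() check is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TX_QUOTA 64   /* assumption: mirrors the kernel's default weight_p */

struct toy_qdisc {
    size_t backlog;         /* packets waiting in the queue */
    bool   softirq_raised;  /* set when leftover work is deferred */
};

/*
 * Drain the queue in the caller's context, but stop after TX_QUOTA packets
 * and defer whatever is left. Returns the packets sent by this call.
 */
static size_t toy_qdisc_run(struct toy_qdisc *q)
{
    int quota = TX_QUOTA;
    size_t sent = 0;

    while (q->backlog > 0) {
        q->backlog--;          /* dequeue one packet, hand it to the driver */
        sent++;
        /* More work remains and the budget is spent: defer the rest. */
        if (q->backlog > 0 && --quota <= 0) {
            q->softirq_raised = true;
            break;
        }
    }
    assert(sent <= TX_QUOTA);
    return sent;
}
```

A backlog of 10 packets is sent entirely in the caller's context. A backlog of 100 sends 64, leaves 36 queued, and sets softirq_raised, which is the point at which the remaining work would be accounted to softirq time rather than to the sending process.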
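int dev_queue_xmit(struct sk_buff *skb)
{
    struct net_device *dev = skb->dev;
    /* choose one of the device's transmit queues */
    struct netdev_queue *txq = netdev_pick_tx(dev, skb);
    /* the qdisc attached to that queue */
    struct Qdisc *q = rcu_dereference_bh(txq->qdisc);
    /* simplified: the kernel does these two steps inside __dev_xmit_skb */
    if (q->enqueue) {
        q->enqueue(skb, q);
        __qdisc_run(q);
    }
    return 0;
}
4.7 Driver Transmission – igb_xmit_frame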
The driver’s ndo_start_xmit implementation (igb_xmit_frame) maps the skb’s data to DMA, fills the hardware descriptors, and notifies the NIC.
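static netdev_tx_t igb_xmit_frame(struct sk_buff *skb,
        struct net_device *netdev)
{
    /* simplified: the kernel splits this work with igb_xmit_frame_ring */
    struct igb_adapter *adapter = netdev_priv(netdev);
    /* pick the transmit ring that matches the skb's queue mapping */
    struct igb_ring *tx_ring = igb_tx_queue_mapping(adapter, skb);
    /* next free igb_tx_buffer slot in that ring */
    struct igb_tx_buffer *first = &tx_ring->tx_buffer_info[tx_ring->next_to_use];
    u8 hdr_len = 0;
    first->skb = skb;
    first->bytecount = skb->len;
    /* offload setup omitted; map the skb for DMA and fill descriptors */
    igb_tx_map(tx_ring, first, hdr_len);
    return NETDEV_TX_OK;
}
igb_tx_map creates DMA mappings for the skb's linear data and for each page fragment, writes the descriptors, issues a write memory barrier, and then updates the ring's tail register so that the NIC fetches the new descriptors.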
4.8 Completion – Hard and Soft IRQs
When the NIC finishes transmitting, it raises a hard interrupt. The interrupt handler schedules a soft IRQ of type NET_RX_SOFTIRQ (yes, even for transmit completions). The soft IRQ runs igb_poll, which calls igb_clean_tx_irq to free the transmitted skb (the clone that TCP handed down), unmap its DMA buffers, and clear the ring buffer entry.
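static bool igb_clean_tx_irq(struct igb_q_vector *q_vector)
{
    /* loop over completed descriptors omitted; tx_buffer is the current entry */
    /* free the skb that was handed to the driver */
    dev_kfree_skb_any(tx_buffer->skb);
    /* clear the ring entry (the DMA unmap call itself is omitted) */
    tx_buffer->skb = NULL;
    dma_unmap_len_set(tx_buffer, len, 0);
    /* advance ring pointers */
    return true;
}
The skb freed here is the clone. The original stays on the socket's write queue and is released only after the peer's ACK arrives, which is what allows TCP to retransmit it if needed.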
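The lifetime of the two skbs can be illustrated with a reference-counted toy model. Everything here is hypothetical (struct payload, struct toy_skb, the toy_* functions, and the live_payloads counter); the real sk_buff is far richer, but the sharing idea is the same: a clone is a new descriptor pointing at the same bytes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct payload {             /* packet bytes, shared between descriptors */
    int refs;
    size_t len;
    unsigned char bytes[];
};

struct toy_skb {             /* small descriptor pointing at the payload */
    struct payload *data;
};

static int live_payloads;    /* payloads not yet freed, for demonstration */

/* Allocate a descriptor plus payload and copy the caller's bytes in. */
static struct toy_skb *toy_alloc(const void *src, size_t len)
{
    struct toy_skb *skb = malloc(sizeof *skb);
    struct payload *p = malloc(sizeof *p + len);
    if (!skb || !p) {
        free(skb);
        free(p);
        return NULL;
    }
    p->refs = 1;
    p->len = len;
    memcpy(p->bytes, src, len);   /* the only full data copy */
    skb->data = p;
    live_payloads++;
    return skb;
}

/* Shallow copy: new descriptor, same payload. */
static struct toy_skb *toy_clone(const struct toy_skb *orig)
{
    struct toy_skb *c = malloc(sizeof *c);
    if (!c)
        return NULL;
    c->data = orig->data;         /* share the bytes, do not copy them */
    c->data->refs++;
    return c;
}

/* Drop one descriptor; the payload goes away with the last reference. */
static void toy_free(struct toy_skb *skb)
{
    assert(skb->data->refs > 0);
    if (--skb->data->refs == 0) {
        free(skb->data);
        live_payloads--;
    }
    free(skb);
}
```

Freeing the clone (the transmit-completion step) leaves the payload intact because the original still references it. Only freeing the original (the step that corresponds to the ACK arriving) releases the bytes.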
Answers to the Opening Questions
CPU accounting (sy vs si): Most of the send work runs in the calling process’s kernel mode (shown in sy). Only when the transmit quota in __qdisc_run is used up, or another task needs the CPU, does the kernel raise NET_TX_SOFTIRQ and finish the remaining work in softirq context (si).
Why NET_RX softirqs dominate: Both packet reception and transmit‑completion notifications trigger NET_RX_SOFTIRQ, while normal data transmission stays in process context, so NET_TX counts are much lower.
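Memory copies involved: (1) User buffer → skb, a full copy of the data into kernel memory. (2) skb clone for transmission, a shallow copy that duplicates only the sk_buff descriptor and shares the payload, so the original can be kept for possible retransmission. (3) An optional copy during IP fragmentation when the skb exceeds the MTU.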
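The NET_TX versus NET_RX imbalance can be observed directly in /proc/softirqs, which has one row per softirq type and one column per CPU. The following sketch sums a row; the function names are made up for this example, and the fixed 4096-byte line buffer is an assumption that may be too small on machines with very many CPUs.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum the per-CPU counters on one /proc/softirqs row, for example
 * "      NET_RX:   12   34". Returns -1 if the row is not `name`.
 */
static long long softirq_row_total(const char *line, const char *name)
{
    assert(line && name);
    while (*line == ' ' || *line == '\t')
        line++;
    size_t n = strlen(name);
    if (strncmp(line, name, n) != 0 || line[n] != ':')
        return -1;

    long long total = 0;
    const char *p = line + n + 1;
    for (;;) {
        char *end;
        long long v = strtoll(p, &end, 10);
        if (end == p)
            break;                /* no more numbers on this row */
        total += v;
        p = end;
    }
    return total;
}

/* Scan a stream in /proc/softirqs format; returns 0 if both rows exist. */
static int read_net_softirqs(FILE *f, long long *tx, long long *rx)
{
    char line[4096];
    *tx = *rx = -1;
    while (fgets(line, sizeof line, f)) {
        long long v;
        if ((v = softirq_row_total(line, "NET_TX")) >= 0)
            *tx = v;
        else if ((v = softirq_row_total(line, "NET_RX")) >= 0)
            *rx = v;
    }
    return (*tx >= 0 && *rx >= 0) ? 0 : -1;
}
```

To use it on a live system, open /proc/softirqs with fopen, pass the stream to read_net_softirqs, and compare the two totals before and after a send-heavy workload.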
Understanding this flow equips developers to pinpoint performance bottlenecks, tune queueing disciplines, adjust NIC offloads, or implement zero‑copy techniques where feasible.
ITPUB
