
Understanding Linux Network I/O, Socket Buffers, Ring Buffers, and Zero‑Copy Techniques

This article explains the Linux network I/O architecture: how TCP sends and receives data, the role of socket buffers and SKB structures, QDisc and the NIC ring buffer, and a comparison of zero‑copy techniques such as read/write, mmap/write, sendfile, splice, tee, DPDK, and DMA gather.

Qunar Tech Salon

Brief Introduction

This is the second article in the Linux I/O series. The previous article covered disk I/O and its partial zero‑copy techniques. This article discusses the structure of Linux network I/O and the much‑discussed zero‑copy techniques.

Socket Send and Receive Process

In Linux, a socket is the kernel abstraction for TCP/UDP; the following discussion focuses on TCP.

How TCP Sends Data

Figure 1

Figure 2

The application calls write / send , entering kernel space.

The kernel creates a sk_buff chain; each sk_buff can hold up to MSS bytes, effectively splitting the data. This chain is the socket send buffer.

The kernel checks the congestion window and the receiver window to see if the peer can accept more data, then creates a packet (TCP segment), adds the TCP header and performs TCP checksum.

IP routing is performed, the IP header is added, and IP checksum is calculated.

The packet is queued via QDisc and placed into the NIC driver’s Ring Buffer (Tx.ring) output queue.

The NIC driver uses the DMA engine to copy the packet from system memory to the NIC’s own memory.

The NIC appends an Inter‑Frame Gap, preamble and CRC to the frame, then generates an interrupt to notify the kernel that the packet has been sent.
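The send buffer that the steps above describe is visible from user space through the SO_SNDBUF socket option. A minimal Python sketch (the exact sizes are system‑dependent; on Linux the kernel reports double the requested value to account for bookkeeping overhead):

```python
import socket

# Create a TCP socket; the kernel sets up its send buffer configuration
# immediately, even before any data is written.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# SO_SNDBUF reports the socket send buffer size.
sndbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
print("send buffer:", sndbuf, "bytes")

# Request a larger send buffer; on Linux the kernel clamps the request
# to net.core.wmem_max and then doubles it for internal overhead.
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 256 * 1024)
sndbuf_after = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
print("after setsockopt:", sndbuf_after, "bytes")

s.close()
```

Data queued with send() accumulates in this buffer until the congestion and receiver windows allow the TCP layer to transmit it.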

How TCP Receives Data

Figure 3

Figure 4

(From bottom to top)

When a packet arrives, the NIC writes it into its own memory and verifies the CRC.

The NIC uses DMA to copy the packet into a pre‑allocated kernel buffer (the sk_buff linear buffer).

The link layer validates the packet and extracts the upper‑layer protocol.

The IP layer validates the IP checksum.

The TCP layer validates the TCP checksum.

Using the port information in the TCP control block, the kernel finds the corresponding socket and places the data into the socket receive buffer (the TCP receive window).

When the application calls read , the kernel copies data from the socket receive buffer to user space and removes it from the buffer.
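The whole round trip, from send buffer to receive buffer to user space, can be exercised over the loopback interface; a minimal sketch:

```python
import socket

# A loopback connection: data written by the sender traverses the full
# path described above and lands in the peer's socket receive buffer.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))          # port 0: let the kernel pick a free port
server.listen(1)

client = socket.create_connection(server.getsockname())
peer, _ = server.accept()

client.sendall(b"hello")               # copied into the client's send buffer

# recv copies the data from the kernel receive buffer into user space
# and removes it from the buffer.
data = peer.recv(1024)
print(data)

for sock in (client, peer, server):
    sock.close()
```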

1 Key Structures of Each Layer

1.1 Socket Layer's Socket Buffer

A socket is an abstract endpoint that applications can open, read, write, and close like a file.

Data written by the application ends up in a structure called the Socket Buffer.

1.1.1 Logical Concept

The Socket Buffer refers collectively to the send buffer and the receive buffer.

Send Buffer : After a process calls send() , the kernel copies the data into the socket’s send buffer. The TCP layer is responsible for delivering the data.

Receive Buffer : The TCP/UDP layer stores incoming network data here until the application reads it with recv() .
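On Linux, the per‑socket minimum, default, and maximum sizes of these buffers are controlled by sysctls. A small sketch that reads them (assumes a Linux /proc filesystem and returns None elsewhere):

```python
from pathlib import Path

def read_sysctl(name: str):
    """Read a sysctl value from /proc/sys, or None when unavailable."""
    path = Path("/proc/sys") / name.replace(".", "/")
    try:
        return path.read_text().split()
    except OSError:
        return None

# tcp_rmem / tcp_wmem each hold three values: min, default, and max
# bytes for the receive and send buffers of TCP sockets.
print("tcp_rmem:", read_sysctl("net.ipv4.tcp_rmem"))
print("tcp_wmem:", read_sysctl("net.ipv4.tcp_wmem"))
```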

1.1.2 SKB Data Structure (Linear Buffer)

The design of the Socket Buffer must satisfy two requirements:

Preserve the actual data transmitted on the network.

Minimize copies as data traverses protocol layers.

To achieve this, the kernel allocates sk_buff objects with alloc_skb and reserves headroom with skb_reserve , so that each protocol layer can prepend its header into the reserved space without moving the payload. The sk_buff chain is a linear buffer that can be passed between layers without extra copying.
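The headroom idea can be modeled in a few lines of Python. This is a toy illustration of the technique, not kernel code: each layer prepends its header into reserved space, and the payload is never copied.

```python
class ToySkb:
    """A toy model of sk_buff headroom (illustration only, not kernel code)."""
    def __init__(self, size: int, headroom: int):
        self.buf = bytearray(size)
        self.data = headroom          # like skb_reserve(): leave room for headers

    def put(self, payload: bytes):
        # Place the payload after the reserved headroom (like skb_put()).
        self.buf[self.data:self.data + len(payload)] = payload
        self.tail = self.data + len(payload)

    def push(self, header: bytes):
        # Prepend a header into the headroom (like skb_push()); the payload
        # itself is never moved or copied.
        self.data -= len(header)
        self.buf[self.data:self.data + len(header)] = header

skb = ToySkb(size=2048, headroom=64)
skb.put(b"payload")
skb.push(b"TCP|")                     # TCP layer prepends its header
skb.push(b"IP|")                      # IP layer prepends its header
print(bytes(skb.buf[skb.data:skb.tail]))   # b'IP|TCP|payload'
```

As the packet descends the stack, each layer only moves the data pointer backward, which is why the chain can traverse all protocol layers with zero copies.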

1.1.3 Summary (Important for Zero‑Copy Understanding)

sk_buff is created in only two situations:

When the application writes data to a socket.

When a packet arrives at the NIC.

Data is copied only twice:

Between user space and kernel space (socket read / write ).

Between sk_buff and the NIC (DMA copy).

1.1.4 Misconceptions

According to "Unix Network Programming, Volume 1", section 2.11.2, TCP sockets contain both send and receive buffers, while UDP sockets contain only a receive buffer. In reality, UDP also has a write memory limit ( udp_wmem_min ) and a read memory limit ( udp_rmem_min ), as described in the man page udp(7) .

1.2 QDisc

QDisc (queueing discipline) sits between the IP layer and the NIC's ring buffer. It implements traffic control; its queue length is set by txqueuelen and is associated with the network device.

1.3 Ring Buffer

1.3.1 Introduction

A ring buffer is a fixed‑size FIFO queue whose head and tail are linked, providing lock‑free access and avoiding frequent memory allocation. In this article, "Ring Buffer" refers to the NIC driver queue that lies between the NIC hardware and the protocol stack.

The ring buffer serves two main purposes:

Smooth out the speed mismatch between producer and consumer.

Through NAPI, coalesce interrupts to reduce IRQ frequency.

1.3.2 Ring Buffer Misconceptions

Although the name contains "Buffer", the ring buffer is actually a queue of descriptors that does not store packet data itself, so no data copy occurs inside the ring buffer.

2 Summary of Network I/O Structure

All data in kernel space is stored in the Socket Buffer.

The Socket Buffer is the sk_buff chain, created only when data is written to a socket or when a packet reaches the NIC.

sk_buff is allocated with alloc_skb and skb_reserve , and its size is limited by the MTU.

Data is copied only twice: user space ↔ sk_buff , and sk_buff ↔ NIC.

3 Zero‑Copy in Network I/O

3.1 DPDK

DPDK provides a user‑space driver that bypasses kernel interrupts. Packets are transferred directly between the NIC and user space via DMA, eliminating the kernel‑space copy. DPDK relies on the Userspace I/O (UIO) framework to map device memory and handle interrupts.

3.1.1 DPDK Drawbacks

Using DPDK requires reimplementing large portions of the IP stack in user space, which means high development effort.
4 Zero‑Copy Across Disk I/O and Network I/O

4.1 read + write

Typical flow: read copies data from the page cache into a user buffer, and write copies it from the user buffer into the socket buffer, from where it is DMA'd to the NIC. This involves 4 context switches, 2 CPU copies, and 2 DMA copies.

4.2 mmap + write

With mmap , the file is mapped into user space; a subsequent write copies data from the mapped region to the socket buffer (1 CPU copy). However, page faults may cause additional context switches for large files.

4.3 sendfile

sendfile performs a single system call, copying data from the page cache directly to the socket buffer (1 CPU copy). It is efficient for large files.

4.3.1 Differences Between sendfile, splice, and tee

sendfile : Copies data from a regular file descriptor to a socket; the input fd must be seekable.

splice : Requires at least one pipe endpoint; can move data between two file descriptors without copying to user space.

vmsplice : Maps user memory into a pipe without copying.

tee : Duplicates data from one pipe to another without copying.

4.4 sendfile + DMA Gather Copy

In theory, DMA gather could let sendfile avoid the copy from the kernel buffer to the socket buffer, with the NIC gathering pages directly from the page cache. In practice, Linux does not expose an API that fully implements this zero‑copy path.
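The mmap + write path described in 4.2 can be sketched with Python's mmap module. A temporary destination file stands in for the socket here; with a real socket, the write from the mapped region would be the single remaining CPU copy.

```python
import mmap
import os
import tempfile

# Create a source file standing in for the file to be transmitted.
with tempfile.NamedTemporaryFile(delete=False) as src:
    src.write(b"x" * 8192)
src_path = src.name

# mmap maps the file's page-cache pages into this process's address
# space, so no read() copy into an intermediate user buffer is needed.
with open(src_path, "rb") as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    with tempfile.NamedTemporaryFile(delete=False) as dst:
        dst.write(m)          # one CPU copy: mapped pages -> destination
    dst_path = dst.name

copied = os.path.getsize(dst_path)
print("bytes copied:", copied)
os.unlink(src_path)
os.unlink(dst_path)
```

Compared with read + write, this removes one CPU copy and one user‑space buffer, at the cost of page‑fault overhead when the mapped file is large.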

Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
