How Does a Linux Packet Travel from NIC to Your App? Unveiling Zero‑Copy Secrets
This article walks through every stage a network packet undergoes inside the Linux kernel—from hardware interrupt and driver processing, through the sk_buff structures of the TCP/IP stack, to user‑space delivery—while explaining zero‑copy mechanisms like sendfile, splice, mmap and io_uring, and offering concrete tuning commands for optimal performance.
Introduction
As a systems or operations engineer you constantly face network‑related performance problems, yet few engineers know the exact path a packet takes from the network card to a user‑space application. This guide dissects the Linux network stack, explains why zero‑copy can raise throughput severalfold while sharply cutting CPU usage, and provides practical tuning steps.
1. The Packet’s Journey Through the Kernel
1.1 NIC Reception – the first hardware interrupt
When a frame arrives, the NIC uses DMA to write it into a memory buffer referenced by the receive ring (RX ring), then raises a hardware interrupt (hard IRQ).
# Show interrupt counts for the NIC
cat /proc/interrupts | grep eth0
# Show NIC queue statistics
ethtool -S eth0 | head -20
1.2 Driver handling
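In NAPI‑based drivers, the hard‑IRQ handler does as little as possible: it acknowledges the interrupt, masks further RX interrupts for that queue, and schedules NAPI polling, deferring the actual packet processing to soft‑IRQ context.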
1.3 Soft‑IRQ (NET_RX_SOFTIRQ)
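Scheduling NAPI raises NET_RX_SOFTIRQ. In soft‑IRQ context, net_rx_action() calls the driver's poll function, which pulls frames off the RX ring, wraps them in sk_buff structures, and hands them to the network stack.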
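One way to check whether soft‑IRQ processing keeps up is /proc/net/softnet_stat, which has one line per CPU made of hexadecimal counters. The sketch below parses a single line; it assumes the commonly documented layout in which the first three columns are frames processed, frames dropped because the backlog queue was full, and time_squeeze (the poll loop stopped with work still pending). The struct and function names are illustrative, not kernel APIs.

```c
#include <assert.h>
#include <stdio.h>

/* Illustrative container for the first three softnet_stat columns. */
struct softnet_counters {
    unsigned int processed;     /* frames handled by this CPU */
    unsigned int dropped;       /* frames dropped: backlog queue was full */
    unsigned int time_squeeze;  /* poll loop stopped with work remaining */
};

/* Parse one line of /proc/net/softnet_stat (space-separated hex fields).
 * Returns 0 on success, -1 if fewer than three fields could be read. */
int parse_softnet_line(const char *line, struct softnet_counters *out)
{
    assert(line != NULL && out != NULL);
    if (sscanf(line, "%x %x %x",
               &out->processed, &out->dropped, &out->time_squeeze) != 3)
        return -1;
    return 0;
}
```

In practice each line would be read with fgets() and passed to this function. Steadily growing dropped or time_squeeze values suggest that the backlog or the per-poll budget (net.core.netdev_max_backlog, net.core.netdev_budget) deserves a look.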
1.4 Kernel network stack processing
netif_rx() → NET_RX_SOFTIRQ → __netif_receive_skb() → protocol handling → socket buffer
1.5 sk_buff structure
The core data container is struct sk_buff, which holds pointers into the packet buffer, per‑packet metadata, and the links that chain buffers into queues.
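The buffer pointers head, data, tail, and end are what let each protocol layer add or strip its header without copying the payload. The following is a user‑space toy model of that pointer arithmetic, not kernel code: the mini_skb type and the mini_skb_* helpers are hypothetical names that mimic the behavior of the kernel helpers skb_reserve(), skb_put(), skb_push(), and skb_pull().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model of the four sk_buff buffer pointers (illustration only). */
struct mini_skb {
    unsigned char *head;  /* start of the allocated buffer */
    unsigned char *data;  /* start of valid packet data */
    unsigned char *tail;  /* end of valid packet data */
    unsigned char *end;   /* end of the allocated buffer */
};

int mini_skb_alloc(struct mini_skb *skb, size_t size)
{
    skb->head = malloc(size);
    if (!skb->head)
        return -1;
    skb->data = skb->tail = skb->head;
    skb->end = skb->head + size;
    return 0;
}

/* Like skb_reserve(): leave headroom in an empty buffer for later headers. */
void mini_skb_reserve(struct mini_skb *skb, size_t len)
{
    assert(skb->data == skb->tail && (size_t)(skb->end - skb->tail) >= len);
    skb->data += len;
    skb->tail += len;
}

/* Like skb_put(): grow the data area at the tail; returns the old tail. */
unsigned char *mini_skb_put(struct mini_skb *skb, size_t len)
{
    unsigned char *old_tail = skb->tail;
    assert((size_t)(skb->end - skb->tail) >= len);
    skb->tail += len;
    return old_tail;
}

/* Like skb_push(): prepend a header by moving data toward head. */
unsigned char *mini_skb_push(struct mini_skb *skb, size_t len)
{
    assert((size_t)(skb->data - skb->head) >= len);
    skb->data -= len;
    return skb->data;
}

/* Like skb_pull(): strip a header by moving data toward tail. */
unsigned char *mini_skb_pull(struct mini_skb *skb, size_t len)
{
    assert((size_t)(skb->tail - skb->data) >= len);
    skb->data += len;
    return skb->data;
}

/* Number of valid bytes between data and tail. */
size_t mini_skb_len(const struct mini_skb *skb)
{
    return (size_t)(skb->tail - skb->data);
}
```

On the receive path each layer effectively pulls its own header and passes the same buffer upward; on the transmit path each layer pushes its header into the reserved headroom. The kernel's own structure, abridged, follows.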
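struct sk_buff {
    struct sk_buff *next;     // next buffer in the queue
    struct sk_buff *prev;     // previous buffer in the queue
    struct net_device *dev;   // device the packet arrived on or leaves through
    unsigned char *head;      // start of the allocated buffer
    unsigned char *data;      // start of the current protocol data
    unsigned char *tail;      // end of the current protocol data
    unsigned char *end;       // end of the allocated buffer
    // ... more fields
};
2. TCP/IP Stack Implementation in Linux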
2.1 Layered processing
Linux processes packets layer by layer, roughly following the TCP/IP model: data‑link (L2) → network (L3) → transport (L4) → socket. The physical layer is handled by the NIC hardware.
# Show Ethernet frame processing statistics
cat /proc/net/dev
2.2 IP layer
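Before routing a packet, the IP layer checks that the header is sane. The user‑space sketch below illustrates the kind of checks involved (version, header length, header checksum per RFC 1071); it is an approximation for illustration, not the kernel's code, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum over len bytes. Returns the inverted, folded
 * 16-bit one's-complement sum; a valid IPv4 header (checksum field
 * included) yields 0. */
uint16_t inet_checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (len & 1)
        sum += (uint32_t)buf[len - 1] << 8;   /* pad the odd byte with zero */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);   /* fold carries back in */
    return (uint16_t)~sum;
}

/* Returns 1 if the buffer starts with a plausible IPv4 header, else 0:
 * version 4, header length of at least 20 bytes that fits in the buffer,
 * and a correct header checksum. */
int ipv4_header_ok(const uint8_t *pkt, size_t len)
{
    if (len < 20)
        return 0;
    unsigned int version = pkt[0] >> 4;
    size_t ihl = (size_t)(pkt[0] & 0x0f) * 4;  /* header length in bytes */
    if (version != 4 || ihl < 20 || ihl > len)
        return 0;
    return inet_checksum(pkt, ihl) == 0;
}
```

Inside the kernel, the corresponding entry point is ip_rcv(), shown here in simplified form.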
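/* Simplified sketch of the IPv4 receive path; recent kernels split this
 * work across ip_rcv(), ip_rcv_core() and ip_rcv_finish(). */
int ip_rcv(struct sk_buff *skb, struct net_device *dev,
           struct packet_type *pt, struct net_device *orig_dev) {
    const struct iphdr *iph;
    // IP header validation
    if (!pskb_may_pull(skb, sizeof(struct iphdr)))
        goto inhdr_error;
    iph = ip_hdr(skb);
    // Routing lookup
    if (ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, dev))
        goto drop;
    // Deliver locally (toward the transport layer) or forward, as the route dictates
    return dst_input(skb);
inhdr_error:
drop:
    kfree_skb(skb);
    return NET_RX_DROP;
}
2.3 TCP state machine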
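TCP connection states are small integers, and they are visible from user space as well: the st column of /proc/net/tcp holds the state in hexadecimal. A minimal sketch of a decoder follows; the function name is hypothetical, and the table follows the numbering in the kernel's tcp_states.h.

```c
#include <assert.h>
#include <stddef.h>

/* Map a numeric TCP state (as found in the "st" column of /proc/net/tcp)
 * to a readable name. Unknown values map to "UNKNOWN". */
const char *tcp_state_name(unsigned int st)
{
    static const char *const names[] = {
        NULL,           /* 0 is not a valid state */
        "ESTABLISHED",  /* 1 */
        "SYN_SENT",     /* 2 */
        "SYN_RECV",     /* 3 */
        "FIN_WAIT1",    /* 4 */
        "FIN_WAIT2",    /* 5 */
        "TIME_WAIT",    /* 6 */
        "CLOSE",        /* 7 */
        "CLOSE_WAIT",   /* 8 */
        "LAST_ACK",     /* 9 */
        "LISTEN",       /* 10 (0x0A) */
        "CLOSING",      /* 11 */
        "NEW_SYN_RECV", /* 12 */
    };
    static_assert(sizeof names / sizeof names[0] == 13,
                  "one entry per state plus the unused slot 0");

    if (st == 0 || st >= sizeof names / sizeof names[0])
        return "UNKNOWN";
    return names[st];
}
```

For example, a socket shown with st equal to 0A is listening, and 01 is an established connection. The corresponding kernel enumeration begins as follows.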
enum tcp_states {
    TCP_ESTABLISHED = 1,
    TCP_SYN_SENT,
    TCP_SYN_RECV,
    TCP_FIN_WAIT1,
    TCP_FIN_WAIT2,
    TCP_TIME_WAIT,
    TCP_CLOSE,
    // ... more states
};
2.4 TCP tuning parameters
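A common rule of thumb sizes the maximum socket buffer to at least the bandwidth‑delay product (BDP) of the path, that is, link bandwidth multiplied by round‑trip time. The helper below works through that arithmetic; the function name and the example link figures are illustrative assumptions rather than values from a measured system.

```c
#include <assert.h>
#include <stdint.h>

/* Bandwidth-delay product in bytes for a link of mbit_per_s megabits per
 * second and a round-trip time of rtt_ms milliseconds. */
uint64_t bdp_bytes(uint64_t mbit_per_s, uint64_t rtt_ms)
{
    /* Guard against 64-bit overflow in the multiplication below. */
    assert(rtt_ms == 0 || mbit_per_s <= UINT64_MAX / 1000 / rtt_ms);
    /* Mbit/s * ms = kilobits; * 1000 gives bits; / 8 gives bytes. */
    return mbit_per_s * rtt_ms * 1000 / 8;
}
```

Under these assumptions, bdp_bytes(10000, 100) gives 125,000,000 bytes for a 10 Gbit/s path with a 100 ms round trip, which fits under the 134217728‑byte (128 MiB) ceiling used in the settings below.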
# TCP window tuning
echo 'net.core.rmem_max = 134217728' >> /etc/sysctl.conf
echo 'net.core.wmem_max = 134217728' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_rmem = 4096 87380 134217728' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_congestion_control = bbr' >> /etc/sysctl.conf
# Apply the new settings (requires root; bbr needs the tcp_bbr module)
sysctl -p
3. Zero‑Copy Techniques
3.1 Traditional I/O bottlenecks
Typical data flow involves multiple copies and context switches:
Disk → kernel buffer → user buffer → socket buffer → NIC
This results in four copies (two DMA transfers and two CPU copies), four user↔kernel mode switches (entering and returning from read() and write()), and high CPU usage.
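The flow above corresponds to the familiar read()/write() loop. As a baseline for comparison with the zero‑copy calls in the following sections, here is a minimal sketch of that loop; the function name and the 64 KiB buffer size are arbitrary choices for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Copy everything from in_fd to out_fd through a user-space buffer.
 * Returns the total number of bytes copied, or -1 on error. */
ssize_t copy_read_write(int in_fd, int out_fd)
{
    char buf[64 * 1024];
    ssize_t total = 0;

    assert(in_fd >= 0 && out_fd >= 0);
    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);  /* kernel -> user copy */
        if (n == 0)
            return total;                          /* end of input */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        /* write() may accept fewer bytes than requested, so loop. */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off)); /* user -> kernel copy */
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

Every chunk crosses the user/kernel boundary twice, once in each direction, and that is the work the zero‑copy system calls avoid.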
3.2 sendfile()
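#include <errno.h>
#include <sys/sendfile.h>
#include <sys/types.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
int send_file_zero_copy(int socket_fd, int file_fd, size_t file_size) {
    off_t offset = 0;
    ssize_t sent;
    while ((size_t)offset < file_size) {
        // sendfile() advances offset itself, so it must not be incremented again
        sent = sendfile(socket_fd, file_fd, &offset, file_size - (size_t)offset);
        if (sent < 0) {
            // EAGAIN occurs on non-blocking sockets; waiting with poll() beats spinning
            if (errno == EINTR || errno == EAGAIN) continue;
            return -1;
        }
        if (sent == 0) return -1; // file is shorter than file_size
    }
    return 0;
}
3.3 splice() and tee()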
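splice() moves data between two file descriptors when at least one of them is a pipe, so sending a file to a socket takes two calls with a pipe in between. The sketch below shows that pattern under simple assumptions (blocking descriptors, minimal error handling); the function name splice_copy is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Move up to len bytes from in_fd to out_fd through a temporary pipe.
 * Returns the number of bytes moved, or -1 on error. */
ssize_t splice_copy(int in_fd, int out_fd, size_t len)
{
    int p[2];
    size_t done = 0;
    int failed = 0;

    assert(in_fd >= 0 && out_fd >= 0);
    if (pipe(p) < 0)
        return -1;
    while (done < len && !failed) {
        /* Stage 1: source -> pipe (at most one pipe's worth per call). */
        ssize_t n = splice(in_fd, NULL, p[1], NULL, len - done, SPLICE_F_MOVE);
        if (n == 0)
            break;                          /* end of input */
        if (n < 0) {
            failed = (errno != EINTR);      /* retry only on EINTR */
            continue;
        }
        /* Stage 2: drain the pipe into the destination. */
        ssize_t left = n;
        while (left > 0) {
            ssize_t w = splice(p[0], NULL, out_fd, NULL, (size_t)left, SPLICE_F_MOVE);
            if (w < 0 && errno == EINTR)
                continue;
            if (w <= 0) {
                failed = 1;
                break;
            }
            left -= w;
        }
        done += (size_t)n;
    }
    close(p[0]);
    close(p[1]);
    return failed ? -1 : (ssize_t)done;
}
```

tee() is the companion call that duplicates pipe contents into a second pipe without consuming them. The prototypes of both calls are: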
ssize_t splice(int fd_in, loff_t *off_in, int fd_out,
               loff_t *off_out, size_t len, unsigned int flags);
ssize_t tee(int fd_in, int fd_out, size_t len, unsigned int flags);
3.4 mmap() based zero‑copy
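void *mapped = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);  // st filled in by fstat(fd, &st)
if (mapped == MAP_FAILED)
    return -1;  // assumes an enclosing function that returns int
// send() still copies the mapped pages into the socket buffer; only the read() copy is avoided
ssize_t sent = send(socket_fd, mapped, st.st_size, 0);
munmap(mapped, st.st_size);
3.5 Modern async I/O – io_uring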
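#include <liburing.h>
// io_uring has no sendfile operation; one option is to link two splice
// requests through a pipe (file → pipe → socket).
// chunk_len is assumed to fit in the pipe (64 KiB by default).
struct io_uring ring;
struct io_uring_sqe *sqe;
struct io_uring_cqe *cqe;
int pipefd[2];
pipe(pipefd);
io_uring_queue_init(QUEUE_DEPTH, &ring, 0);
sqe = io_uring_get_sqe(&ring);
io_uring_prep_splice(sqe, file_fd, 0, pipefd[1], -1, chunk_len, 0);
io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);  // start the next request only after this one succeeds
sqe = io_uring_get_sqe(&ring);
io_uring_prep_splice(sqe, pipefd[0], -1, socket_fd, -1, chunk_len, 0);
io_uring_submit(&ring);
io_uring_wait_cqe(&ring, &cqe);  // one completion per request; check cqe->res
io_uring_cqe_seen(&ring, cqe);
4. Performance Monitoring & Tuning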
4.1 Real‑time interrupt and soft‑irq stats
# Network interrupt distribution
cat /proc/interrupts | grep -E "(CPU|eth)"
# Soft‑irq statistics
watch -n 1 'cat /proc/softirqs | head -2 && cat /proc/softirqs | grep NET'
4.2 Queue depth and RPS
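Each rps_cpus file holds a hexadecimal bitmap of the CPUs allowed to process packets from that receive queue, with bit N standing for CPU N. The helper below is a small sketch of how such a mask can be derived from a list of CPU numbers; the function name is hypothetical, and the code is limited to CPUs 0 through 31 because larger systems use comma‑separated 32‑bit groups.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Build the hexadecimal CPU bitmap for rps_cpus from a list of CPU numbers.
 * Only CPUs 0-31 are supported in this sketch.
 * Returns 0 on success, -1 on invalid input or if out is too small. */
int rps_mask_hex(const int *cpus, size_t n, char *out, size_t out_len)
{
    uint32_t mask = 0;

    assert(out != NULL && out_len > 0);
    for (size_t i = 0; i < n; i++) {
        if (cpus[i] < 0 || cpus[i] > 31)
            return -1;
        mask |= UINT32_C(1) << cpus[i];   /* set the bit for this CPU */
    }
    int written = snprintf(out, out_len, "%x", (unsigned int)mask);
    return (written > 0 && (size_t)written < out_len) ? 0 : -1;
}
```

For CPUs 0 to 3 the result is f, and for CPUs 4 to 7 it is f0; the string is what gets written to a queue's rps_cpus file. The loop that follows prints the masks currently configured.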
for i in /sys/class/net/*/queues/rx-*/rps_cpus; do
    echo "$i: $(cat "$i")"
done
4.3 eBPF tracing example (TCP send latency)
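// timestamps is assumed to be a BPF hash map (key: u32 PID, value: u64 ns) defined elsewhere.
// This probe only records the entry time; a matching kretprobe would compute the latency.
SEC("kprobe/tcp_sendmsg")
int trace_tcp_sendmsg(struct pt_regs *ctx) {
    u64 ts = bpf_ktime_get_ns();
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    bpf_map_update_elem(&timestamps, &pid, &ts, BPF_ANY);
    return 0;
}
4.4 Production checklist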
Enable multi‑queue reception (hardware RSS, or software RPS where the NIC exposes a single queue).
Bind interrupts to specific CPUs.
Tune kernel network parameters (net.core.*, net.ipv4.tcp_*).
Adopt zero‑copy system calls (sendfile, splice, io_uring).
Continuously monitor latency, drops, and retransmissions.
5. Real‑World Impact
5.1 Web server benchmark
Using a 1 GB file:
Traditional read/write copy: ~2.3 s, CPU ≈ 85 %.
Zero‑copy (sendfile): ~0.8 s, CPU ≈ 12 %.
5.2 Typical zero‑copy use‑cases
Static file serving (Nginx, Caddy).
Reverse proxies (HAProxy, Envoy).
Message brokers (Kafka, Pulsar).
Database file transfer (MySQL, PostgreSQL).
Conclusion
By tracing a packet from the NIC through the Linux kernel’s layered processing, understanding the sk_buff data path, and applying zero‑copy system calls, engineers can dramatically reduce latency and CPU overhead. Combined with modern tools such as eBPF and io_uring, these techniques enable high‑throughput, low‑latency services in production environments.
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)
