
From write() to recv(): Tracing a Packet Through the Linux Kernel

This article walks through every stage a packet undergoes inside the Linux kernel—from the moment an application calls write() or send() to the final recv() call—covering socket handling, routing, ARP/NDP resolution, queuing, NIC offloads, and reassembly with concrete commands and code examples.

BirdNest Tech Talk

Simplified packet path

your app
 ↓ write()/send()
TCP (segments your bytes)
 ↓
IP (chooses where to send them)
 ↓
Neighbor/ARP (find next‑hop MAC)
 ↓
qdisc (queueing, pacing)
 ↓
driver/NIC (DMA to hardware)
 ↓
wire / Wi‑Fi / fiber
 ↓
NIC/driver (other host)
 ↓
IP (checks, decides it's for us)
 ↓
TCP (reassembles, ACKs)
 ↓
server app

Part One – Transmission: From write() to the wire

Step 1 – Application hands data to the kernel

Calling send() or write() on a TCP socket copies the user buffer into the kernel and queues it for transmission.

TCP splits large buffers into segments that fit the path. During the three‑way handshake each side advertises its Maximum Segment Size (MSS); the sender limits segment size to the peer’s MSS, the Path MTU, and any IP/TCP options (e.g., timestamps).

Each segment receives a sequence number so the receiver can reorder correctly.

Socket – a communication endpoint. The kernel tracks per‑socket state such as sequence number, congestion window, and timers.
TCP handshake – 1) SYN (options: MSS, SACK, window scaling, timestamps, ECN) 2) SYN‑ACK (options) 3) ACK. After the handshake the connection is ready for data; TLS runs on top of the established TCP stream.
Try it: ss -tni to watch the send and receive queues grow and shrink as data moves.
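The negotiated MSS is also visible from user space. A minimal sketch, assuming Linux (the `TCP_MAXSEG` socket option is Linux-specific): open a loopback connection and ask the kernel what segment size it settled on.

```python
import socket

# Create a loopback TCP connection; the three-way handshake (and MSS
# negotiation) happens inside connect()/accept().
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))      # port 0: the kernel picks a free port
srv.listen(1)

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())
conn, _ = srv.accept()

# Ask the kernel for the effective MSS on this established connection.
mss = cli.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)
print("negotiated MSS:", mss)   # very large on loopback (MTU 65536)

cli.close(); conn.close(); srv.close()
```

On loopback the MSS is huge because the lo MTU is 65536; over Ethernet you would typically see 1448 (1500 minus headers and the timestamp option).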

Step 2 – Routing decision

The kernel examines the destination IP and selects the most specific route. If the address belongs to a directly‑connected network, the packet is sent out that interface; otherwise it is handed to the default gateway.

Try it: ip route get 192.0.2.10 – the output shows the egress interface, next hop (if any), and source IP the kernel will use.
Policy routing: ip rule can query multiple routing tables based on source address, marks, etc. Most systems use the main table.
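The same source-address selection that `ip route get` reports can be observed from a program. A sketch using the connected-UDP trick: `connect()` on a UDP socket performs the routing lookup without sending any packet, and `getsockname()` then reveals the source IP the kernel chose. Loopback is used here so the example runs anywhere; point it at a public address to see your real egress IP.

```python
import socket

# connect() on a UDP socket triggers the routing lookup but sends
# nothing on the wire; the kernel binds the socket to the source
# address it would use for this destination.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.connect(("127.0.0.1", 9))
src_ip, src_port = s.getsockname()
print("kernel-selected source IP:", src_ip)
s.close()
```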

Step 3 – Neighbor discovery (ARP/NDP)

After routing selects the next hop, the kernel must resolve the link‑layer address.

If the neighbor cache already contains the MAC, the kernel proceeds.

Otherwise it broadcasts an ARP request (IPv4) or sends a multicast NDP solicitation (IPv6). The reply is cached.

Try it: ip neigh show – you will see entries such as 10.0.0.1 lladdr 00:11:22:33:44:55 REACHABLE.
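On Linux the IPv4 neighbor cache is also exposed as a fixed-column table in /proc/net/arp. A small sketch that parses that format; the sample line below is illustrative, not taken from a real machine.

```python
# Illustrative /proc/net/arp contents (header row + one entry).
SAMPLE = (
    "IP address       HW type     Flags       HW address            Mask     Device\n"
    "10.0.0.1         0x1         0x2         00:11:22:33:44:55     *        eth0\n"
)

def parse_arp(text):
    """Parse /proc/net/arp-style text into a list of neighbor entries."""
    entries = []
    for line in text.splitlines()[1:]:   # skip the header row
        fields = line.split()
        # Columns: IP, HW type, Flags, HW address, Mask, Device
        entries.append({"ip": fields[0], "mac": fields[3], "dev": fields[5]})
    return entries

print(parse_arp(SAMPLE))
```

On a real host, replace `SAMPLE` with `open("/proc/net/arp").read()`.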

Step 4 – Queueing discipline (qdisc)

Before the NIC transmits, the packet enters the per‑device qdisc, which can:

Smooth bursty traffic to avoid bufferbloat (large queues → high latency).

Share bandwidth fairly among flows.

Enforce shaping or rate‑limiting rules.

Try it: tc qdisc show dev eth0 or tc -s qdisc show dev eth0. Replace eth0 with your actual interface name.
MTU vs MSS – MTU is the maximum L2 payload (typical Ethernet 1500 B). MSS is the largest TCP payload (≈ MTU − 40 B for IPv4 without options). Both sides announce MSS during the handshake, and the sender never exceeds the advertised MSS or the Path MTU.
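The MSS arithmetic above can be sketched in a few lines: subtract the fixed IPv4 and TCP header sizes from the MTU, plus any TCP options in use (the timestamp option costs 12 bytes including padding).

```python
IPV4_HDR = 20   # bytes, IPv4 header without options
TCP_HDR = 20    # bytes, TCP header without options
TS_OPT = 12     # TCP timestamp option, including padding

def mss(mtu, timestamps=False):
    """Largest TCP payload per segment for a given link MTU."""
    return mtu - IPV4_HDR - TCP_HDR - (TS_OPT if timestamps else 0)

print(mss(1500))                   # 1460: the classic Ethernet figure
print(mss(1500, timestamps=True))  # 1448: what Linux typically sends
```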

Step 5 – NIC driver and hardware

The network driver places the packet into the NIC’s transmit ring. The NIC then uses DMA to fetch the bytes directly from RAM and converts them into bits on the wire (copper, fiber, or radio for Wi‑Fi).

Try it: ip -s link show dev eth0, then ethtool -S eth0, then ethtool -k eth0. Replace eth0 with your interface.
Offloads – TSO/GSO split large buffers into MTU‑sized frames; checksum offload lets the NIC compute IP/TCP checksums after the kernel hands the packet over; GRO merges many small incoming packets into larger chunks to save CPU.
DMA – Direct Memory Access lets the NIC read/write RAM over PCIe without CPU copies, enabling high‑throughput TX/RX.

Step 6 – On the wire

On Ethernet the NIC emits a frame:

[ dst MAC | src MAC | EtherType (IPv4) | IP header | TCP header | payload | FCS ]

Switches forward the frame based on the destination MAC. Routers examine the IP header, decrement TTL (or Hop Limit), recompute the IPv4 checksum, and forward the packet to the next hop. This repeats hop‑by‑hop until the packet reaches the destination LAN.

Frame vs packet – A packet is the IP‑level unit (IP header + transport header + payload). A frame is that packet encapsulated for a specific link layer with MAC addresses and a CRC.
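What each router does to the IP header can be sketched directly: decrement the TTL and recompute the header checksum (a one's-complement sum of 16-bit words, per the standard Internet checksum). The header values below are made up for illustration.

```python
import struct

def checksum(header: bytes) -> int:
    """Internet checksum: one's-complement sum of 16-bit words."""
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total > 0xFFFF:                    # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def forward(header: bytes) -> bytes:
    """Do one router hop: TTL-- and recompute the checksum."""
    hdr = bytearray(header)
    hdr[8] -= 1                              # TTL is byte 8
    hdr[10:12] = b"\x00\x00"                 # zero checksum field first
    hdr[10:12] = struct.pack("!H", checksum(bytes(hdr)))
    return bytes(hdr)

# A 20-byte header: ver/IHL 0x45, TOS 0, length 40, ID 1, flags 0,
# TTL 64, proto 6 (TCP), checksum 0 (filled below), src, dst.
hdr = bytearray(struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1, 0, 64, 6, 0,
                            bytes([10, 0, 0, 2]), bytes([192, 0, 2, 10])))
hdr[10:12] = struct.pack("!H", checksum(bytes(hdr)))

out = forward(bytes(hdr))
print("TTL after one hop:", out[8])          # 63
print("header valid:", checksum(out) == 0)   # sum over a correct header is 0
```

The final check works because summing a header that already contains its correct checksum folds to 0xFFFF, whose complement is 0.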

Part Two – Reception: From the wire back to the application

Step 7 – NIC hands data to the kernel (NAPI)

The NIC writes received frames into its receive ring and raises an interrupt. Under Linux's NAPI, the driver then masks further interrupts and polls the ring to drain a batch of packets, re-enabling interrupts once the ring is empty.

NAPI reduces interrupt load: 1) one interrupt, 2) poll to drain many packets, 3) re‑enable interrupt.

Step 8 – IP layer validation and routing decision

The kernel validates the IP header (version, checksum, TTL, etc.) and determines whether the packet is destined for the local host.

If the destination IP matches a local address, the packet proceeds up the stack.

If not and IP forwarding is enabled, the kernel may forward it like a router.

Otherwise the packet is dropped.

Firewall hooks (PREROUTING, INPUT, OUTPUT, POSTROUTING) in nftables/iptables can filter, log, DNAT, or SNAT the traffic before it reaches the socket.

Try it: sudo nft list ruleset or sudo iptables -L -n -v and sudo iptables -t nat -L -n -v .
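The forward-or-drop decision above is gated by a sysctl. A sketch, assuming Linux with /proc mounted: read /proc/sys/net/ipv4/ip_forward to see whether this host will forward packets not addressed to it.

```python
# The kernel forwards non-local IPv4 packets only if this sysctl is 1.
# Equivalent to: sysctl net.ipv4.ip_forward
with open("/proc/sys/net/ipv4/ip_forward") as f:
    forwarding = f.read().strip()

print("IPv4 forwarding enabled:", forwarding == "1")
```

On most desktops and servers this is 0; routers and container hosts set it to 1.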

Step 9 – TCP reassembly, ACK, and wake‑up

TCP reorders segments, detects missing data, and sends ACKs. When the receive buffer contains data for the waiting process, the kernel wakes the process blocked in recv().

Try it: ss -tni 'sport = :80 or dport = :80' – the Recv‑Q grows as data arrives and shrinks as the application reads.
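The whole journey can be exercised end to end from user space. A minimal loopback sketch: one thread plays the server, the client calls send(), and the kernel does every step in between before recv() returns the bytes.

```python
import socket
import threading

MSG = b"hello world"

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

def server():
    conn, _ = srv.accept()
    data = b""
    while len(data) < len(MSG):      # recv() may return partial data
        data += conn.recv(4096)      # blocks until segments arrive
    conn.sendall(data)               # echo the bytes back
    conn.close()

t = threading.Thread(target=server)
t.start()

cli = socket.create_connection(srv.getsockname())
cli.sendall(MSG)                     # send(): bytes enter the kernel
echoed = b""
while len(echoed) < len(MSG):
    echoed += cli.recv(4096)         # recv(): bytes come back up
print(echoed)

t.join(); cli.close(); srv.close()
```

Note the receive loops: TCP is a byte stream, so a single recv() is never guaranteed to return everything that was sent.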

Quick practical checklist

Loopback

Packets to 127.0.0.1 never leave the host; the routing lookup still happens, but the data stays in kernel memory and is delivered through the lo interface.

Bridging vs routing

A bridge (e.g. br0) forwards Ethernet frames at layer 2 without touching the TTL. A router forwards at layer 3 and decrements the TTL by one at each hop.

NAT hairpin

Accessing a service via its public IP from inside the LAN requires hairpin NAT. If connections reset, inspect PREROUTING and POSTROUTING NAT rules.

IPv6

IPv6 replaces ARP with the Neighbor Discovery Protocol (NDP); the rest of the path is essentially the same.

ip -6 route
ip -6 neigh

UDP differences

UDP provides no ordering, retransmission, or congestion control. The send path uses udp_sendmsg; the receive path delivers whole datagrams, leaving loss handling to the application.
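The datagram model is easy to see on loopback, where delivery is effectively reliable. A sketch: each sendto() becomes exactly one datagram, and each recvfrom() returns exactly one, with message boundaries preserved (unlike TCP's byte stream).

```python
import socket

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
addr = rx.getsockname()

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"first", addr)    # one datagram
tx.sendto(b"second", addr)   # another datagram

d1, _ = rx.recvfrom(4096)    # returns b"first", never b"firstsecond"
d2, _ = rx.recvfrom(4096)    # returns b"second"
print(d1, d2)

tx.close(); rx.close()
```

Over a real network, either datagram could be lost or reordered; handling that is the application's job.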

Ten useful commands

# 1) Where would the kernel send a packet?
ip route get 192.0.2.10
# 2) What routes and rules exist?
ip route; ip rule
# 3) Who's my next hop?
ip neigh show
# 4) What's my firewall/NAT doing?
sudo nft list ruleset
# or:
sudo iptables -L -n -v
sudo iptables -t nat -L -n -v
# 5) Which sockets are active?
ss -tni
# 6) What's on the wire?
sudo tcpdump -ni eth0 -e -vvv 'host 192.0.2.10 and tcp port 80'
# 7) Are my queues healthy?
tc -s qdisc show dev eth0
# 8) Is my NIC happy?
ip -s link show dev eth0
ethtool -S eth0
# 9) Are counters hinting at a problem?
nstat -a | grep -E 'InErrors|OutErrors|InNoRoutes|InOctets|OutOctets'
# 10) Is the path MTU safe?
tracepath 192.0.2.10   # discovers PMTU via ICMP

Common failure modes

ARP/neighbor flapping – REACHABLE ↔ STALE indicates L2 reachability problems, VLAN tagging issues, or switch filtering.

MTU/PMTU black‑hole – Small pings succeed while large transfers stall; usually a mismatched MTU or blocked ICMP "Fragmentation Needed" (IPv4 Type 3 Code 4) / "Packet Too Big" (ICMPv6 Type 2) messages. Allow these ICMP messages through firewalls.

Reverse‑path filter – Asymmetric routing + rp_filter=1 drops return traffic. Use rp_filter=2 (loose) or make routing symmetric.

NAT surprises – Incorrect SNAT/MASQUERADE rewrites the source address, breaking replies. Verify NAT rules and conntrack -L.

Backlog pressure – Under heavy load new connections may reset. Increase the socket backlog and net.core.somaxconn so the application can accept promptly.

Bufferbloat – Oversized queues cause latency spikes. Switch the qdisc to fq_codel (or fq) and enable packet pacing if supported.

Kernel send path (TCP)

tcp_sendmsg
 → tcp_push_pending_frames
   → __tcp_transmit_skb
     → ip_queue_xmit
       → ip_local_out / ip_output
         → ip_finish_output
           → neigh_output
             → dev_queue_xmit
               → qdisc / sch_direct_xmit
                 → ndo_start_xmit (driver)

Kernel receive path (IPv4 TCP)

napi_gro_receive / netif_receive_skb
 → __netif_receive_skb_core
   → ip_rcv
     → ip_rcv_finish
       → ip_local_deliver
         → ip_local_deliver_finish
           → tcp_v4_rcv
             → tcp_v4_do_rcv
               → tcp_data_queue (wake reader)

Reference list

Socket – program handle for network I/O.

MTU / MSS – maximum link payload / maximum TCP payload.

ARP / NDP – resolve link‑layer address for IPv4 / IPv6.

qdisc – per‑device queueing policy (fairness, shaping).

NAPI – efficient receive: interrupt then batch poll.

TSO/GSO/GRO – offloads that split or merge packets to save CPU.

Conntrack – kernel flow table used by NAT and filtering.

PREROUTING / INPUT / OUTPUT / POSTROUTING – firewall hook points.

DMA – hardware reads/writes RAM without CPU copies.

TTL / Hop Limit – per‑packet counter decremented by each router; packet is dropped at zero.

FCS – frame check sequence (CRC) at the end of an Ethernet frame.

Source: https://www.0xkato.xyz/life-of-a-packet-in-the-linux-kernel/

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Kernel, TCP, Linux, System Programming, Networking, Packet Flow
Written by

BirdNest Tech Talk

Author of the rpcx microservice framework, original book author, and chair of Baidu's Go CMC committee.
