Why Two Copies Outperform One: Designing bpf_sock_splice_pair for High‑Speed TCP Loopback

The article examines the design of the new BPF function bpf_sock_splice_pair for intra‑host TCP communication, explains why a single‑copy implementation is suboptimal, introduces a ring‑buffer based two‑copy approach with optional busy‑polling, and presents benchmark results showing up to 7× throughput gains over the baseline.

Linux Kernel Journey

Jun 13, 2026

Modern infrastructure often carries intra‑host traffic over ordinary TCP sockets, incurring full‑stack overhead such as skb allocation, socket memory accounting, soft‑interrupt handling, loopback device processing, and the complete TCP receive path. A new BPF kfunc bpf_sock_splice_pair() pairs two locally connected TCP sockets at handshake time via a SOCKMAP program, allowing data to travel over a short kernel path while preserving normal TCP semantics (sequence numbers, FIN/RST, keepalive, and normal close). Applications need no code changes, new address families, or preloaded libraries.

First version: single‑copy design

The initial implementation targets the theoretical minimum of one memory copy. The receiver pins its user page in recvmsg() and publishes the iovec to the paired socket. The sender, in sendmsg(), copies the payload directly into the receiver’s page. A short timeout (≤1 ms) prevents deadlock; if the receiver does not publish in time, data falls back to the regular TCP path, allowing handshake‑style traffic (e.g., SSH banner, TLS hello) to make progress.
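The publish-then-copy handshake with its timeout can be sketched as a small user-space model. This is only an illustration under stated assumptions: the class RendezvousSlot and its methods are hypothetical names, a bytearray stands in for the pinned user page, and a threading.Event stands in for whatever signalling the kernel patch actually uses.

```python
import threading


class RendezvousSlot:
    """Toy stand-in for the receiver's published page (hypothetical name)."""

    def __init__(self):
        self._published = threading.Event()
        self._buf = None  # receiver-owned buffer, models the pinned user page

    def publish(self, buf: bytearray) -> None:
        """Receiver side: make a destination buffer visible to the sender."""
        self._buf = buf
        self._published.set()

    def send(self, payload: bytes, timeout_s: float = 0.001):
        """Sender side: one direct copy, or give up after timeout_s."""
        if not self._published.wait(timeout_s):
            return ("fallback", 0)  # caller would use the regular TCP path
        n = min(len(payload), len(self._buf))
        self._buf[:n] = payload[:n]  # the single copy
        self._published.clear()      # buffer consumed; receiver must publish again
        return ("direct", n)
```

In this model a send() issued before any publish() returns ("fallback", 0) after about 1 ms, which mirrors how handshake-style traffic keeps moving when no receiver is waiting.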

Zero‑copy is impossible with the standard socket API because the two processes reside in separate address spaces: send() receives a pointer to the sender’s buffer and recv() receives a pointer to the receiver’s buffer, so the kernel has to copy the data from one buffer to the other. Avoiding this copy would require breaking the API contract through shared memory, page remapping, or pipe‑based splice(), each of which changes the programming model.

Linux provides true zero‑copy facilities (MSG_ZEROCOPY with SO_ZEROCOPY on the send side and TCP_ZEROCOPY_RECEIVE on the receive side), but they require explicit application support and impose constraints, so they cannot be used transparently with unmodified send()/recv() calls.

Consequently, the single‑copy design hits a fundamental limit: at least one copy is required when using unmodified socket calls. Moreover, the design forces a synchronous rendezvous between sender and receiver, destroying batching opportunities. The sender must wait until the receiver has published a page, which stalls pipelines and reduces throughput, especially for bursty, asynchronous workloads.

Why a single copy is the wrong trade‑off

Minimizing copy count does not maximize throughput: without a buffer, the producer must wait for the consumer. Queueing theory shows that a buffer between producer and consumer is essential for high throughput; the extra copy incurred by a buffer is the price of decoupling.
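A toy discrete-time model makes the cost of the rendezvous concrete. It is a sketch under invented assumptions, not a measurement of the patch: time advances in ticks, each side is only available during part of an 8-tick cycle, and the function simulate and both availability patterns are made up for illustration.

```python
def simulate(ticks, producer_ready, consumer_ready, capacity):
    """Count delivered items. capacity=0 models a rendezvous: an item moves
    only in a tick where producer and consumer are both available."""
    queued = delivered = 0
    for t in range(ticks):
        p, c = producer_ready(t), consumer_ready(t)
        if capacity == 0:
            if p and c:
                delivered += 1
            continue
        if c and queued > 0:        # consumer drains one buffered item
            queued -= 1
            delivered += 1
        if p and queued < capacity:  # producer deposits one item if there is room
            queued += 1
    return delivered


# Hypothetical availability: producer active in ticks 0-3 of each 8-tick
# cycle, consumer active in ticks 2-5, so they overlap for only two ticks.
producer = lambda t: t % 8 < 4
consumer = lambda t: 2 <= t % 8 < 6

rendezvous = simulate(800, producer, consumer, capacity=0)
buffered = simulate(800, producer, consumer, capacity=4)
print(rendezvous, buffered)
```

With these patterns the buffered run delivers twice as many items as the rendezvous run, because every tick of producer availability is used instead of only the overlapping ones.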

Second version: a small ring buffer

The revised implementation introduces a per‑direction 16 KiB ring buffer (two buffers total). sendmsg() copies data into the head of the ring; recvmsg() copies from the tail. This adds a second copy but eliminates the rendezvous, allowing the producer to stay ahead of the consumer.
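The two copies can be modelled in user space with a fixed-size byte ring. The sketch below is an assumption-laden illustration rather than the kernel code: ByteRing is a hypothetical name, write() plays the role of the sendmsg() copy, read() plays the role of the recvmsg() copy, and a short write is where a real sender would fall back to the regular TCP path.

```python
class ByteRing:
    """Fixed-size byte ring; head and tail are free-running byte counters."""

    def __init__(self, size: int = 16 * 1024):
        self.buf = bytearray(size)
        self.size = size
        self.head = 0  # total bytes ever written (producer cursor)
        self.tail = 0  # total bytes ever read (consumer cursor)

    def write(self, data: bytes) -> int:
        """First copy: sender's buffer -> ring. Returns bytes accepted."""
        free = self.size - (self.head - self.tail)
        n = min(len(data), free)
        pos = self.head % self.size
        first = min(n, self.size - pos)          # bytes before the wrap point
        self.buf[pos:pos + first] = data[:first]
        self.buf[0:n - first] = data[first:n]    # remainder wraps to the start
        self.head += n
        return n

    def read(self, max_len: int) -> bytes:
        """Second copy: ring -> receiver's buffer."""
        n = min(max_len, self.head - self.tail)
        pos = self.tail % self.size
        first = min(n, self.size - pos)
        out = bytes(self.buf[pos:pos + first]) + bytes(self.buf[0:n - first])
        self.tail += n
        return out
```

Because write() never needs the reader to be present, several sends can accumulate in the ring before a single read drains them, which is the decoupling the second copy buys.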

The ring buffer is a single‑producer, single‑consumer structure using per‑CPU reference counting and cached cursor values to avoid lock contention. Each side works from a cached copy of the peer’s cursor and refreshes it only when its local view indicates the ring is full or empty, keeping the hot path off shared cache lines. Correctness is maintained by yielding to the regular TCP path when the peer’s TCP receive queue is non‑empty or the ring is full, ensuring that any data present in the TCP queue is older than data in the ring.
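One common way to realise cached cursors is for each side to keep a private copy of the peer's cursor and refresh it only when the ring looks full (producer) or empty (consumer). The model below follows that reading, which is an assumption about the patch rather than a description of it. SpscRing and the shared_reads counter are hypothetical, and a real implementation would also need acquire/release memory ordering that this single-threaded model omits.

```python
class SpscRing:
    """Single-producer/single-consumer ring with cached peer cursors."""

    def __init__(self, capacity: int):
        self.slots = [None] * capacity
        self.capacity = capacity
        self.head = 0          # written only by the producer
        self.tail = 0          # written only by the consumer
        self.cached_tail = 0   # producer's private copy of tail
        self.cached_head = 0   # consumer's private copy of head
        self.shared_reads = 0  # how often a side touched the peer's cursor

    def push(self, item) -> bool:
        if self.head - self.cached_tail == self.capacity:  # looks full locally
            self.cached_tail = self.tail                   # refresh from the peer
            self.shared_reads += 1
            if self.head - self.cached_tail == self.capacity:
                return False                               # really full
        self.slots[self.head % self.capacity] = item
        self.head += 1
        return True

    def pop(self):
        if self.cached_head == self.tail:                  # looks empty locally
            self.cached_head = self.head                   # refresh from the peer
            self.shared_reads += 1
            if self.cached_head == self.tail:
                return None                                # really empty
        item = self.slots[self.tail % self.capacity]
        self.tail += 1
        return item
```

Filling a 4-slot ring and then draining it touches the peer's cursor only at the full and empty boundaries, so shared_reads stays far below the number of push() and pop() calls.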

Busy‑polling on the ring

With the ring buffer in place, the receiver can optionally busy‑poll using the existing SO_BUSY_POLL socket option. A BPF helper (bpf_setsockopt()) sets a microsecond budget without requiring sysctl changes. Busy‑polling lets the receiver spin on the ring, eliminating wake‑up overhead for each request‑response pair and sharply reducing round‑trip latency.
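The spin-then-sleep idea behind a busy-poll budget can be shown with a user-space analogy. This is not how SO_BUSY_POLL is implemented in the kernel: wait_for_data, has_data, and block are hypothetical names, with has_data standing in for "the ring is non-empty" and block for the ordinary sleep-until-woken path.

```python
import time


def wait_for_data(has_data, block, budget_us=50, clock_ns=time.monotonic_ns):
    """Spin on has_data() for up to budget_us microseconds, then call block()."""
    deadline = clock_ns() + budget_us * 1_000
    while True:
        if has_data():
            return "polled"   # data arrived while spinning: no wake-up needed
        if clock_ns() >= deadline:
            break             # budget exhausted
    block()                   # fall back to sleeping until woken
    return "woken"
```

The budget bounds the CPU time burned per wait: a reply that arrives within it is picked up without a sleep and wake-up, and anything slower falls back to the normal blocking path.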

Benchmark results

All measurements use netperf with 1 KB request/response (TCP_RR) on adjacent CPUs. Tests were run on bare‑metal loopback (127.0.0.1) and in containers (two network namespaces linked by a veth pair and a Linux bridge). Results are the average of three 10‑second runs:

Bare‑metal loopback: baseline 105.8 k tps; splice without busy‑poll 235.1 k tps (2.2×); splice with 50 µs busy‑poll 713.0 k tps (6.7×).

Containers: baseline 99.9 k tps; splice without busy‑poll 233.9 k tps (2.3×); splice with 50 µs busy‑poll 704.9 k tps (7.0×).
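The transaction rates can also be read as time per round trip. The conversion below assumes one outstanding transaction per connection (the usual TCP_RR setup), so the mean time per request/response pair is simply the inverse of the rate. The dictionary just restates the figures above, and us_per_transaction is a hypothetical helper name.

```python
# Transactions per second (in thousands), copied from the results above.
TCP_RR_KTPS = {
    "bare-metal": {"baseline": 105.8, "splice": 235.1, "splice + busy-poll": 713.0},
    "containers": {"baseline": 99.9, "splice": 233.9, "splice + busy-poll": 704.9},
}


def us_per_transaction(ktps: float) -> float:
    """Mean time per request/response pair in microseconds."""
    return 1_000_000 / (ktps * 1_000)


for setup, row in TCP_RR_KTPS.items():
    for name, ktps in row.items():
        print(f"{setup:11s} {name:20s} {us_per_transaction(ktps):6.2f} us")
```

On the bare-metal figures this works out to roughly 9.5 µs per round trip at baseline, about 4.3 µs with the ring alone, and about 1.4 µs with busy-polling.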

Small messages benefit most (1‑byte requests achieve ~10× with busy‑poll). Larger messages (up to 64 KB) see diminishing returns as memory‑copy bandwidth becomes the bottleneck.

For TCP_STREAM (large‑payload streaming), bare‑metal loopback shows little gain because the kernel’s TSO already reduces per‑packet overhead, but container‑to‑container paths see up to 6× improvement due to avoided veth/bridge processing.

Comparison with AF_SMC

Linux already offers AF_SMC (Shared Memory Communications) with an SMC‑D variant for loopback. SMC‑D also uses a shared ring buffer but performs three copies (user→send buffer, send buffer→shared buffer, shared buffer→user) and lacks busy‑polling. Benchmarking shows SMC‑D achieves ~169 k tps, while the BPF ring buffer without busy‑poll reaches ~235 k tps (1.4×) and with busy‑poll ~713 k tps (≈4×), confirming the advantage of fewer copies and the optional busy‑poll.

Lessons learned

The key takeaway is that minimizing copy count is not the optimal goal; decoupling producer and consumer with a modest buffer enables batching, which yields far greater throughput gains. In the Linux kernel, copy costs are easy to measure, tempting developers to eliminate them, but structural improvements like buffering and busy‑polling provide larger, more systemic benefits.

The bpf_sock_splice_pair() implementation has been submitted as an RFC patch series (https://lore.kernel.org/all/[email protected]/) to the BPF and netdev mailing lists.

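Version 1 (single copy, rendezvous required):
    sender sendmsg() ----------- copy ----------> receiver's pinned page
                 (both sides must be present at the same moment)

Version 2 (two copies, decoupled):
    sender sendmsg() --copy--> [ ring ] --copy--> receiver recvmsg()
                     ^ accumulates across calls, so the sender can run ahead of the receiver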

