How to Boost Linux Pipe Throughput from 3.5 GiB/s to 65 GiB/s

Using a step‑by‑step example program, this article shows how to dramatically improve Linux pipe read/write performance—from an initial 3.5 GiB/s to 65 GiB/s—by applying zero‑copy techniques, ring buffers, paging insights, vmsplice/splice system calls, huge pages, and busy‑loop optimizations.

MaGe Linux Operations

Nov 14, 2022

This article demonstrates, through a concrete test program, how to optimise Linux pipe read/write performance, raising throughput from roughly 3.5 GiB/s to 65 GiB/s. The discussion covers zero‑copy operations, ring buffers, paging, virtual‑memory mapping, and synchronisation overhead, targeting developers working on high‑performance Linux applications or kernel‑level code.

Baseline Test

We start with a simple benchmark: the writer pushes 256 KiB blocks into a pipe with the standard write system call, and the reader consumes 10 GiB and reports throughput in GiB/s. On the author’s laptop the baseline speed is about 3.7 GiB/s, roughly ten times slower than a highly optimised FizzBuzz program that pushes ~36 GiB/s through a pipe.

# ./write | ./read
3.7GiB/s, 256KiB buffer, 40960 iterations (10GiB piped)
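
The write side of such a benchmark can be sketched as a loop that keeps calling write until a whole block has been accepted. This is only an illustration of the baseline approach, not the original program; the helper name write_all is an assumption.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

// Write exactly `len` bytes from `buf` to `fd`, retrying on short writes
// and on EINTR. Returns 0 on success, -1 on error (errno is left set).
// `write_all` is a hypothetical helper name, not part of the original program.
static int write_all(int fd, const char *buf, size_t len) {
  while (len > 0) {
    ssize_t ret = write(fd, buf, len);
    if (ret < 0) {
      if (errno == EINTR) continue;  // interrupted before any data moved
      return -1;
    }
    buf += ret;
    len -= (size_t)ret;
  }
  return 0;
}
```

A benchmark writer would call write_all(STDOUT_FILENO, buf, 1 << 18) in a loop, with the reader timing how long it takes to receive 10 GiB.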

Perf Analysis of write

Using perf record -g we see that roughly 48 % of the time is spent in __GI___libc_write, and within the kernel the dominant cost is pipe_write. About three‑quarters of that time is spent copying pages or allocating them. This is the first half of a double copy: write copies the data from user space into the pipe's pages (user→kernel), and the matching read copies it out again (kernel→user). Each call also pays for taking the pipe lock.

Pipe Internals

A Linux pipe is a circular buffer of struct pipe_buffer entries, each referencing a struct page. The buffer typically holds 16 slots (64 KiB with 4 KiB pages on x86‑64) and tracks head and tail indices. When the buffer is full, writers block; when empty, readers block.

struct pipe_inode_info {
  unsigned int head;
  unsigned int tail;
  struct pipe_buffer *bufs;
};

struct pipe_buffer {
  struct page *page;
  unsigned int offset, len;
};
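
To make the head/tail bookkeeping concrete, here is a toy user-space ring with the same shape. It is a sketch under simplifying assumptions (no locking, no pages, each slot stores only a length), and every name in it is hypothetical rather than taken from the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 16u  // must be a power of two so that masking works

// Toy stand-in for the kernel structures: each slot only records a length.
struct toy_ring {
  unsigned int head;  // next slot to write (free-running counter)
  unsigned int tail;  // next slot to read (free-running counter)
  size_t lens[RING_SLOTS];
};

static unsigned int ring_used(const struct toy_ring *r) {
  return r->head - r->tail;  // unsigned arithmetic survives wrap-around
}

static bool ring_full(const struct toy_ring *r) {
  return ring_used(r) == RING_SLOTS;
}

static bool ring_empty(const struct toy_ring *r) {
  return r->head == r->tail;
}

// Returns false at the point where a real writer would block.
static bool ring_push(struct toy_ring *r, size_t len) {
  if (ring_full(r)) return false;
  r->lens[r->head & (RING_SLOTS - 1)] = len;
  r->head++;
  return true;
}

// Returns false at the point where a real reader would block.
static bool ring_pop(struct toy_ring *r, size_t *len) {
  if (ring_empty(r)) return false;
  *len = r->lens[r->tail & (RING_SLOTS - 1)];
  r->tail++;
  return true;
}
```

Because head and tail only ever increase, the number of occupied slots is simply their difference, and the slot index is obtained by masking with the ring size minus one.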

Why write Is Slow

Each page is copied twice (user→kernel, kernel→user).

Copying occurs page‑by‑page (4 KiB each) with additional lock/unlock overhead.

Frequent page allocation leads to non‑contiguous memory.

Locking the pipe for every write adds synchronisation cost.
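
A quick back-of-the-envelope calculation shows the scale of the copying. Assuming 4 KiB pages, the 10 GiB benchmark moves 2,621,440 pages, each copied once on write and once on read. The helper below is purely illustrative and its name is an assumption.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096ull  // assumed 4 KiB base page size

// Number of page-sized copies needed to pipe `total_bytes` with write/read:
// one copy into the pipe on write, one copy out of it on read.
// `page_copies` is a hypothetical helper name.
static uint64_t page_copies(uint64_t total_bytes) {
  uint64_t pages = (total_bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
  return 2 * pages;
}
```

For the 10 GiB run this comes to 5,242,880 page copies, and a single 256 KiB write accounts for 64 pages on its own.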

Zero‑Copy with vmsplice and splice

The vmsplice system call moves user‑space buffers into the pipe without copying, while splice moves data from the pipe to another file descriptor without copying. Replacing write with vmsplice and read with splice eliminates the double copy.

ssize_t vmsplice(int fd, const struct iovec *iov, size_t nr_segs, unsigned int flags);

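Using a double‑buffered scheme (two 128 KiB halves) and a pipe size of 128 KiB, the program repeatedly splices each half into the pipe. Double buffering matters because vmsplice hands the pipe references to the user pages rather than copies, so a half must not be overwritten while the pipe still refers to it. With the pipe sized to exactly one half, a half that has been queued in full fills the pipe, which means the previous half has already been read.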

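#define _GNU_SOURCE  // needed for vmsplice
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void) {
  size_t buf_size = 1 << 18; // 256KiB
  char *buf = malloc(buf_size);
  if (buf == NULL) { perror("malloc"); return 1; }
  memset(buf, 'X', buf_size);
  // Two 128KiB halves, queued alternately.
  char *bufs[2] = { buf, buf + buf_size/2 };
  int buf_ix = 0;
  while (true) {
    struct iovec bufvec = { .iov_base = bufs[buf_ix], .iov_len = buf_size/2 };
    buf_ix = (buf_ix + 1) % 2;
    // vmsplice may accept only part of the buffer, so loop until all of it is queued.
    while (bufvec.iov_len > 0) {
      ssize_t ret = vmsplice(STDOUT_FILENO, &bufvec, 1, 0);
      if (ret < 0) { perror("vmsplice"); return 1; }
      bufvec.iov_base = (char*)bufvec.iov_base + ret;
      bufvec.iov_len -= ret;
    }
  }
}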
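
The listing above does not show how the pipe gets its 128 KiB size. One way to do it on Linux is fcntl with F_SETPIPE_SZ; the sketch below assumes that approach, and the helper name resize_pipe is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // F_SETPIPE_SZ and F_GETPIPE_SZ are Linux-specific
#endif
#include <fcntl.h>

// Ask the kernel to resize the pipe behind `fd` to at least `want` bytes.
// Returns the capacity the kernel actually chose, or -1 on error.
// `resize_pipe` is a hypothetical helper name.
static int resize_pipe(int fd, int want) {
  if (fcntl(fd, F_SETPIPE_SZ, want) < 0) return -1;
  return fcntl(fd, F_GETPIPE_SZ);
}
```

The kernel may round the request up, so it is worth checking the returned capacity rather than assuming it equals the requested value, for example after calling resize_pipe(STDOUT_FILENO, 1 << 17).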

Running the program with vmsplice yields 12.7 GiB/s; adding splice on the read side pushes it to 32.8 GiB/s.
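
The read side can be sketched in the same spirit: instead of copying into a user buffer with read, splice moves the pipe's contents straight to another file descriptor such as /dev/null. This is a minimal illustration, assuming the destination supports splice; the helper name drain_pipe is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // splice is Linux-specific
#endif
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

// Move up to `len` bytes from the pipe `in_fd` to `out_fd` without copying
// them through user space. Returns the bytes moved, or -1 on error.
// `drain_pipe` is a hypothetical helper name.
static ssize_t drain_pipe(int in_fd, int out_fd, size_t len) {
  size_t moved = 0;
  while (moved < len) {
    ssize_t ret = splice(in_fd, NULL, out_fd, NULL, len - moved, SPLICE_F_MOVE);
    if (ret < 0) return -1;
    if (ret == 0) break;  // the writer closed its end of the pipe
    moved += (size_t)ret;
  }
  return (ssize_t)moved;
}
```

A benchmark reader would open /dev/null for writing and call drain_pipe(STDIN_FILENO, devnull_fd, 1 << 18) repeatedly while counting the bytes returned.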

Improving Page Handling

Perf shows most remaining time is spent in iov_iter_get_pages, which converts user buffers into struct page objects via get_user_pages_fast. The conversion traverses the page‑table tree, which is costly.

Using Huge Pages

Allocating buffers with 2 MiB huge pages reduces the number of page‑table entries and speeds up get_user_pages_fast. The program allocates aligned memory and advises the kernel to use huge pages:

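// size should be a multiple of 2 MiB, as aligned_alloc requires
void *buf = aligned_alloc(1 << 21, size);
madvise(buf, size, MADV_HUGEPAGE);  // a hint; effective only if transparent huge pages are enabled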

With huge pages the throughput rises to 51.0 GiB/s.
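
A slightly fuller version of the allocation step might look like the following. It is a sketch that assumes 2 MiB huge pages and transparent huge page support; the helper name alloc_huge and the choice to pre-touch the memory are assumptions, not details from the original program.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // MADV_HUGEPAGE is Linux-specific
#endif
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE ((size_t)1 << 21)  // 2 MiB, assumed huge page size

// Allocate at least `size` bytes aligned to 2 MiB and hint that the kernel
// should back them with transparent huge pages. Returns NULL on failure.
// `alloc_huge` is a hypothetical helper name.
static void *alloc_huge(size_t size) {
  // aligned_alloc expects the size to be a multiple of the alignment.
  size_t rounded = (size + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
  void *buf = aligned_alloc(HUGE_PAGE_SIZE, rounded);
  if (buf == NULL) return NULL;
  // Only a hint: the call can fail (for example when THP is disabled),
  // in which case the buffer still works with regular pages.
  (void)madvise(buf, rounded, MADV_HUGEPAGE);
  // Touch the memory so pages are faulted in before any timing starts.
  memset(buf, 'X', rounded);
  return buf;
}
```

Whether the kernel actually used huge pages cannot be seen from the return value; the AnonHugePages counter in /proc/meminfo is one place to check.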

Busy‑Looping to Avoid Blocking

By default vmsplice blocks while the pipe is full. Called with SPLICE_F_NONBLOCK, it instead fails with EAGAIN (it returns -1 and sets errno). A tight busy‑loop that retries immediately avoids the overhead of blocking and being woken up again, raising throughput to 62.5 GiB/s.
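
The retry loop can be sketched as follows. It assumes the caller is happy to burn a CPU core while waiting, and the helper name vmsplice_spin is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // vmsplice is Linux-specific
#endif
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/uio.h>

// Queue `len` bytes starting at `base` into the pipe `fd`, spinning instead
// of sleeping while the pipe is full. Returns 0 on success, -1 on error.
// `vmsplice_spin` is a hypothetical helper name.
static int vmsplice_spin(int fd, char *base, size_t len) {
  struct iovec vec = { .iov_base = base, .iov_len = len };
  while (vec.iov_len > 0) {
    ssize_t ret = vmsplice(fd, &vec, 1, SPLICE_F_NONBLOCK);
    if (ret < 0) {
      if (errno == EAGAIN) continue;  // pipe full: retry immediately
      return -1;
    }
    vec.iov_base = (char *)vec.iov_base + ret;
    vec.iov_len -= (size_t)ret;
  }
  return 0;
}
```

The trade-off is explicit: the loop never yields the CPU, so it only makes sense when the reader is expected to drain the pipe almost immediately.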

Final Results and Takeaways

By progressively applying zero‑copy splicing, huge‑page allocation, and non‑blocking busy‑loops, throughput rises from the 3.7 GiB/s measured baseline to 62.5 GiB/s, roughly a 17× improvement. The article highlights key concepts such as zero‑copy, ring buffers, paging, and synchronisation costs, which are valuable for anyone working on high‑performance Linux I/O.

The code snippets are available on GitHub (https://github.com/bitonic/pipes-speed-test).

