How Does Linux System Call I/O Work? A Deep Dive into Read/Write, Buffers, and Performance
The article explains Linux’s traditional system‑call I/O path, detailing how read() and write() trigger multiple CPU and DMA copies and context switches, describes read and write workflows, explores network and disk I/O, examines the Linux I/O stack, page cache, buffering strategies, zero‑copy, mmap and Direct I/O, and discusses performance trade‑offs.
Traditional System Call I/O
In Linux the classic way to access files or sockets is through the read() and write() system calls. read() copies data from a kernel buffer into a user‑space buffer, while write() copies data from a user buffer into a kernel buffer that is later transmitted to the device.
For a read() followed by a write() (for example, relaying a file to a socket), the data path involves two CPU copies, two DMA copies and four context switches (user ↔ kernel on entry and exit of each call).
CPU copy : Direct memory‑to‑memory copy performed by the CPU.
DMA copy : The CPU programs the DMA engine to move data between main memory and the device, freeing the CPU during the transfer.
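Context switch : Transition from user mode to kernel mode when the system call is invoked, and back when it returns. Strictly speaking this is a mode switch, since the same process keeps running, but it still has a measurable cost.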
Read Operation
If the requested data is already present in the kernel's page cache, the kernel returns it directly from memory. Otherwise the data is first fetched from the storage device into the kernel's read buffer and then copied to the user buffer.
ssize_t n = read(fd, buf, len);
The traditional read path incurs two context switches, one CPU copy and one DMA copy:
User process calls read() → kernel entry (context switch).
CPU programs DMA to move data from the device (disk or NIC) into the kernel read buffer.
CPU copies data from the read buffer to the user buffer.
Kernel returns to user space (context switch).
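The read steps above can be exercised with a small helper. The following is a minimal sketch, not a definitive implementation; read_full is a hypothetical name introduced here for illustration. It loops because read() may return fewer bytes than requested, and every iteration pays for one kernel entry and exit plus one CPU copy into the user buffer.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: read up to len bytes from fd into buf.
 * Returns the number of bytes read (less than len only at end of file),
 * or -1 on error with errno set. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        /* Each call enters the kernel, which copies from its own
         * buffer into the caller's buffer before returning. */
        ssize_t n = read(fd, (char *)buf + done, len - done);
        if (n > 0) {
            done += (size_t)n;
        } else if (n == 0) {
            break;                  /* end of file */
        } else if (errno != EINTR) {
            return -1;              /* real error */
        }
        /* EINTR: interrupted by a signal before data arrived; retry. */
    }
    return (ssize_t)done;
}
```

Passing a larger buffer reduces the number of iterations, and therefore the number of user ↔ kernel transitions, for the same amount of data.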
Write Operation
When an application calls write(), data is first copied from the user buffer into the kernel's socket (or block) buffer, then DMA transfers it to the NIC or storage device.
ssize_t n = write(fd, buf, len);
The traditional write path also incurs two context switches, one CPU copy and one DMA copy:
User process invokes write() → kernel entry.
CPU copies data from the user buffer to the kernel socket/block buffer.
CPU programs DMA to move the kernel buffer to the NIC or disk.
Kernel returns to user space.
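The write steps above can be sketched in the same way. The helper name write_all is hypothetical. Note that a successful return only means the data reached the kernel buffer; the DMA transfer to the device may happen later.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write exactly len bytes from buf to fd.
 * Returns 0 on success, or -1 on error with errno set. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        /* The kernel copies from the user buffer into its own
         * socket/block buffer and may accept only part of it. */
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal; retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```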
PageCache and High‑Performance Optimizations
The OS maintains a PageCache – a cache of file contents stored in memory pages. It reduces disk I/O by serving reads directly from memory and by coalescing writes.
Read strategy : On a read() the kernel checks PageCache. If the data is present, it is copied directly to the user buffer (single copy). If not, the kernel schedules a disk read, loads the requested page plus a few following pages (read-ahead, typically 1–3 pages) into PageCache, and then copies the requested data to the user buffer.
Write strategy : Data written via write() first lands in PageCache and is marked dirty. Dirty pages are written back to disk, either by the kernel's background flusher or synchronously on request, when any of the following conditions occur:
Free memory falls below a configurable threshold.
Dirty pages have been resident for longer than the dirty‑expire timeout.
The application explicitly calls sync() or fsync().
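The explicit flush condition can be illustrated with a short sketch. The helper name write_durable is hypothetical, and a short write is treated as a failure only to keep the example brief.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write text to path and force it to the device.
 * Returns 0 on success, -1 on failure. */
int write_durable(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t len = strlen(text);

    /* write() returns once the data sits in dirty PageCache pages. */
    if (write(fd, text, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    /* fsync() blocks until this file's dirty pages and metadata
     * have been handed to the storage device. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Without the fsync() call the function would return sooner, but the data could still be lost on a power failure until writeback runs on its own.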
Typical high‑performance techniques that build on this foundation are:
Zero‑copy I/O (e.g., sendfile(), splice()).
I/O multiplexing (epoll, select, poll).
Direct I/O (bypassing PageCache).
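As an illustration of the zero-copy item in the list above, the following sketch moves a whole file to another descriptor with sendfile(), so the data never passes through a user-space buffer. The helper name send_whole_file is hypothetical. The sketch assumes in_fd refers to a regular file and that the kernel accepts out_fd as a sendfile() target (a socket, or any file on Linux 2.6.33 and later).

```c
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy the whole of in_fd to out_fd inside the kernel.
 * Returns the number of bytes transferred, or -1 on error. */
ssize_t send_whole_file(int out_fd, int in_fd)
{
    struct stat st;
    if (fstat(in_fd, &st) != 0)
        return -1;

    off_t offset = 0;               /* sendfile() advances this for us */
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0)
            return -1;
        if (n == 0)
            break;                  /* file shrank while sending */
    }
    return (ssize_t)offset;
}
```

Compared with a read()/write() loop, this removes both CPU copies across the user/kernel boundary and halves the number of system calls per chunk.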
Linux I/O Stack
The Linux I/O subsystem can be viewed as three logical layers:
Filesystem layer : Translates file operations into block I/O requests and manages the PageCache.
Block layer : Queues, merges and schedules block requests; applies an I/O scheduler (CFQ, deadline or noop on older kernels; mq-deadline, BFQ, Kyber or none on multi-queue kernels).
Device layer : Interacts with hardware drivers; uses DMA to move data between memory and the device.
Different I/O APIs map onto these layers:
Buffered I/O (standard read() / write()) uses the PageCache and incurs the copies described above.
Memory‑mapped I/O (mmap()) maps PageCache pages directly into the process address space, eliminating the second copy (kernel → user).
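Direct I/O (O_DIRECT) bypasses PageCache; data is transferred directly between user buffers and the block device via DMA. It generally requires the buffer address, the file offset and the I/O size to be aligned, typically to the device's logical block size; page‑aligned buffers and sizes that are multiples of the block size satisfy this on common setups.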
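To make the memory-mapped path concrete, here is a minimal sketch that scans a file through a read-only mapping instead of calling read(). The helper name count_newlines_mmap is hypothetical. The loop touches PageCache pages directly, so no separate user buffer is filled.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: count '\n' bytes in a file via a read-only mapping.
 * Returns the count, or -1 on error. */
long count_newlines_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap() rejects a zero length */
        close(fd);
        return 0;
    }

    const char *data = mmap(NULL, (size_t)st.st_size, PROT_READ,
                            MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (data == MAP_FAILED)
        return -1;

    long count = 0;
    for (off_t i = 0; i < st.st_size; i++)
        if (data[i] == '\n')        /* first touch of a page may fault it in */
            count++;

    munmap((void *)data, (size_t)st.st_size);
    return count;
}
```

The trade-off is that I/O errors and cache misses surface as page faults during the loop rather than as return values from a system call.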
I/O Buffering Overview
Three buffering levels are relevant to application performance:
Userspace stdio buffers : Functions such as fread() / fwrite() keep a private buffer to reduce the number of system calls. The buffer can be flushed manually with fflush() or disabled with setbuf() / setvbuf().
Kernel buffer cache (PageCache for file data, BufferCache for raw block data): Holds data between the filesystem and the block layer.
Device buffers : Hardware may have its own FIFO or cache; DMA moves data between main memory and these buffers.
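The first of these levels, the userspace stdio buffer, is easy to observe. The sketch below is an illustration under stated assumptions: stdio_buffer_demo and kernel_file_size are hypothetical names, and the 4096-byte buffer size is arbitrary. It writes a few bytes with fwrite() and records the file size the kernel reports before and after fflush().

```c
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: file size as currently known to the kernel. */
static long kernel_file_size(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 ? (long)st.st_size : -1;
}

/* Hypothetical demo: write 5 bytes through stdio and report the file
 * size seen by the kernel before and after fflush().
 * Returns 0 on success, -1 on failure. */
int stdio_buffer_demo(const char *path, long *before, long *after)
{
    static char buf[4096];          /* must outlive the stream */

    FILE *fp = fopen(path, "w");
    if (!fp)
        return -1;

    /* Full buffering: stdio calls write() only when buf fills up
     * or the stream is flushed or closed. */
    if (setvbuf(fp, buf, _IOFBF, sizeof buf) != 0 ||
        fwrite("hello", 1, 5, fp) != 5) {
        fclose(fp);
        return -1;
    }
    *before = kernel_file_size(path);   /* data is still in buf */

    if (fflush(fp) != 0) {
        fclose(fp);
        return -1;
    }
    *after = kernel_file_size(path);    /* data has reached the kernel */

    return fclose(fp) == 0 ? 0 : -1;
}
```

Note that fflush() only moves data from the stdio buffer into the kernel; reaching the device still depends on writeback or an explicit fsync().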
Understanding the interaction of these layers is essential for designing high‑performance applications. For example, using mmap() can eliminate the user‑kernel copy for reads, while O_DIRECT removes the PageCache copy for both reads and writes at the cost of stricter alignment requirements.
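The alignment requirement of O_DIRECT is the part that most often trips people up, so a short sketch may help. The helper name read_first_block_direct is hypothetical, and the 4096-byte alignment is an assumption chosen to satisfy common 512-byte and 4 KiB block sizes. Some filesystems reject O_DIRECT entirely, in which case the helper simply reports failure.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                 /* exposes O_DIRECT in <fcntl.h> on glibc */
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define DIRECT_ALIGN 4096           /* assumed alignment and transfer size */

/* Hypothetical helper: read the first DIRECT_ALIGN bytes of path,
 * bypassing PageCache. On success returns the byte count and stores an
 * aligned buffer in *out (caller frees it). Returns -1 on failure,
 * for example when the filesystem does not support O_DIRECT. */
ssize_t read_first_block_direct(const char *path, void **out)
{
    void *buf = NULL;

    /* Buffer address must be aligned; plain malloc() is not enough. */
    if (posix_memalign(&buf, DIRECT_ALIGN, DIRECT_ALIGN) != 0)
        return -1;

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    /* Offset 0 and length DIRECT_ALIGN are both aligned as well. */
    ssize_t n = read(fd, buf, DIRECT_ALIGN);
    close(fd);
    if (n < 0) {
        free(buf);
        return -1;
    }
    *out = buf;
    return n;
}
```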
When evaluating performance, typical metrics include:
Number of context switches per I/O operation.
Number of CPU copies (user ↔ kernel) and DMA copies (kernel ↔ device).
DMA transfer size and latency.
Cache hit ratio in PageCache.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
