Understanding Linux I/O: From Application Buffers to Disk Writes
This article explains the Linux I/O stack, the flow of data from user‑space buffers through libc buffers, page cache, and kernel buffers to the physical disk, covering functions like fwrite, fflush, fsync, O_DIRECT, scheduling algorithms, consistency issues, and performance considerations.
Introduction
Linux I/O is the foundation of file storage. The article consolidates basic Linux I/O concepts from various online sources.
Linux I/O Stack
Linux file I/O follows a layered design that provides clear architecture and functional decoupling.
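The application allocates a buffer and writes data with fwrite, which copies it into the libc (standard I/O) buffer. After fwrite returns, the data may still sit in the libc buffer; if the process terminates abnormally at that point (for example, it is killed by a signal or calls _exit), the data is lost because it never reached the kernel. A normal exit flushes open stdio streams.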
Calling fclose flushes the libc buffer to the page cache, but the kernel must also write the page cache back (e.g., via sync or fsync) before the data is persistent on the disk. fflush likewise only moves data from the libc buffer to the page cache; it does not write to the disk. sync hands the dirty pages to the disk's cache layer, and the disk controller decides when they actually reach the medium.
I/O Call Chain
Typical fwrite execution involves multiple copies before data reaches the disk. Direct system calls read / write bypass the libc buffer and copy data straight from the application buffer to the page cache, at the cost of a user‑to‑kernel mode switch.
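To bypass the page cache, open a file with O_DIRECT, so that write transfers data directly between the user buffer and the device. O_DIRECT generally requires the buffer address, file offset, and transfer length to be aligned to the device's block size. Writing directly to disk sectors is possible with RAW device access (e.g., fdisk, dd).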
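#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void foo(FILE *fp, const char *src) {
    char *buf = malloc(MAX_SIZE);    /* application buffer; MAX_SIZE defined elsewhere */
    if (buf == NULL)
        return;
    strncpy(buf, src, MAX_SIZE);     /* zero-pads if src is shorter than MAX_SIZE */
    fwrite(buf, MAX_SIZE, 1, fp);    /* application buffer -> libc buffer */
    fclose(fp);                      /* libc buffer -> page cache */
    free(buf);
}
I/O Scheduling Layer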
Tasks entering the I/O scheduler queue are not executed immediately; the scheduler aims to maximize overall disk I/O performance, often using an elevator algorithm that moves the disk head in one direction before reversing.
Linux provides several scheduler algorithms (e.g., noop, deadline, cfq; newer multi-queue kernels offer none, mq-deadline, bfq, and kyber instead). On SSDs, which have no moving heads, noop (or none) is usually appropriate.
After scheduling, the driver uses DMA to transfer data to the disk cache. The final write to the physical medium is controlled by the disk controller; invoking fsync forces the kernel to flush the data.
Consistency and Safety
5.1 Safety
If a process terminates abnormally, data still in the application or libc buffer is lost; data already in the page cache survives process termination. If the kernel crashes, any data not yet in the disk cache is lost. Power loss causes loss of all data not persisted to the disk.
5.2 Consistency
When the same process opens the same file multiple times and writes through separate file descriptors, each descriptor maintains its own file offset, so the writes land on the same positions and overwrite each other. Opening with O_APPEND makes each write start at the current end of the file, preventing overwrites.
int fd1 = open("file", O_RDWR|O_TRUNC);
int fd2 = open("file", O_RDWR|O_TRUNC);
while (1) {
    write(fd1, "hello\n", 6);
    write(fd2, "world\n", 6);
}
The same principle applies across processes: each process has its own descriptor table and file offsets, but with O_APPEND the kernel positions every write at the current end of the file, which all descriptors share.
5.3 Read Process
The Linux read path proceeds as follows:
libc read invokes the sys_read system call.
sys_read calls VFS functions such as vfs_read and generic_file_read.
generic_file_read checks the page cache; if the data is cached, it is copied to user memory and the call returns immediately.
If not cached, the kernel allocates a new page frame in the page cache to receive the data (for a memory-mapped file, the same path is entered through a page fault).
The kernel issues an I/O request to the generic block layer, abstracting disks, USB drives, etc.
The block layer wraps the request in a bio and places it on the I/O queue.
The I/O scheduler orders requests (e.g., using the elevator algorithm).
The driver sends a read command to the disk controller, using DMA to fill the new page frame.
The controller raises an interrupt upon completion.
The kernel copies the data from the page cache to user memory.
The waiting process is awakened and receives the data.
Performance Considerations
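Disk seeks are slow: at roughly 10 ms per seek, a drive manages only on the order of 100 seeks per second. Rotational speed also matters; a 15,000 rpm drive makes 250 revolutions per second (4 ms per revolution), so after a seek the head waits on average half a revolution, about 2 ms, for the target sector to pass underneath, and a full extra rotation if it just misses it.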
Typical sequential write speeds are up to about 30 MB/s for ordinary HDDs and up to 100 MB/s for high-performance drives; sequential reads reach up to about 50 MB/s on HDDs and up to about 400 MB/s on SSDs. Actual throughput varies widely with the hardware.
References
http://blog.chinaunix.net/uid-27105712-id-3270102.html?page=2
https://zhuanlan.zhihu.com/p/138371910
https://meik2333.com/posts/linux-many-proc-write-file/
https://blog.csdn.net/qq_43648751/article/details/104151401
Liangxu Linux
Liangxu, a self-taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge: fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc.
