Understanding the Linux I/O Stack, Call Chain, and Performance Characteristics
This article explains the layered design of Linux file I/O: how data moves from application buffers through libc, the page cache, and kernel I/O queues to disk; the synchronization primitives involved; consistency pitfalls; and performance factors such as scheduling algorithms and hardware characteristics.
Linux I/O is the foundation of file storage. This guide summarizes its core concepts, starting with the layered architecture, which gives the stack a clear structure and decouples each layer's responsibilities.
The I/O stack consists of multiple layers: the application buffer, the libc (standard I/O) buffer, the page cache, the kernel I/O scheduler, and finally the device driver that transfers data to the disk cache. Each layer adds a copy step, which improves modularity but also introduces latency.
Typical usage with fwrite creates an application buffer, copies data to the libc buffer, and then flushes it to the page cache. The following example demonstrates this flow:
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SIZE 4096

void foo(const char *src, FILE *fp) {
    char *buf = malloc(MAX_SIZE);     /* application buffer */
    strncpy(buf, src, MAX_SIZE);      /* fill it from the source string */
    fwrite(buf, MAX_SIZE, 1, fp);     /* copy into the libc (stdio) buffer */
    fclose(fp);                       /* flush the libc buffer to the page cache */
    free(buf);
}
```

Calling fclose flushes only the libc buffer into the page cache; to guarantee that the data reaches the disk, the kernel buffers must also be flushed with sync or fsync. Likewise, fflush only moves data from the libc buffer into the page cache, whereas fsync forces the page cache out to the physical medium.
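Putting the chain together, a minimal sketch of a write that is durable across a crash might look like this (the helper name write_durably and the error handling are my own additions, not from the original article):

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>   /* fsync */

/* Push data through every buffering layer down to the physical medium.
 * Returns 0 on success, -1 on failure. */
int write_durably(const char *path, const char *data, size_t len) {
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    /* Copy 1: application buffer -> libc (stdio) buffer. */
    if (fwrite(data, 1, len, fp) != len) {
        fclose(fp);
        return -1;
    }
    /* Copy 2: libc buffer -> kernel page cache. */
    if (fflush(fp) != 0) {
        fclose(fp);
        return -1;
    }
    /* Copy 3: page cache -> disk; blocks until the device reports completion. */
    if (fsync(fileno(fp)) != 0) {
        fclose(fp);
        return -1;
    }
    return fclose(fp);
}
```

Each step only guarantees delivery to the next layer down, which is why all three are needed for true durability.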
Direct writes can bypass the page cache by opening a file with the O_DIRECT flag, and raw device writes (e.g., using dd) bypass the filesystem entirely.
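A hedged sketch of such a direct write follows; the 4096-byte alignment is an assumption about the device's logical block size, and the helper name direct_write is mine:

```c
#define _GNU_SOURCE   /* exposes O_DIRECT on glibc */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write one 4 KiB block while bypassing the page cache. O_DIRECT
 * requires the buffer, length, and file offset to be aligned,
 * typically to the device's logical block size (assumed 4096 here). */
int direct_write(const char *path) {
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, 4096) != 0)   /* aligned allocation */
        return -1;
    memset(buf, 'x', 4096);

    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { free(buf); return -1; }

    ssize_t n = write(fd, buf, 4096);   /* goes straight to the device queue */
    close(fd);
    free(buf);
    return n == 4096 ? 0 : -1;
}
```

Note that some filesystems (tmpfs, for example) reject O_DIRECT, so production code should be prepared to fall back to buffered I/O.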
The I/O call chain shows that write performs a system call that copies data from the application buffer directly into the page cache, triggering a user-to-kernel mode switch. The kernel's pdflush threads (replaced by per-device flusher threads in modern kernels) later move dirty pages from the page cache to the I/O scheduler queue, where algorithms such as noop, deadline, or cfq decide when to issue the actual disk operations.
On SSDs, the noop scheduler is often preferred because there is no mechanical seek time, whereas traditional HDDs benefit from elevator‑style scheduling that reduces head movement.
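The active scheduler for a device is exposed through sysfs, with the current choice shown in brackets (e.g., "noop [deadline] cfq"). A small sketch that reads it; the device name "sda" is an assumption, and newer kernels list multiqueue schedulers such as mq-deadline instead:

```c
#include <stdio.h>

/* Print the available and active I/O schedulers for a block device
 * by reading /sys/block/<dev>/queue/scheduler. */
void print_scheduler(const char *dev) {
    char path[128], line[256];
    snprintf(path, sizeof path, "/sys/block/%s/queue/scheduler", dev);
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        perror(path);   /* device absent, or sysfs not mounted */
        return;
    }
    if (fgets(line, sizeof line, fp) != NULL)
        printf("%s: %s", dev, line);
    fclose(fp);
}
```

Writing a scheduler name to the same sysfs file (as root) switches the policy at runtime.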
Consistency and safety considerations include data loss scenarios: data in the application or libc buffers is lost if the process exits; data in the page cache survives a process exit but can be lost if the kernel crashes or the machine powers off before the kernel flushes it to disk. Using O_SYNC or fsync mitigates these risks.
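For the O_SYNC route, a minimal sketch (the helper name write_osync is my own): with this flag, every write blocks until the data has reached the storage device, so a crash immediately after write() returns cannot lose it.

```c
#include <fcntl.h>
#include <unistd.h>

/* With O_SYNC, write() does not return until the data (and the
 * metadata needed to retrieve it) is on the storage device, as if
 * each write were followed by fsync(). */
int write_osync(const char *path, const char *data, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, data, len);   /* blocks until durable */
    close(fd);
    return (n >= 0 && (size_t)n == len) ? 0 : -1;
}
```

The trade-off is throughput: every write pays the full device latency, so O_SYNC suits logs and journals rather than bulk data paths.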
When multiple file descriptors write to the same file, each descriptor maintains its own file offset, leading to overwrites unless the O_APPEND flag is used, which forces each write to append at the current end of the file.
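The overwrite-versus-append behavior can be demonstrated directly; this sketch (helper name two_writer_size is mine) writes the same file through two descriptors and returns the resulting size:

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Two independent descriptors on the same file. Without O_APPEND both
 * start at offset 0, so the second write overwrites the first; with
 * O_APPEND the kernel atomically repositions each write to end-of-file.
 * Returns the resulting file size, or -1 on error. */
long two_writer_size(const char *path, int use_append) {
    int extra = use_append ? O_APPEND : 0;
    int fd1 = open(path, O_WRONLY | O_CREAT | O_TRUNC | extra, 0644);
    int fd2 = open(path, O_WRONLY | extra);
    if (fd1 < 0 || fd2 < 0)
        return -1;
    write(fd1, "AAAA", 4);   /* fd1's private offset: 0 -> 4 */
    write(fd2, "BBBB", 4);   /* fd2's offset is independent of fd1's */
    close(fd1);
    close(fd2);
    struct stat st;
    return stat(path, &st) == 0 ? (long)st.st_size : -1;
}
```

Without O_APPEND the file ends up 4 bytes long ("BBBB" clobbers "AAAA"); with O_APPEND both records survive and the file is 8 bytes.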
Performance bottlenecks stem mainly from disk seek time (≈10 ms per seek) and rotational speed (e.g., 15,000 rpm yields 250 rotations per second, i.e., 4 ms per revolution). Typical sequential write speeds are 0–30 MiB/s for HDDs and up to 400 MiB/s for SSDs.
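These figures imply a simple latency budget for random I/O, sketched below as a back-of-the-envelope model (the half-revolution average wait is the standard simplifying assumption):

```c
/* Estimate random IOPS for an HDD from average seek time and spindle
 * speed. On average the head waits half a revolution for the target
 * sector to come around after the seek completes. */
double hdd_random_iops(double seek_ms, double rpm) {
    double rotation_ms = 60000.0 / rpm;          /* one full revolution, in ms */
    double rot_latency_ms = rotation_ms / 2.0;   /* average rotational wait */
    return 1000.0 / (seek_ms + rot_latency_ms);  /* random I/Os per second */
}
```

With the article's figures (10 ms seek, 15,000 rpm): 60000/15000 = 4 ms per revolution, so 2 ms average rotational wait, giving roughly 1000/12 ≈ 83 random I/Os per second. That is why random workloads on HDDs are so much slower than sequential ones, and why SSDs, which pay neither cost, change the picture.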
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.