Understanding Linux I/O: From Buffers to Disk Writes
This article provides a comprehensive overview of Linux I/O fundamentals, covering the layered I/O stack, buffer interactions, system call flow, scheduler algorithms, consistency and safety considerations, and performance characteristics, supplemented with code examples.
1. Introduction
Linux I/O is the foundation of file storage. This article summarizes basic Linux I/O concepts.
2. Linux I/O Stack
Linux file I/O uses a layered design, which keeps the architecture clear and the layers decoupled.
When data is written, it passes through several buffers in turn: the application buffer, the libc (standard I/O) buffer, the kernel page cache, and finally the disk's own write cache before reaching the platter.
Example code:
<code>void foo(const char *src, FILE *fp) {
    char *buf = malloc(MAX_SIZE);
    if (buf == NULL)
        return;
    strncpy(buf, src, MAX_SIZE);
    fwrite(buf, MAX_SIZE, 1, fp);
    fclose(fp);
    free(buf);
}</code>After fwrite, the C library copies the data from the application buffer into the libc (stdio) buffer. fclose flushes the libc buffer into the page cache, but the data can still be lost if the system crashes or loses power before the kernel writes it to disk. To guarantee persistence, sync or fsync must be called. fflush only moves data from the libc buffer into the page cache; it does not force the data onto the disk.
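To make the layers concrete, the following sketch pushes one line through every buffer explicitly, flushing each stage by hand. The function name and file path are illustrative, not part of any standard API:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write one line and force it through every buffer layer to the disk.
   Returns 0 on success, -1 on any failure. */
int persist_line(const char *path, const char *line) {
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    /* fwrite: application data -> libc (stdio) buffer */
    if (fwrite(line, 1, strlen(line), fp) != strlen(line)) { fclose(fp); return -1; }
    /* fflush: libc buffer -> kernel page cache (issues write(2)) */
    if (fflush(fp) != 0) { fclose(fp); return -1; }
    /* fsync: page cache -> disk, so the data survives a kernel crash */
    if (fsync(fileno(fp)) != 0) { fclose(fp); return -1; }
    return fclose(fp) == 0 ? 0 : -1;
}
```

Note that fsync must come before fclose here: fclose only flushes the stdio buffer, never the page cache.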
3. I/O Call Chain
fwrite is the highest‑level interface: it buffers data in user space and eventually invokes the write system call, which triggers a user‑to‑kernel transition. The data lands in the page cache, after which the kernel's writeback (flusher) threads, called pdflush in older kernels, submit it to the I/O queue.
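Calling write(2) directly skips the stdio layer entirely, so on return the data is already in the page cache. A minimal sketch, with an illustrative function name and return convention; fdatasync is used here to force writeback immediately instead of waiting for the flusher threads:

```c
#include <fcntl.h>
#include <unistd.h>

/* Write through the syscall layer directly: no stdio buffer is involved,
   so write(2) copies the data straight into the page cache.
   fdatasync then forces writeback rather than waiting for the kernel's
   flusher threads. Returns 0 on success, -1 on failure. */
int write_now(const char *path, const char *buf, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, len);   /* user -> kernel transition */
    if (n != (ssize_t)len || fdatasync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```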
4. I/O Scheduler Layer
Tasks in the I/O queue are reordered to maximize overall disk throughput. Traditional algorithms such as the elevator and deadline schedulers aim to reduce head movement on mechanical drives. SSDs have no moving parts, so a pass-through scheduler (noop, or "none" under the newer blk-mq framework) is often preferred.
5. Consistency and Safety
5.1 Safety
If a process exits or crashes, data still in the application or libc buffer is lost, but data already in the page cache survives. A kernel crash loses data that has not yet reached the disk cache, and a power loss additionally loses whatever sits in the disk's volatile write cache.
5.2 Consistency
Opening the same file several times in one process without O_APPEND causes writes to overwrite one another, because each open creates an independent file description with its own offset, and every write starts from that descriptor's own position.
<code>fd1 = open("file", O_RDWR | O_CREAT | O_TRUNC, 0644);
fd2 = open("file", O_RDWR);
while (1) {
    write(fd1, "hello\n", 6);  /* each offset advances independently, */
    write(fd2, "world\n", 6);  /* so the two lines keep overwriting each other */
}</code>Opening with O_APPEND instead makes every write atomically reposition to the current end of file before writing, so each descriptor appends its data and nothing is overwritten.
5.3 Read Process
The read path proceeds as: library read → sys_read → VFS (vfs_read / generic_file_read) → page cache lookup. On a cache hit, the data is copied straight to the user buffer; on a miss, the request continues through the block layer → I/O scheduler → driver → DMA from the disk into the page cache → copy to the user buffer.
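The cache-hit half of that path can be exercised directly: a read issued right after a write is served from the page cache, so no block-layer request or DMA is involved at all. A minimal round-trip sketch (function name and path are illustrative):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write a few bytes, then read them back. The pread is satisfied from
   the page cache just populated by the write, so only the kernel->user
   copy happens; no disk I/O is issued. Returns 0 if data matches. */
int roundtrip(const char *path) {
    char buf[16] = {0};
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, "cached", 6) != 6) { close(fd); return -1; }
    if (pread(fd, buf, 6, 0) != 6)   { close(fd); return -1; }
    close(fd);
    return strcmp(buf, "cached") == 0 ? 0 : -1;
}
```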
6. Performance Issues
An average disk seek takes roughly 10 ms, which limits a mechanical drive to about 100‑200 seeks per second. Rotational speed bounds throughput: a typical 15,000 rpm drive sustains up to roughly 50 MB/s of sequential reads, while SSDs can reach up to about 400 MB/s.
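The gap between random and sequential access follows from simple arithmetic: at ~10 ms per seek, random 4 KiB reads move only about 0.4 MB/s, two orders of magnitude below the sequential figure. A small helper to check the numbers (the 4 KiB block size is an assumed workload parameter):

```c
/* Back-of-the-envelope random-read throughput for a mechanical disk:
   each read costs one seek, so throughput = (seeks/sec) * block size.
   seek_ms: average seek time; block_kib: read size in KiB.
   Result is in MiB per second. */
double random_read_mbps(double seek_ms, double block_kib) {
    double iops = 1000.0 / seek_ms;   /* seeks (reads) per second */
    return iops * block_kib / 1024.0; /* MiB/s */
}
```

With seek_ms = 10 and block_kib = 4 this gives about 0.39 MiB/s, versus the ~50 MB/s the same drive sustains sequentially.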
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.