Mastering Linux File I/O: Layers, Calls, and Performance Tweaks
This article breaks down Linux file I/O from high‑level architecture to low‑level system calls, explains how data moves through application buffers, C library buffers, page cache and disk, and offers practical tips for improving throughput, consistency, and safety.
1. Writing Files Across Layers
When an application writes data, it typically allocates an application buffer (e.g., with malloc), copies data into it, and then calls fwrite. fwrite copies the data into the C library I/O buffer, where it stays until the buffer fills, fflush is called, or fclose closes the stream. Each of these hands the data to the kernel page cache through write, but none of them pushes it to the disk. The data reaches the physical medium when the kernel writes back the dirty pages on its own schedule, or when the application forces it with fsync; fclose alone gives no such guarantee.
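<code>char *buf = malloc(MAX_BUF_SIZE); /* application buffer */ strncpy(buf, src, MAX_BUF_SIZE); fwrite(buf, MAX_BUF_SIZE, 1, fp); /* copy into the C library buffer */ fclose(fp); /* flush to the page cache, not to disk */ free(buf); </code>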
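The three hops described above can be made explicit in one function. This is a minimal sketch, not the article's own code: the name `write_durably` and its return convention are assumptions chosen for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes to `path` and push them through
 * every layer. Returns 0 on success, -1 on any failure. */
int write_durably(const char *path, const void *data, size_t len)
{
    assert(path != NULL && data != NULL);

    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    /* Application buffer -> C library buffer. */
    if (fwrite(data, 1, len, fp) != len) {
        fclose(fp);
        return -1;
    }
    /* C library buffer -> page cache (stdio calls write(2) internally). */
    if (fflush(fp) != 0) {
        fclose(fp);
        return -1;
    }
    /* Page cache -> storage device. */
    if (fsync(fileno(fp)) != 0) {
        fclose(fp);
        return -1;
    }
    /* fclose can still report an error, so its result is checked too. */
    return fclose(fp) == 0 ? 0 : -1;
}
```

Calling `write_durably("out.bin", buf, n)` returns only after fsync has succeeded. Dropping the fsync step leaves the data in the page cache, where a process crash cannot lose it but a power failure can.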
Calling the low‑level write system call directly bypasses the C library buffer: the data is copied from the application buffer straight to the page cache. Mapping a file with mmap exposes page cache pages in user space, so ordinary loads and stores replace per‑access read and write calls (page faults and msync still involve the kernel). To skip the page cache itself, open the file with O_DIRECT; to skip the filesystem as well, access a RAW device (e.g., with tools such as dd or cpio), which writes directly to the device.
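As an illustration of the mmap path, the sketch below fills a file through a shared mapping instead of calling write. The function name `write_via_mmap` is an assumption, and the final msync is included only so the example also reaches the device; it is not required for other processes to see the data.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: store `len` bytes into `path` through a shared
 * mapping. Returns 0 on success, -1 on failure. */
int write_via_mmap(const char *path, const void *data, size_t len)
{
    assert(path != NULL && data != NULL && len > 0);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* A mapping cannot reach past end of file, so size the file first. */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* Plain stores into page cache pages; no write(2) is issued here. */
    memcpy(map, data, len);

    /* Ask the kernel to write the dirty pages back and wait for it. */
    int rc = msync(map, len, MS_SYNC);
    if (munmap(map, len) != 0)
        rc = -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```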
2. I/O Call Chain
The typical call chain starts with fwrite, which buffers small writes, merges them, and eventually invokes write. The write system call copies data from user space to kernel space, causing a user‑mode/kernel‑mode transition. Once the data is in the page cache, the kernel schedules it for asynchronous write‑back; the actual disk write is issued later by the kernel flusher threads (the pdflush thread in kernels before 2.6.32) and passes through the I/O scheduler. Opening the file with O_SYNC makes each write synchronous: the call returns only after the data has been flushed to the storage device.
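The merging done by fwrite can be observed from outside: until stdio calls write, the kernel still sees an empty file. The sketch below is an illustration under assumptions (the helper names, the 64 KiB buffer size, and the 4-byte record are all made up for the example).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* File size as the kernel currently sees it, or -1 on error. */
static long kernel_size(FILE *fp)
{
    struct stat st;
    return fstat(fileno(fp), &st) == 0 ? (long)st.st_size : -1;
}

/* Hypothetical demo: issue `count` 4-byte fwrite calls and record the
 * kernel-visible size before and after fflush. Returns 0 on success. */
int buffered_small_writes(const char *path, int count,
                          long *before_flush, long *after_flush)
{
    assert(path != NULL && before_flush != NULL && after_flush != NULL);

    /* Static so the buffer outlives the stream; not reentrant. */
    static char iobuf[1 << 16];

    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    if (setvbuf(fp, iobuf, _IOFBF, sizeof iobuf) != 0) {
        fclose(fp);
        return -1;
    }

    for (int i = 0; i < count; i++) {
        /* Each call only copies into iobuf while there is room. */
        if (fwrite("rec\n", 1, 4, fp) != 4) {
            fclose(fp);
            return -1;
        }
    }

    *before_flush = kernel_size(fp);
    if (fflush(fp) != 0) {          /* one write(2) for the merged data */
        fclose(fp);
        return -1;
    }
    *after_flush = kernel_size(fp);

    return fclose(fp) == 0 ? 0 : -1;
}
```

As long as the total stays below the buffer size, the size reported before fflush is expected to be 0 and the size afterwards equals the number of bytes written.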
Once dirty pages are submitted for write‑back, the kernel's I/O scheduler decides the order in which requests are issued to the device, using algorithms such as CFQ, deadline, or noop (the last is preferred for SSDs). On current multi‑queue kernels the corresponding choices are bfq, mq‑deadline, kyber, and none. The scheduler also merges requests for adjacent sectors and orders writes to minimize head movement on spinning disks.
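On Linux the scheduler in use for a device can be read from sysfs, where the active one is shown in square brackets (for example `noop deadline [cfq]`). The following sketch parses that line; the function names are assumptions, and the device name passed in (such as `sda`) depends on the system.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy the bracketed scheduler name from `line` into `out`.
 * Returns 0 on success, -1 if no "[name]" is present or it does not fit. */
int parse_active_scheduler(const char *line, char *out, size_t outsz)
{
    assert(line != NULL && out != NULL && outsz > 0);

    const char *lb = strchr(line, '[');
    if (lb == NULL)
        return -1;
    const char *rb = strchr(lb + 1, ']');
    if (rb == NULL)
        return -1;

    size_t n = (size_t)(rb - (lb + 1));
    if (n >= outsz)
        return -1;
    memcpy(out, lb + 1, n);
    out[n] = '\0';
    return 0;
}

/* Read /sys/block/<dev>/queue/scheduler and extract the active entry. */
int read_active_scheduler(const char *dev, char *out, size_t outsz)
{
    assert(dev != NULL);

    char path[256], line[256];
    int n = snprintf(path, sizeof path, "/sys/block/%s/queue/scheduler", dev);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    char *ok = fgets(line, sizeof line, fp);
    fclose(fp);

    return ok != NULL ? parse_active_scheduler(line, out, outsz) : -1;
}
```

Writing one of the listed names back to the same sysfs file (as root) switches the scheduler for that device.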
3. Consistency and Safety
If a process crashes while data is still in the application or C library buffers, the data is lost. Data already in the page cache survives a process crash but is lost if the kernel crashes before it reaches the disk. Power loss destroys everything that has not yet reached stable storage, so data is safe against power failure only after fsync (or an equivalent such as O_SYNC) has completed successfully.
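A common way to apply these rules when replacing a file is to write a temporary file, fsync it, rename it over the target, and then fsync the directory. This pattern is not described in the article; it is offered as a sketch, and the name `save_atomically` and the `.tmp` suffix are assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: replace dir/name with `data` so that a crash
 * leaves either the old or the new content. Returns 0 or -1. */
int save_atomically(const char *dir, const char *name,
                    const void *data, size_t len)
{
    assert(dir != NULL && name != NULL && data != NULL);

    char tmp[4096], dst[4096];
    int a = snprintf(tmp, sizeof tmp, "%s/%s.tmp", dir, name);
    int b = snprintf(dst, sizeof dst, "%s/%s", dir, name);
    if (a < 0 || (size_t)a >= sizeof tmp || b < 0 || (size_t)b >= sizeof dst)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = data;
    size_t left = len;
    while (left > 0) {                      /* handle short writes */
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            unlink(tmp);
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }

    /* File contents must be on disk before the rename makes them visible. */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0 || rename(tmp, dst) != 0) {
        unlink(tmp);
        return -1;
    }

    /* Persist the directory entry created by rename. */
    int dfd = open(dir, O_RDONLY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    if (rc != 0 && errno == EINVAL)
        rc = 0;     /* this filesystem does not support directory fsync */
    close(dfd);
    return rc;
}
```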
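Concurrent writes come with limited guarantees. The PIPE_BUF rule (writes of at most PIPE_BUF bytes, 4096 on Linux, are atomic) applies to pipes and FIFOs, not to regular files. For regular files, O_APPEND makes each write move to the end of the file and write its data as one atomic step, so concurrent writers do not overwrite each other. The C library buffer is private to each process, so processes that share a file through stdio must flush complete records or lock the file. Threads that share one FILE within a process get per‑call locking from stdio, but need explicit locking (for example flockfile) to keep a sequence of calls together.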
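The O_APPEND behaviour can be seen with two independent descriptors on the same file: each write lands at the current end of file regardless of what the other descriptor did. The helper names `open_log` and `append_line` are assumptions for this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Open (or create) a log file in append mode. Returns a descriptor or -1. */
int open_log(const char *path)
{
    assert(path != NULL);
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
}

/* Append one record with a single write(2). With O_APPEND the kernel
 * seeks to end of file and writes as one step. Returns 0 or -1. */
int append_line(int fd, const char *line)
{
    assert(line != NULL);
    size_t len = strlen(line);
    ssize_t n = write(fd, line, len);
    return n == (ssize_t)len ? 0 : -1;
}
```

If two descriptors from `open_log` alternate calls to `append_line`, the records appear in call order. Without O_APPEND the second descriptor would start at offset 0 and overwrite the first record. Keeping each record in one write call is what prevents a record from being split.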
4. Performance Considerations
Disk performance is limited by mechanical factors (seek time ~10 ms, rotation speed up to 15 000 rpm) and by the I/O scheduler. SSDs eliminate seek latency, making complex scheduling less beneficial; the simple noop scheduler (none on multi‑queue kernels) is often optimal. Throughput figures quoted for this class of hardware are roughly 30 MiB/s sequential write and 50 MiB/s read for HDDs, and up to 400 MiB/s for SSDs; current devices are often considerably faster, so measure on the target system.
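The mechanical figures translate into a rough ceiling on random I/O. The sketch below assumes the ~10 ms seek time is an average, that average rotational latency is half a revolution, and that transfer time is negligible; the function names are made up for the example.

```c
#include <assert.h>

/* Average rotational latency in milliseconds: half of one revolution. */
double avg_rotational_latency_ms(double rpm)
{
    assert(rpm > 0.0);
    return 0.5 * 60000.0 / rpm;
}

/* Rough upper bound on random I/O operations per second for one disk,
 * ignoring transfer time and any caching. */
double random_iops(double seek_ms, double rpm)
{
    assert(seek_ms >= 0.0);
    return 1000.0 / (seek_ms + avg_rotational_latency_ms(rpm));
}
```

With the numbers above, `avg_rotational_latency_ms(15000)` is 2 ms and `random_iops(10, 15000)` is about 83 operations per second, which is why random access on an HDD is far slower than its sequential throughput suggests.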
Improving performance can involve:
Using larger, aligned I/O buffers to reduce the number of system calls and copies.
Employing O_DIRECT to bypass the page cache, or mmap to avoid an extra copy through it, when appropriate.
Choosing an I/O scheduler suited to the storage medium.
Parallelizing I/O across multiple disks.
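The effect of buffer size on the first item can be made visible by counting system calls during a copy. This is a sketch under assumptions: `copy_with_chunk` is a hypothetical helper, and it counts read calls only as a proxy for per-call overhead.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Copy `src` to `dst` using a buffer of `chunk` bytes. Returns the number
 * of read(2) calls that returned data, or -1 on error. */
long copy_with_chunk(const char *src, const char *dst, size_t chunk)
{
    assert(src != NULL && dst != NULL && chunk > 0);

    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    long calls = -1;
    char *buf = malloc(chunk);
    if (buf != NULL) {
        calls = 0;
        for (;;) {
            ssize_t n = read(in, buf, chunk);
            if (n == 0)
                break;                      /* end of file */
            if (n < 0) {
                if (errno == EINTR)
                    continue;
                calls = -1;
                break;
            }
            calls++;

            ssize_t off = 0;
            while (off < n) {               /* handle short writes */
                ssize_t w = write(out, buf + off, (size_t)(n - off));
                if (w < 0) {
                    if (errno == EINTR)
                        continue;
                    calls = -1;
                    break;
                }
                off += w;
            }
            if (calls < 0)
                break;
        }
    }

    free(buf);
    close(in);
    if (close(out) != 0)
        calls = -1;
    return calls;
}
```

Copying the same file with a 512-byte chunk and then a 64 KiB chunk should show a much lower call count for the larger buffer, since each call pays the same user/kernel transition cost regardless of size.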
5. P.S. O_DIRECT vs. RAW Device
O_DIRECT works through the filesystem: the application deals with file descriptors, the kernel translates operations to inodes and data blocks, and the final on‑disk format follows the underlying filesystem (e.g., ext3).
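O_DIRECT generally requires the buffer address, the transfer size, and the file offset to be aligned to the device's block size. The sketch below writes one aligned block; the 4096-byte value is an assumption (the real requirement depends on the device and filesystem), and `write_block_direct` is a hypothetical name. Some filesystems reject O_DIRECT, typically with EINVAL.

```c
#define _GNU_SOURCE /* glibc exposes O_DIRECT only with this macro */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DIO_BLOCK 4096 /* assumed alignment and transfer size */

/* Hypothetical helper: write one DIO_BLOCK-sized block filled with `fill`
 * to `path`, bypassing the page cache. Returns 0, or -1 with errno set. */
int write_block_direct(const char *path, unsigned char fill)
{
    assert(path != NULL);

    void *buf = NULL;
    int err = posix_memalign(&buf, DIO_BLOCK, DIO_BLOCK);
    if (err != 0) {
        errno = err;    /* posix_memalign returns the error code */
        return -1;
    }
    memset(buf, fill, DIO_BLOCK);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) {
        err = errno;    /* commonly EINVAL when O_DIRECT is unsupported */
        free(buf);
        errno = err;
        return -1;
    }

    ssize_t n = write(fd, buf, DIO_BLOCK);
    err = (n < 0) ? errno : (n == DIO_BLOCK ? 0 : EIO);
    free(buf);
    if (close(fd) != 0 && err == 0)
        err = errno;

    if (err != 0) {
        errno = err;
        return -1;
    }
    return 0;
}
```

The result is still an ordinary file in the filesystem. O_DIRECT changes how the data travels, not where it ends up, and it does not by itself flush file metadata; fsync is still needed for that.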
RAW device access bypasses any filesystem; the application reads and writes raw sectors. Data written this way need not correspond to any filesystem structure, so the device cannot be mounted as a filesystem elsewhere; another system can use the data only through software that understands the application's own layout.
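Addressing by sector rather than by file can be sketched as follows. The 512-byte sector size and the name `read_sector` are assumptions. The path would be a block device node such as `/dev/sdX` (normally requiring root), although the same code also works on a regular file or disk image. Opening a block device this way still goes through the kernel's cache for that device unless O_DIRECT is added.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR_SIZE 512 /* assumed logical sector size */

/* Hypothetical helper: read sector number `sector` from `dev` into `out`.
 * Returns 0 on success, -1 on error or if the sector is past the end. */
int read_sector(const char *dev, uint64_t sector,
                unsigned char out[SECTOR_SIZE])
{
    assert(dev != NULL && out != NULL);

    int fd = open(dev, O_RDONLY);
    if (fd < 0)
        return -1;

    /* No inode or block mapping is involved: the offset is simply
     * sector number times sector size. */
    ssize_t n = pread(fd, out, SECTOR_SIZE, (off_t)(sector * SECTOR_SIZE));
    close(fd);

    return n == SECTOR_SIZE ? 0 : -1;
}
```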
ITPUB
