Understanding Linux Disk I/O, Page Cache, and File Operation Mechanisms
This article explains the hierarchical storage pyramid, Linux kernel I/O stack, page‑cache synchronization policies, the atomicity of write operations, the role of mmap and Direct I/O, and how disk characteristics influence multithreaded read/write performance and design decisions.
Before diving into the discussion, the author poses several questions about HDD vs SSD differences, multithreaded file writes, write durability after power loss, atomicity of write calls, and the performance claims of mmap.
The storage hierarchy is presented as a pyramid where faster, more expensive memory (registers, CPU caches, DRAM) sits above slower, cheaper storage (local disks), and each layer typically serves as a cache for the one below.
When a program invokes file‑operation functions, the data passes through several layers of caching: user‑space stdio buffers the data first, the kernel's page cache (into which the older buffer cache has been unified on modern kernels) stores file contents, and the storage device's onboard cache holds raw block data. The article focuses on the kernel‑level caches, especially the page cache.
The Linux I/O stack consists of three layers: the file‑system layer (where write copies data into the file‑system cache), the block layer (which merges and schedules I/O requests), and the device layer (which uses DMA to transfer data to the hardware).
Different I/O mechanisms occupy different positions in this stack: traditional buffered I/O goes through both the page cache and the block layer, mmap maps page‑cache pages directly into user space, eliminating the copy between kernel and user buffers, and Direct I/O bypasses the page cache entirely, copying data straight between user buffers and the device.
Page‑cache synchronization follows either write‑through (immediate flush) or write‑back (asynchronous flush). Linux defaults to write‑back, marking modified pages as dirty and flushing them based on memory pressure, time‑outs, or explicit sync/fsync/fdatasync calls. The relevant kernel parameters are shown below:
# Flush every 5 seconds
root@082caa3dfb1d / $ sysctl vm.dirty_writeback_centisecs
vm.dirty_writeback_centisecs = 500
# Dirty pages older than 30 seconds are flushed on the next run
root@082caa3dfb1d / $ sysctl vm.dirty_expire_centisecs
vm.dirty_expire_centisecs = 3000
# Flush when dirty pages exceed 10% of RAM
root@082caa3dfb1d / $ sysctl vm.dirty_background_ratio
vm.dirty_background_ratio = 10
To enforce write‑through semantics for a specific file, open it with the O_SYNC flag or call fsync after writes.
Regarding durability after power loss, using O_SYNC or fsync only guarantees that data reaches the disk's onboard cache; if that cache is not disabled (e.g., via hdparm -W0), a sudden power loss can still cause data loss.
When multiple threads write to the same file, write() is not guaranteed to be atomic; the kernel guarantees atomicity only for specific operations, such as file creation with O_CREAT|O_EXCL and the seek‑to‑end performed by O_APPEND. Whether O_APPEND combined with write makes concurrent appends fully atomic is debated, with no definitive answer in the Linux documentation.
Linux provides two file‑locking mechanisms: flock (BSD style) and fcntl (System V style). In practice, developers often avoid concurrent writes by using application‑level mutexes or by designing systems (e.g., databases) that manage their own logging and caching.
Disk performance testing is essential for systems with heavy I/O. Mechanical HDDs suffer from high seek latency, making random I/O slow, while SSDs handle random accesses efficiently and benefit from deep I/O queues (an iodepth of 32‑64 is common in benchmarks). Tools like fio are recommended for measuring IOPS, latency, and throughput.
Finally, the article emphasizes that storage characteristics should guide software design: avoid random I/O on HDDs, batch writes on SSDs, and consider using techniques such as log‑structured storage (e.g., LevelDB) to align with underlying hardware behavior.