Understanding Linux Kernel I/O Mechanisms, File Systems, and Block I/O Scheduling
This article provides a comprehensive overview of Linux kernel I/O mechanisms, including file system interfaces, blocking and non‑blocking I/O, asynchronous models, multiplexing with select/epoll, EXT file‑system structures, VFS abstraction, consistency and journaling, as well as the complete block I/O path, scheduling algorithms, and debugging tools.
Linux kernel I/O mechanisms comprise a set of techniques and algorithms that enable efficient handling of input/output operations, ranging from file system interfaces and file descriptors to blocking, non‑blocking, asynchronous, and multiplexed I/O models.
1. I/O Models
The kernel supports several I/O models:
File system access through various virtual and real file systems (e.g., procfs, sysfs).
File descriptors that uniquely identify open files within a process.
Blocking I/O, where the calling process sleeps until the operation completes.
Non‑blocking I/O, which returns immediately with an error if data is not ready.
Asynchronous I/O, where the kernel notifies the application upon completion.
Multiplexing (select, poll, epoll) to monitor many descriptors simultaneously.
1.1 Blocking vs Non‑blocking
Blocking calls suspend the process until the I/O condition is satisfied; a signal can interrupt a blocked call (it fails with EINTR) unless the handler was installed with SA_RESTART. Non‑blocking calls return immediately with EAGAIN/EWOULDBLOCK when the device is not ready; busy‑polling a descriptor this way wastes CPU, so non‑blocking I/O is rarely used on its own and is usually combined with multiplexing.
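A minimal sketch of non‑blocking reads (watching standard input purely for illustration): fcntl() sets O_NONBLOCK, and a read() that would otherwise sleep returns -1 with EAGAIN instead.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Put stdin into non-blocking mode. */
    int flags = fcntl(STDIN_FILENO, F_GETFL, 0);
    fcntl(STDIN_FILENO, F_SETFL, flags | O_NONBLOCK);

    char buf[128];
    ssize_t n = read(STDIN_FILENO, buf, sizeof(buf));
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        /* No data ready: the call returns immediately instead of sleeping. */
        printf("no data available yet\n");
    } else if (n > 0) {
        printf("read %zd bytes\n", n);
    }
    return 0;
}
```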
1.2 Multiplexing
Linux provides select(), poll(), and the more scalable epoll(). select linearly scans all monitored descriptors on every call (and the descriptor set must be rebuilt each time), while epoll separates registration from event notification, reducing overhead for large numbers of descriptors.
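A minimal epoll sketch (again watching standard input, just as an illustration): the descriptor is registered once with epoll_ctl(), and epoll_wait() then returns only the descriptors that are actually ready.

```c
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
    int epfd = epoll_create1(0);

    /* Register interest in readability of stdin; registration happens once. */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = STDIN_FILENO };
    epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev);

    /* Wait for events; only ready descriptors are returned, so the cost does
       not grow with the number of registered descriptors as it does for select(). */
    struct epoll_event events[16];
    int n = epoll_wait(epfd, events, 16, 5000);   /* 5 s timeout */
    for (int i = 0; i < n; i++)
        printf("fd %d is readable\n", events[i].data.fd);

    close(epfd);
    return 0;
}
```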
1.3 Asynchronous I/O
Two main implementations exist: Glibc‑AIO, which emulates asynchrony with user‑space threads, and Kernel‑AIO, where submission and completion are handled natively inside the kernel; both allow CPU work and I/O to proceed in parallel.
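A minimal sketch of the glibc POSIX AIO interface, assuming a file named testfile exists (older glibc versions need -lrt when linking): aio_read() submits the request and returns at once, and completion is polled here with aio_error()/aio_return().

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_RDONLY);   /* "testfile" is a placeholder name */
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) < 0) { perror("aio_read"); return 1; }  /* submit and return */

    /* The CPU is free to do other work here while the read is in flight. */

    while (aio_error(&cb) == EINPROGRESS)
        usleep(1000);            /* poll for completion; a signal or thread
                                    notification could be used instead */
    ssize_t n = aio_return(&cb);
    printf("asynchronously read %zd bytes\n", n);
    close(fd);
    return 0;
}
```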
1.4 libevent
Libevent offers a cross‑platform, event‑driven API similar in spirit to Qt/VC callback mechanisms. Programs that use it are compiled with gcc xxx.c -levent, and the library abstracts away the underlying readiness mechanism (epoll, kqueue, and so on).
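A minimal libevent 2 sketch along those lines, once more watching standard input for readability; the file name demo.c is a placeholder, built with gcc demo.c -levent.

```c
#include <event2/event.h>
#include <stdio.h>
#include <unistd.h>

/* Called by libevent whenever the watched descriptor becomes readable. */
static void on_readable(evutil_socket_t fd, short events, void *arg)
{
    char buf[256];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n > 0)
        fwrite(buf, 1, (size_t)n, stdout);
    else
        event_base_loopbreak((struct event_base *)arg);  /* EOF/error: stop loop */
}

int main(void)
{
    struct event_base *base = event_base_new();
    /* Watch stdin; EV_PERSIST keeps the event registered after each callback. */
    struct event *ev = event_new(base, STDIN_FILENO, EV_READ | EV_PERSIST,
                                 on_readable, base);
    event_add(ev, NULL);           /* no timeout */
    event_base_dispatch(base);     /* run the event loop */
    event_free(ev);
    event_base_free(base);
    return 0;
}
```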
2. EXT File System
Linux uses a Virtual File System (VFS) layer to provide a uniform interface to multiple on‑disk formats (ext2/3/4, FAT, NTFS, etc.). An EXT2/3/4 partition consists of:
Boot block (1 KB, reserved by the PC standard).
Superblock describing global filesystem parameters.
Group Descriptor Table (GDT) with per‑group metadata.
Block bitmap and inode bitmap indicating free blocks/inodes.
Inode table storing file metadata (type, permissions, timestamps, block pointers).
Data blocks holding file contents, directories, symbolic links, and special files.
Inodes contain 15 block pointers: 12 direct, one single‑indirect, one double‑indirect, and one triple‑indirect, enabling files of up to roughly 16 GB with 1 KB blocks (and about 4 TB with 4 KB blocks) to be addressed.
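The limit follows directly from the pointer layout; a small sketch of the arithmetic (each block pointer is 4 bytes):

```c
#include <stdio.h>

/* Maximum file size addressable through an ext2 inode's 15 block pointers,
   for a given block size in bytes. */
static unsigned long long ext2_max_file_size(unsigned long long block)
{
    unsigned long long ptrs = block / 4;   /* pointers per indirect block */
    return (12 + ptrs + ptrs * ptrs + ptrs * ptrs * ptrs) * block;
}

int main(void)
{
    printf("1 KiB blocks: ~%llu GiB\n", ext2_max_file_size(1024) >> 30);  /* ~16 GiB */
    printf("4 KiB blocks: ~%llu TiB\n", ext2_max_file_size(4096) >> 40);  /* ~4 TiB  */
    return 0;
}
```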
2.1 Example: Creating and Inspecting a 1 MiB EXT2 Image
Create a 1 MiB image:
dd if=/dev/zero of=fs count=256 bs=4k
Format the image:
mkfs.ext2 -b 1024 fs
Mount it via a loop device:
sudo mount -o loop fs /mnt
Inspect the metadata with dumpe2fs and debugfs to view the superblock, group descriptors, block/inode bitmaps, and directory entries.
2.2 Consistency and Journaling
File‑system operations are not atomic; a power loss in the middle of an update can leave metadata inconsistent. Journaling records intended changes before they are applied, allowing recovery after a crash. ext3/ext4 support three journaling modes:
data=journal: both data and metadata are journaled (slowest but safest).
data=ordered: only metadata is journaled, but data blocks are written to disk before the corresponding metadata is committed (the default on many distributions).
data=writeback: only metadata is journaled; no ordering between data and metadata is guaranteed (fastest).
2.3 Copy‑On‑Write (COW) Filesystems
Btrfs implements COW: new blocks are allocated for updates, and metadata is atomically switched to point to them, providing consistency without a traditional journal and enabling snapshots and subvolumes.
3. Block I/O Flow and Scheduling
3.1 From Page Cache to BIO to Request
When an application reads a file, the kernel first checks the page cache. A cache miss triggers a readpage operation that builds a bio, which is then converted into a request and placed on the issuing task's plug list (plug queue).
3.2 Elevator (I/O Scheduler)
The plug queue flushes into the elevator queue, where requests are merged, sorted, and prioritized (QoS). The scheduler then dispatches requests to the device driver's request_queue, which finally issues hardware commands.
3.3 Direct I/O (O_DIRECT) and Synchronous I/O (O_SYNC)
O_DIRECT bypasses the page cache; buffers, transfer sizes, and file offsets must be aligned to the device's logical block size, so buffers are typically allocated page‑aligned with posix_memalign. O_SYNC still goes through the page cache but forces the data to reach the disk before the call returns.
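A minimal O_DIRECT read sketch, assuming a file named testfile and a 4 KiB alignment (both chosen purely for illustration):

```c
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_RDONLY | O_DIRECT);   /* placeholder file name */
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT requires buffer address, length, and file offset to be aligned
       (typically to the logical block size; 4 KiB is a common safe choice). */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) { perror("posix_memalign"); return 1; }

    ssize_t n = read(fd, buf, 4096);   /* data bypasses the page cache */
    if (n < 0) perror("read");
    else printf("read %zd bytes directly from disk\n", n);

    free(buf);
    close(fd);
    return 0;
}
```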
3.4 I/O Scheduler Algorithms
Common schedulers include noop, deadline, and cfq. CFQ (Completely Fair Queuing) distributes disk time among processes much as the CPU scheduler distributes CPU time, and ionice can set the I/O class and priority of individual processes.
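ionice is a thin wrapper around the ioprio_set() system call; a sketch of doing the same from C. glibc provides no wrapper, so the IOPRIO_* constants are reproduced here for illustration (they mirror the kernel's definitions in linux/ioprio.h).

```c
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Constants copied from the kernel's I/O priority interface. */
#define IOPRIO_CLASS_SHIFT  13
#define IOPRIO_PRIO_VALUE(class, data)  (((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_WHO_PROCESS  1
#define IOPRIO_CLASS_BE     2   /* best-effort class, like `ionice -c2` */

int main(void)
{
    /* Give the current process best-effort priority 7 (lowest within the class),
       roughly equivalent to `ionice -c2 -n7 -p <pid>`. */
    if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 7)) < 0) {
        perror("ioprio_set");
        return 1;
    }
    printf("I/O priority lowered for pid %d\n", getpid());
    return 0;
}
```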
3.5 Cgroups and I/O Throttling
The cgroup blkio controller can assign proportional weights (blkio.weight) and throttle rates (blkio.throttle.read_bps_device) to limit I/O bandwidth per group; with the v1 controller these limits apply reliably to direct I/O, since buffered writes are submitted by kernel flusher threads rather than by the throttled task.
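A small sketch of setting such a throttle from C, assuming the v1 blkio controller is mounted at /sys/fs/cgroup/blkio, a cgroup named demo already exists, and the target device is 8:0 (all of these are illustrative assumptions):

```c
#include <stdio.h>

int main(void)
{
    /* Path assumes a cgroup v1 hierarchy and an existing "demo" group. */
    FILE *f = fopen("/sys/fs/cgroup/blkio/demo/blkio.throttle.read_bps_device", "w");
    if (!f) { perror("fopen"); return 1; }

    /* The file expects "<major>:<minor> <bytes per second>";
       here: device 8:0 limited to 1 MiB/s of reads. */
    fprintf(f, "8:0 1048576\n");
    fclose(f);
    return 0;
}
```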
3.6 Debugging Tools
Use ftrace to trace VFS functions, blktrace/blkparse for block‑level events, iotop/iostat for runtime statistics, and debugfs commands (stat, icheck, ncheck) to inspect inodes and block mappings.
4. Summary
The Linux kernel provides a rich set of I/O mechanisms—from simple blocking reads to sophisticated asynchronous and multiplexed models—supported by a flexible VFS layer, robust filesystem structures, journaling for consistency, and a highly tunable block I/O pipeline that can be monitored and controlled with a variety of debugging and cgroup tools.