
Understanding Linux Kernel I/O Mechanisms, File Systems, and Block I/O Scheduling

This article provides a comprehensive overview of Linux kernel I/O mechanisms, including file system interfaces, blocking and non‑blocking I/O, asynchronous models, multiplexing with select/epoll, EXT file‑system structures, VFS abstraction, consistency and journaling, as well as the complete block I/O path, scheduling algorithms, and debugging tools.


Linux kernel I/O mechanisms comprise a set of techniques and algorithms that enable efficient handling of input/output operations, ranging from file system interfaces and file descriptors to blocking, non‑blocking, asynchronous, and multiplexed I/O models.

1. I/O Models

The kernel supports several I/O models:

File system access through various virtual and real file systems (e.g., procfs, sysfs).

File descriptors, small non-negative integers that identify open files within a process.

Blocking I/O, where the calling process sleeps until the operation completes.

Non‑blocking I/O, which returns immediately with an error if data is not ready.

Asynchronous I/O, where the kernel notifies the application upon completion.

Multiplexing (select, poll, epoll) to monitor many descriptors simultaneously.

1.1 Blocking vs Non‑blocking

Blocking calls suspend the process until the I/O condition is satisfied; a signal can interrupt a blocked call (returning EINTR) if the handler was installed without SA_RESTART. Non‑blocking calls return immediately with EAGAIN/EWOULDBLOCK when the device is not ready; used alone they force busy‑polling, so in practice they are almost always combined with multiplexing rather than used in isolation.

1.2 Multiplexing

Linux provides select(), poll(), and the more scalable epoll family (epoll_create, epoll_ctl, epoll_wait). select rescans the entire descriptor set on every call and is limited to FD_SETSIZE (typically 1024) descriptors, while epoll separates registration from event notification, reducing overhead for large numbers of descriptors.

1.3 Asynchronous I/O

Two main implementations exist: glibc's POSIX AIO (the aio_* functions, implemented with user‑space helper threads) and kernel AIO (io_setup/io_submit, serviced inside the kernel and most effective with O_DIRECT). Both allow CPU work and I/O to progress in parallel.

1.4 libevent

Libevent offers a cross‑platform, event‑driven callback API (similar in spirit to Qt or VC callback mechanisms). Programs link against it with gcc xxx.c -levent, and it abstracts the underlying system calls (epoll on Linux, kqueue on BSD).

2. EXT File System

Linux uses a Virtual File System (VFS) layer to provide a uniform interface to multiple on‑disk formats (ext2/3/4, FAT, NTFS, etc.). An EXT2/3/4 partition consists of:

Boot block (the first 1 KiB, reserved for boot loaders by the PC standard and unused by the filesystem).

Superblock describing global filesystem parameters.

Group Descriptor Table (GDT) with per‑group metadata.

Block bitmap and inode bitmap indicating free blocks/inodes.

Inode table storing file metadata (type, permissions, timestamps, block pointers).

Data blocks holding file contents, directories, symbolic links, and special files.

Inodes contain 15 block pointers: 12 direct, one single‑indirect, one double‑indirect, and one triple‑indirect. With 1 KiB blocks this addresses files of roughly 16 GiB; larger block sizes raise the per‑inode limit accordingly (about 4 TiB with 4 KiB blocks).

2.1 Example: Creating and Inspecting a 1 MiB EXT2 Image

Create an empty 1 MiB image file:

dd if=/dev/zero of=fs count=256 bs=4k

Format the image:

mkfs.ext2 -b 1024 fs

Mount it via loop device:

sudo mount -o loop fs /mnt

Inspect metadata with dumpe2fs and debugfs to view superblock, group descriptors, block/inode bitmaps, and directory entries.

2.2 Consistency and Journaling

File‑system operations are not atomic; a power loss mid‑update can leave metadata inconsistent. Journaling records intended changes before they are applied, allowing recovery after a crash. ext3/ext4 (ext2 itself has no journal) support three journaling modes:

data=journal : full data and metadata are journaled (slow but safest).

data=ordered : only metadata is journaled, but data blocks are written to disk before the metadata that references them is committed (the default on many distributions).

data=writeback : only metadata is journaled; data ordering is not guaranteed (fastest).

2.3 Copy‑On‑Write (COW) Filesystems

Btrfs implements COW: new blocks are allocated for updates, and metadata is atomically switched to point to them, providing consistency without a traditional journal and enabling snapshots and subvolumes.

3. Block I/O Flow and Scheduling

3.1 From Page Cache to BIO to Request

When an application reads a file, the kernel first checks the page cache. A cache miss triggers a read‑page operation that creates a bio, which is then converted into a request and placed on the submitting task's plug list (a per‑process queue that batches requests before they reach the scheduler).

3.2 Elevator (I/O Scheduler)

The plug queue flushes into the elevator queue, where requests are merged, sorted, and prioritized (QoS). The scheduler then dispatches requests to the device driver's request_queue, which finally issues the hardware commands.

3.3 Direct I/O (O_DIRECT) and Synchronous I/O (O_SYNC)

O_DIRECT bypasses the page cache entirely, requiring buffers aligned to the device's logical block size (typically allocated with posix_memalign). O_SYNC still uses the page cache but forces each write to reach the disk before the call returns.

3.4 I/O Scheduler Algorithms

Common schedulers on legacy single‑queue kernels include noop, deadline, and cfq; modern blk‑mq kernels replace them with none, mq-deadline, bfq, and kyber. CFQ (Completely Fair Queuing) mimics process scheduling by giving each process its own queue, and ionice sets the I/O class and priority of individual processes.

3.5 Cgroups and I/O Throttling

The cgroup v1 blkio controller can assign proportional weights ( blkio.weight ) and throttle rates ( blkio.throttle.read_bps_device ) to limit I/O bandwidth per group; in v1, throttling accounts reliably only for direct (non‑buffered) I/O.

3.6 Debugging Tools

Use ftrace to trace VFS functions, blktrace/blkparse for block‑level events, iotop/iostat for runtime statistics, and debugfs commands (stat, icheck, ncheck) to inspect inodes and block mappings.

4. Summary

The Linux kernel provides a rich set of I/O mechanisms—from simple blocking reads to sophisticated asynchronous and multiplexed models—supported by a flexible VFS layer, robust filesystem structures, journaling for consistency, and a highly tunable block I/O pipeline that can be monitored and controlled with a variety of debugging and cgroup tools.

Written by Deepin Linux

Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.
