Understanding Linux Kernel I/O Mechanisms, File Systems, and Block I/O Scheduling
This article provides a comprehensive overview of Linux kernel I/O mechanisms, including file system interfaces, blocking and non‑blocking I/O, asynchronous models, multiplexing with select/epoll, EXT file‑system structures, VFS abstraction, consistency and journaling, as well as the complete block I/O path, scheduling algorithms, and debugging tools.
Linux kernel I/O mechanisms comprise a set of techniques and algorithms that enable efficient handling of input/output operations, ranging from file system interfaces and file descriptors to blocking, non‑blocking, asynchronous, and multiplexed I/O models.
1. I/O Models
The kernel supports several I/O models:
File system access through various virtual and real file systems (e.g., procfs, sysfs).
File descriptors that uniquely identify open files within a process.
Blocking I/O, where the calling process sleeps until the operation completes.
Non‑blocking I/O, which returns immediately with an error if data is not ready.
Asynchronous I/O, where the kernel notifies the application upon completion.
Multiplexing (select, poll, epoll) to monitor many descriptors simultaneously.
1.1 Blocking vs Non‑blocking
Blocking calls suspend the process until the I/O condition is satisfied; a signal can interrupt a blocked call (it fails with EINTR) unless the handler was installed with SA_RESTART. Non‑blocking calls return immediately with EAGAIN/EWOULDBLOCK when the device is not ready; busy‑polling a descriptor this way wastes CPU, so non‑blocking I/O is rarely used on its own and is usually combined with multiplexing.
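A minimal sketch of non‑blocking reads (watching standard input purely for illustration): fcntl() sets O_NONBLOCK, and a read() that would otherwise sleep returns -1 with EAGAIN instead.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Put stdin into non-blocking mode. */
    int flags = fcntl(STDIN_FILENO, F_GETFL, 0);
    fcntl(STDIN_FILENO, F_SETFL, flags | O_NONBLOCK);

    char buf[128];
    ssize_t n = read(STDIN_FILENO, buf, sizeof(buf));
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        /* No data ready: the call returns immediately instead of sleeping. */
        printf("no data available yet\n");
    } else if (n > 0) {
        printf("read %zd bytes\n", n);
    }
    return 0;
}
```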
1.2 Multiplexing
Linux provides select(), poll(), and the more scalable epoll(). select linearly scans all monitored descriptors on every call (and the descriptor set must be rebuilt each time), while epoll separates registration from event notification, reducing overhead for large numbers of descriptors.
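A minimal epoll sketch (again watching standard input, just as an illustration): the descriptor is registered once with epoll_ctl(), and epoll_wait() then returns only the descriptors that are actually ready.

```c
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
    int epfd = epoll_create1(0);

    /* Register interest in readability of stdin; registration happens once. */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = STDIN_FILENO };
    epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev);

    /* Wait for events; only ready descriptors are returned, so the cost does
       not grow with the number of registered descriptors as it does for select(). */
    struct epoll_event events[16];
    int n = epoll_wait(epfd, events, 16, 5000);   /* 5 s timeout */
    for (int i = 0; i < n; i++)
        printf("fd %d is readable\n", events[i].data.fd);

    close(epfd);
    return 0;
}
```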
1.3 Asynchronous I/O
Two main implementations exist: Glibc‑AIO, which emulates asynchrony with user‑space threads, and Kernel‑AIO, where submission and completion are handled natively inside the kernel; both allow CPU work and I/O to proceed in parallel.
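A minimal sketch of the glibc POSIX AIO interface, assuming a file named testfile exists (older glibc versions need -lrt when linking): aio_read() submits the request and returns at once, and completion is polled here with aio_error()/aio_return().

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_RDONLY);   /* "testfile" is a placeholder name */
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) < 0) { perror("aio_read"); return 1; }  /* submit and return */

    /* The CPU is free to do other work here while the read is in flight. */

    while (aio_error(&cb) == EINPROGRESS)
        usleep(1000);            /* poll for completion; a signal or thread
                                    notification could be used instead */
    ssize_t n = aio_return(&cb);
    printf("asynchronously read %zd bytes\n", n);
    close(fd);
    return 0;
}
```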
1.4 libevent
Libevent offers a cross‑platform, event‑driven API similar in spirit to Qt/VC callback mechanisms. Programs that use it are compiled with gcc xxx.c -levent, and the library abstracts away the underlying readiness mechanism (epoll, kqueue, and so on).
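A minimal libevent 2 sketch along those lines, once more watching standard input for readability; the file name demo.c is a placeholder, built with gcc demo.c -levent.

```c
#include <event2/event.h>
#include <stdio.h>
#include <unistd.h>

/* Called by libevent whenever the watched descriptor becomes readable. */
static void on_readable(evutil_socket_t fd, short events, void *arg)
{
    char buf[256];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n > 0)
        fwrite(buf, 1, (size_t)n, stdout);
    else
        event_base_loopbreak((struct event_base *)arg);  /* EOF/error: stop loop */
}

int main(void)
{
    struct event_base *base = event_base_new();
    /* Watch stdin; EV_PERSIST keeps the event registered after each callback. */
    struct event *ev = event_new(base, STDIN_FILENO, EV_READ | EV_PERSIST,
                                 on_readable, base);
    event_add(ev, NULL);           /* no timeout */
    event_base_dispatch(base);     /* run the event loop */
    event_free(ev);
    event_base_free(base);
    return 0;
}
```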
2. EXT File System
Linux uses a Virtual File System (VFS) layer to provide a uniform interface to multiple on‑disk formats (ext2/3/4, FAT, NTFS, etc.). An EXT2/3/4 partition consists of:
Boot block (1 KB, reserved by the PC standard).
Superblock describing global filesystem parameters.
Group Descriptor Table (GDT) with per‑group metadata.
Block bitmap and inode bitmap indicating free blocks/inodes.
Inode table storing file metadata (type, permissions, timestamps, block pointers).
Data blocks holding file contents, directories, symbolic links, and special files.
Inodes contain 15 block pointers: 12 direct, one single‑indirect, one double‑indirect, and one triple‑indirect, enabling files of up to roughly 16 GB with 1 KB blocks (and about 4 TB with 4 KB blocks) to be addressed.
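The limit follows directly from the pointer layout; a small sketch of the arithmetic (each block pointer is 4 bytes):

```c
#include <stdio.h>

/* Maximum file size addressable through an ext2 inode's 15 block pointers,
   for a given block size in bytes. */
static unsigned long long ext2_max_file_size(unsigned long long block)
{
    unsigned long long ptrs = block / 4;   /* pointers per indirect block */
    return (12 + ptrs + ptrs * ptrs + ptrs * ptrs * ptrs) * block;
}

int main(void)
{
    printf("1 KiB blocks: ~%llu GiB\n", ext2_max_file_size(1024) >> 30);  /* ~16 GiB */
    printf("4 KiB blocks: ~%llu TiB\n", ext2_max_file_size(4096) >> 40);  /* ~4 TiB  */
    return 0;
}
```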
2.1 Example: Creating and Inspecting a 1 MiB EXT2 Image
Create a 1 MiB image:
dd if=/dev/zero of=fs count=256 bs=4k
Format the image:
mkfs.ext2 -b 1024 fs
Mount it via a loop device:
sudo mount -o loop fs /mnt
Inspect the metadata with dumpe2fs and debugfs to view the superblock, group descriptors, block/inode bitmaps, and directory entries.
2.2 Consistency and Journaling
File‑system operations are not atomic; a power loss in the middle of an update can leave metadata inconsistent. Journaling records intended changes before they are applied, allowing recovery after a crash. ext3/ext4 support three journaling modes:
data=journal: both data and metadata are journaled (slowest but safest).
data=ordered: only metadata is journaled, but data blocks are written to disk before the corresponding metadata is committed (the default on many distributions).
data=writeback: only metadata is journaled; no ordering between data and metadata is guaranteed (fastest).
2.3 Copy‑On‑Write (COW) Filesystems
Btrfs implements COW: new blocks are allocated for updates, and metadata is atomically switched to point to them, providing consistency without a traditional journal and enabling snapshots and subvolumes.
3. Block I/O Flow and Scheduling
3.1 From Page Cache to BIO to Request
When an application reads a file, the kernel first checks the page cache. A cache miss triggers a readpage operation that builds a bio, which is then converted into a request and placed on the issuing task's plug list (plug queue).
3.2 Elevator (I/O Scheduler)
The plug queue flushes into the elevator queue, where requests are merged, sorted, and prioritized (QoS). The scheduler then dispatches requests to the device driver's request_queue, which finally issues hardware commands.
3.3 Direct I/O (O_DIRECT) and Synchronous I/O (O_SYNC)
O_DIRECT bypasses the page cache; buffers, transfer sizes, and file offsets must be aligned to the device's logical block size, so buffers are typically allocated page‑aligned with posix_memalign. O_SYNC still goes through the page cache but forces the data to reach the disk before the call returns.
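A minimal O_DIRECT read sketch, assuming a file named testfile and a 4 KiB alignment (both chosen purely for illustration):

```c
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_RDONLY | O_DIRECT);   /* placeholder file name */
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT requires buffer address, length, and file offset to be aligned
       (typically to the logical block size; 4 KiB is a common safe choice). */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) { perror("posix_memalign"); return 1; }

    ssize_t n = read(fd, buf, 4096);   /* data bypasses the page cache */
    if (n < 0) perror("read");
    else printf("read %zd bytes directly from disk\n", n);

    free(buf);
    close(fd);
    return 0;
}
```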
3.4 I/O Scheduler Algorithms
Common schedulers include noop, deadline, and cfq. CFQ (Completely Fair Queuing) distributes disk time among processes much as the CPU scheduler distributes CPU time, and ionice can set the I/O class and priority of individual processes.
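ionice is a thin wrapper around the ioprio_set() system call; a sketch of doing the same from C. glibc provides no wrapper, so the IOPRIO_* constants are reproduced here for illustration (they mirror the kernel's definitions in linux/ioprio.h).

```c
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Constants copied from the kernel's I/O priority interface. */
#define IOPRIO_CLASS_SHIFT  13
#define IOPRIO_PRIO_VALUE(class, data)  (((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_WHO_PROCESS  1
#define IOPRIO_CLASS_BE     2   /* best-effort class, like `ionice -c2` */

int main(void)
{
    /* Give the current process best-effort priority 7 (lowest within the class),
       roughly equivalent to `ionice -c2 -n7 -p <pid>`. */
    if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 7)) < 0) {
        perror("ioprio_set");
        return 1;
    }
    printf("I/O priority lowered for pid %d\n", getpid());
    return 0;
}
```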
3.5 Cgroups and I/O Throttling
The cgroup blkio controller can assign proportional weights (blkio.weight) and throttle rates (blkio.throttle.read_bps_device) to limit I/O bandwidth per group; with the v1 controller these limits apply reliably to direct I/O, since buffered writes are submitted by kernel flusher threads rather than by the throttled task.
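A small sketch of setting such a throttle from C, assuming the v1 blkio controller is mounted at /sys/fs/cgroup/blkio, a cgroup named demo already exists, and the target device is 8:0 (all of these are illustrative assumptions):

```c
#include <stdio.h>

int main(void)
{
    /* Path assumes a cgroup v1 hierarchy and an existing "demo" group. */
    FILE *f = fopen("/sys/fs/cgroup/blkio/demo/blkio.throttle.read_bps_device", "w");
    if (!f) { perror("fopen"); return 1; }

    /* The file expects "<major>:<minor> <bytes per second>";
       here: device 8:0 limited to 1 MiB/s of reads. */
    fprintf(f, "8:0 1048576\n");
    fclose(f);
    return 0;
}
```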
3.6 Debugging Tools
Use ftrace to trace VFS functions, blktrace/blkparse for block‑level events, iotop/iostat for runtime statistics, and debugfs commands (stat, icheck, ncheck) to inspect inodes and block mappings.
4. Summary
The Linux kernel provides a rich set of I/O mechanisms—from simple blocking reads to sophisticated asynchronous and multiplexed models—supported by a flexible VFS layer, robust filesystem structures, journaling for consistency, and a highly tunable block I/O pipeline that can be monitored and controlled with a variety of debugging and cgroup tools.