Demystifying Linux I/O: From VFS and Inodes to ZFS and Block Layer
This article explains how Linux handles I/O operations, covering the virtual file system, inode and dentry structures, superblock layout, ZFS features, disk types, the generic block layer, I/O scheduling strategies, and key performance metrics for storage.
File System
What is a file system
File systems are mechanisms that organize and manage files on storage devices; different organization methods produce different file systems such as Ext4, XFS, ZFS, and NFS.
Application developers usually interact only with system calls like open, read, write, and close, without worrying about the underlying file system type, disk interface, or storage medium.
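As a rough illustration of that interface, the sketch below uses Python's os module, whose os.open, os.write, os.read, and os.close functions are thin wrappers over the corresponding system calls. The file name demo.txt and the helper write_then_read are made up for this example; the same code runs unchanged whichever file system backs the temporary directory.

```python
import os
import tempfile


def write_then_read(path, payload):
    """Write payload with open/write/close, then read it back with open/read/close."""
    # open(2): create or truncate the file and get a file descriptor
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # write(2): a small write to a regular file normally completes in one call
        os.write(fd, payload)
    finally:
        os.close(fd)  # close(2)

    fd = os.open(path, os.O_RDONLY)
    try:
        # read(2): ask for as many bytes as were written
        return os.read(fd, len(payload))
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    data = write_then_read(os.path.join(tmp, "demo.txt"), b"hello vfs\n")
    print(data)
```

Nothing in the helper names Ext4, XFS, or ZFS: the kernel routes each call to the right file system implementation.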
How the file system works (VFS)
Linux files
In Linux, everything is a file, including regular files, directories, block devices, sockets, and pipes.
brw-r--r-- 1 root root 1, 2 Apr 25 11:03 bnod // block device file
crw-r--r-- 1 root root 1, 2 Apr 25 11:04 cnod // character device file
drwxr-xr-x 2 user user 6 Apr 25 11:01 dir // directory
-rw-r--r-- 1 user user 0 Apr 25 11:01 file // regular file
prw-r--r-- 1 root root 0 Apr 25 11:04 pipeline // named pipe
srwxr-xr-x 1 root root 0 Apr 25 11:06 socket.sock // socket file
lrwxrwxrwx 1 root root 4 Apr 25 11:04 softlink -> file // symbolic link
-rw-r--r-- 2 user user 0 Apr 25 11:07 hardlink // hard link (also a regular file)
inode (index node): stores metadata such as inode number, size, permissions, timestamps, and data location.
dentry (directory entry): stores the file name, inode pointer, and directory hierarchy.
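The split between names and inodes can be observed from user space. The following sketch, which assumes a POSIX file system that supports hard and symbolic links, recreates the file, hardlink, and softlink entries from the listing above in a temporary directory and compares their inode numbers. The helper name link_demo is made up for this example.

```python
import os
import stat
import tempfile


def link_demo(directory):
    """Create a file, a hard link and a symlink; return the lstat() result of each."""
    original = os.path.join(directory, "file")
    hard = os.path.join(directory, "hardlink")
    soft = os.path.join(directory, "softlink")

    with open(original, "w"):
        pass                      # empty regular file
    os.link(original, hard)       # second name for the same inode
    os.symlink("file", soft)      # separate inode whose content is a path

    # lstat() does not follow the symlink, so softlink reports its own inode
    return {name: os.lstat(os.path.join(directory, name))
            for name in ("file", "hardlink", "softlink")}


with tempfile.TemporaryDirectory() as tmp:
    for name, st in link_demo(tmp).items():
        print(name, st.st_ino, st.st_nlink, stat.filemode(st.st_mode))
```

file and hardlink should report the same inode number with a link count of 2, while softlink has an inode of its own.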
inode and dentry
An inode records a file's metadata; it is persisted on disk and therefore occupies storage space.
stat file
File: file
Size: 0 Blocks: 0 IO Block: 4096 regular empty file
Device: fe21h/65057d Inode: 32828 Links: 2
Access: (0644/-rw-r--r--) Uid: ( 3041/ user) Gid: ( 3041/ user)
Access: 2021-04-25 11:07:59.603745534 +0800
Modify: 2021-04-25 11:07:59.603745534 +0800
Change: 2021-04-25 11:08:04.739848692 +0800
Birth: -
A dentry keeps the file name, a pointer to the inode, and the relationship to other dentries, which together form the directory tree. Dentries are maintained in memory, in the dentry cache.
tree
.
├── dir
│ └── file_in_dir
├── file
└── hardlink
ZFS
ZFS is a widely used file system; many database applications rely on it.
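Typical ZFS hierarchy: physical disks are grouped into virtual devices (vdevs), vdevs are combined into a storage pool (zpool), and filesystems are created on top of the pool.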
ZFS operations
Create zpool
root@:~ # zpool create tank raidz /dev/ada1 /dev/ada2 /dev/ada3 raidz /dev/ada4 /dev/ada5 /dev/ada6
root@:~ # zpool list tank
NAME  SIZE  ALLOC  FREE   CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
tank  11G   824K   11.0G  -        -         0%    0%   1.00x  ONLINE  -
root@:~ # zpool status tank
pool: tank
state: ONLINE
scan: none requested
config:
NAME          STATE   READ WRITE CKSUM
tank          ONLINE     0     0     0
  raidz1-0    ONLINE     0     0     0
    ada1      ONLINE     0     0     0
    ada2      ONLINE     0     0     0
    ada3      ONLINE     0     0     0
  raidz1-1    ONLINE     0     0     0
    ada4      ONLINE     0     0     0
    ada5      ONLINE     0     0     0
ada6 ONLINE 0 0 0
This creates a zpool named tank from two RAID-Z (RAID 5-like) groups of three disks each.
Create ZFS filesystem
root@:~ # zfs create -o mountpoint=/mnt/srev tank/srev
root@:~ # df -h tank/srev
Filesystem Size Used Avail Capacity Mounted on
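tank/srev 7.1G 117K 7.1G 0% /mnt/srev
This creates a ZFS filesystem mounted at /mnt/srev. With no quota set, its available space is the pool's whole usable capacity (7.1G here; the 11G that zpool list reports is raw capacity that still includes RAID-Z parity).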
Set ZFS quota
root@:~ # zfs set quota=1G tank/srev
root@:~ # df -h tank/srev
Filesystem Size Used Avail Capacity Mounted on
tank/srev 1.0G 118K 1.0G 0% /mnt/srev
ZFS features
Pooled storage: a zpool can be expanded dynamically, and multiple filesystems share the same pool without pre-allocation.
Transactional filesystem: writes are atomic (copy-on-write), so a power loss cannot leave a partially written update on disk.
ARC cache: the Adaptive Replacement Cache balances recency (LRU) against frequency (LFU) according to the workload. It uses four lists: LRU, LFU, and a ghost list for each that remembers the keys of recently evicted entries.
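To make the four-list idea concrete, here is a minimal sketch of the textbook ARC replacement policy. It is not ZFS's actual implementation, which differs in many details. The names ARCCache, t1 (the LRU list), t2 (the LFU list), b1 and b2 (their ghost lists), and p (the adaptive target size of t1) are choices made for this example, and the capacity is assumed to be at least 1.

```python
from collections import OrderedDict


class ARCCache:
    """Toy ARC: t1 = seen once, t2 = seen again, b1/b2 = ghost keys of evicted entries."""

    def __init__(self, capacity):
        self.c = capacity
        self.p = 0                  # target size of t1, adapted on ghost hits
        self.t1 = OrderedDict()     # recency list (key -> value), oldest first
        self.t2 = OrderedDict()     # frequency list (key -> value), oldest first
        self.b1 = OrderedDict()     # ghost list for t1 (keys only)
        self.b2 = OrderedDict()     # ghost list for t2 (keys only)

    def _replace(self, hit_in_b2):
        """Evict one cached entry into the matching ghost list."""
        # "not self.t2" is a defensive guard that the invariants should make unreachable
        if self.t1 and (len(self.t1) > self.p
                        or (hit_in_b2 and len(self.t1) == self.p)
                        or not self.t2):
            key, _ = self.t1.popitem(last=False)
            self.b1[key] = None
        else:
            key, _ = self.t2.popitem(last=False)
            self.b2[key] = None

    def get(self, key):
        if key in self.t1:                      # second access: promote to t2
            self.t2[key] = self.t1.pop(key)
            return self.t2[key]
        if key in self.t2:                      # repeated access: refresh position
            self.t2.move_to_end(key)
            return self.t2[key]
        return None                             # ghost entries hold no data

    def put(self, key, value):
        if key in self.t1 or key in self.t2:    # already cached: treat as a hit
            self.t1.pop(key, None)
            self.t2[key] = value
            self.t2.move_to_end(key)
            return
        if key in self.b1:                      # recency ghost hit: grow t1's share
            self.p = min(self.c, self.p + max(1, len(self.b2) // len(self.b1)))
            self._replace(False)
            del self.b1[key]
            self.t2[key] = value
            return
        if key in self.b2:                      # frequency ghost hit: shrink t1's share
            self.p = max(0, self.p - max(1, len(self.b1) // len(self.b2)))
            self._replace(True)
            del self.b2[key]
            self.t2[key] = value
            return
        # Complete miss: make room, then insert as "seen once".
        l1 = len(self.t1) + len(self.b1)
        total = l1 + len(self.t2) + len(self.b2)
        if l1 == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False)
                self._replace(False)
            else:
                self.t1.popitem(last=False)     # t1 fills the cache: drop its oldest
        elif total >= self.c:
            if total == 2 * self.c:
                self.b2.popitem(last=False)
            self._replace(False)
        self.t1[key] = value
```

A hit in a ghost list means an entry was evicted too early, so the policy shifts capacity toward the list that entry came from. That feedback is what lets the cache lean toward LRU or LFU behaviour as the workload changes.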
Disk Types
Storage media
HDD (mechanical hard drive)
SSD (solid‑state drive)
Interfaces
IDE
SCSI
SAS
SATA
Linux disk management
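Disks appear as block devices identified by a major and a minor number: the major number selects the driver and the minor number selects the individual disk or partition. For example, /dev/sda has major number 8, which indicates an sd-type block device.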
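The same numbers can be read programmatically. This sketch uses os.stat together with os.major and os.minor to split a device node's st_rdev field. The helper describe_device is made up for this example, and /dev/null is used only because it exists on practically every system, whereas /dev/sda may not.

```python
import os
import stat


def describe_device(path):
    """Return (kind, major, minor) for a block or character device node."""
    st = os.stat(path)
    if stat.S_ISBLK(st.st_mode):
        kind = "block"
    elif stat.S_ISCHR(st.st_mode):
        kind = "char"
    else:
        raise ValueError(f"{path} is not a device node")
    # st_rdev packs both numbers; os.major/os.minor unpack them
    return kind, os.major(st.st_rdev), os.minor(st.st_rdev)


# Pack and unpack the numbers shown for /dev/sda1 (8, 1) without needing the device
dev = os.makedev(8, 1)
print(os.major(dev), os.minor(dev))

print(describe_device("/dev/null"))
```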
ls -l /dev/sda*
brw-rw---- 1 root disk 8, 0 Apr 25 15:53 /dev/sda
brw-rw---- 1 root disk 8, 1 Apr 25 15:53 /dev/sda1
...
Generic Block Layer
The Generic Block Layer abstracts heterogeneous block devices for the VFS and provides a unified framework for drivers and I/O scheduling.
I/O Scheduling
Classic single‑queue schedulers:
NOOP – simple FIFO with basic request merging.
CFQ – Completely Fair Queueing, gives each process a fair share.
Deadline – prioritises requests that approach their deadline.
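The deadline idea can be sketched in a few lines. The toy model below is heavily simplified and is not the kernel's algorithm: it ignores the separate read and write queues and the batching of the real scheduler. It serves requests in ascending sector order unless the oldest request has waited past its expiry time. The class name ToyDeadlineScheduler and the expire parameter are assumptions made for this example.

```python
import bisect


class ToyDeadlineScheduler:
    """Serve in sector order, but let an expired request jump the queue."""

    def __init__(self, expire=0.5):
        self.expire = expire      # seconds a request may wait before it is forced
        self.by_sector = []       # sorted list of (sector, request_id)
        self.by_arrival = []      # FIFO list of (deadline, sector, request_id)
        self.next_id = 0
        self.position = 0         # sector of the last dispatched request

    def add(self, sector, now):
        rid = self.next_id
        self.next_id += 1
        bisect.insort(self.by_sector, (sector, rid))
        self.by_arrival.append((now + self.expire, sector, rid))

    def dispatch(self, now):
        """Return the sector of the next request to serve, or None if idle."""
        if not self.by_arrival:
            return None
        deadline, sector, rid = self.by_arrival[0]
        if deadline > now:
            # Oldest request has not expired: continue the sweep in sector order
            i = bisect.bisect_left(self.by_sector, (self.position, -1))
            if i == len(self.by_sector):
                i = 0             # wrap around to the lowest sector
            sector, rid = self.by_sector[i]
        # Remove the chosen request from both views
        self.by_sector.remove((sector, rid))
        self.by_arrival = [r for r in self.by_arrival if r[2] != rid]
        self.position = sector
        return sector


sched = ToyDeadlineScheduler(expire=1.0)
sched.add(500, now=0.0)
sched.add(100, now=0.9)
sched.add(300, now=0.9)
print(sched.dispatch(now=0.95), sched.dispatch(now=1.2), sched.dispatch(now=1.3))
```

In this run the request for sector 500 is served before sector 300 because its deadline has passed, even though 300 comes first in sector order.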
Multi‑queue (blk‑mq) schedulers:
BFQ – Budget Fair Queueing, gives each process a budget measured in sectors so that bandwidth, rather than time slices, is shared fairly.
Kyber – maintains separate sync/async queues and limits outstanding requests.
mq‑deadline – multi‑queue version of Deadline.
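On a running system, the schedulers available for a device are listed in /sys/block/<device>/queue/scheduler, with the active one in square brackets. The sketch below parses that format. The helper names and the sample line are made up for this example, and the device name passed to read_scheduler is an assumption about the machine it runs on.

```python
from pathlib import Path


def parse_scheduler_line(line):
    """Split a line such as 'mq-deadline kyber [bfq] none' into (active, available)."""
    active = None
    available = []
    for name in line.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]     # the bracketed entry is the active scheduler
            active = name
        available.append(name)
    return active, available


def read_scheduler(device):
    """Read the scheduler list for a block device, e.g. read_scheduler('sda')."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_scheduler_line(path.read_text())


print(parse_scheduler_line("mq-deadline kyber [bfq] none"))
```

Writing one of the listed names to the same file as root switches the scheduler for that device.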
Performance Metrics
Common I/O performance indicators:
Utilisation (%util in iostat) – percentage of time the disk spends handling I/O.
IOPS – number of I/O operations per second.
Throughput/Bandwidth – amount of data transferred per second (MB/s or GB/s).
Latency – time from issuing an I/O request to receiving a response.
Saturation – the degree to which the disk has more work than it can service, often inferred from queue length or latency.
Typical monitoring commands:
iostat -d -x – shows per‑device I/O statistics.
pidstat -d – shows I/O of individual processes.
iotop – interactive view of processes sorted by I/O usage.
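Most of these metrics are derived from the cumulative counters in /proc/diskstats, sampled twice and divided by the interval. The sketch below shows that arithmetic under the usual field layout (counts, 512-byte sectors, and millisecond timers). The function names and the two sample lines are made up for this example, and real tools handle more cases, such as counter wrap-around and discard statistics.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors


def parse_diskstats_line(line):
    """Pick the counters needed for the metrics out of one /proc/diskstats line."""
    f = line.split()
    return {
        "name": f[2],
        "reads": int(f[3]),            # reads completed
        "sectors_read": int(f[5]),
        "ms_reading": int(f[6]),       # total time spent on reads
        "writes": int(f[7]),           # writes completed
        "sectors_written": int(f[9]),
        "ms_writing": int(f[10]),      # total time spent on writes
        "ms_busy": int(f[12]),         # time the device had I/O in flight
        "weighted_ms": int(f[13]),     # busy time weighted by queue depth
    }


def io_metrics(before, after, interval_s):
    """Turn two samples taken interval_s seconds apart into rate metrics."""
    def delta(key):
        return after[key] - before[key]

    ios = delta("reads") + delta("writes")
    moved = (delta("sectors_read") + delta("sectors_written")) * SECTOR_BYTES
    return {
        "iops": ios / interval_s,
        "throughput_mb_s": moved / interval_s / 1e6,
        "util_pct": 100.0 * delta("ms_busy") / (interval_s * 1000.0),
        "await_ms": (delta("ms_reading") + delta("ms_writing")) / ios if ios else 0.0,
        "avg_queue": delta("weighted_ms") / (interval_s * 1000.0),
    }


# Two made-up samples taken one second apart
first = parse_diskstats_line("8 0 sda 1000 0 80000 500 2000 0 160000 1500 0 900 2000")
second = parse_diskstats_line("8 0 sda 1100 0 88000 600 2100 0 176000 1700 0 1400 2400")
print(io_metrics(first, second, interval_s=1.0))
```

Here await_ms stands in for latency and avg_queue is one common proxy for saturation.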
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.