Understanding the Linux Block Layer: Multi‑Queue Architecture and I/O Schedulers
The Linux block layer sits between the VFS and device drivers. It evolved from a single‑queue design to a multi‑queue architecture (introduced in kernel 3.13) to reduce lock contention and exploit the multiple hardware queues of modern devices. It manages the bio/request lifecycle, dispatches I/O through a hierarchy of queues, and offers several I/O schedulers that trade off throughput and latency.
Early Linux block frameworks used a single‑queue architecture that matched the capabilities of hardware with a single I/O queue (e.g., mechanical disks). As storage devices evolved to support multiple hardware queues (e.g., NVMe SSDs), the block layer transitioned to a multi‑queue design. Multi‑queue support was first added in kernel 3.13, became stable over several years, and is the default in Linux 5.0+.
The block layer sits between the virtual file system (VFS) and device drivers. When a user issues a read/write, the request traverses VFS → file system → block layer → driver → device. After the device finishes processing, it generates an interrupt that notifies the driver, and the block layer’s soft‑irq handles completion.
Functions of the block layer
Management of I/O requests: buffering, merging, and ordering of requests. This includes handling both single‑queue and multi‑queue frameworks and the various I/O schedulers.
I/O statistics: tracking per‑process I/O via struct task_io_accounting.
Dispatching requests to device drivers: drivers pull requests from the block layer's dispatch queue and translate them into device‑specific commands (cmd).
I/O request lifecycle
An I/O operation is represented by a bio, which may be merged or split and is then transformed into a request (rq). Multiple bio objects can be merged into a single request. The request is eventually turned into a cmd that the driver sends to the hardware.
The bio structure follows the scatter‑gather model of POSIX vectored I/O, describing a list of bio_vec entries (page address, offset, length). The request is the unit scheduled by the block layer; the number of requests is limited (by default q->nr_requests = BLKDEV_MAX_RQ = 128). When the number of pending requests exceeds 7/8 * q->nr_requests, the queue enters a congested state and new request generation is throttled.
Single‑queue vs. Multi‑queue
Single‑queue systems have one software queue protected by request_queue->queue_lock . This design incurs lock contention, two interrupts per I/O (hardware + IPI), and remote memory accesses when the submitting CPU differs from the interrupt‑handling CPU.
Multi‑queue assigns a software staging queue to each CPU (the software context, blk_mq_ctx) and maps these onto hardware dispatch queues (the hardware contexts, blk_mq_hw_ctx). This reduces lock contention and improves parallelism, especially on devices with many hardware queues.
Data structures
The block layer uses two main categories of structures: I/O requests (bio, request, cmd) and the queues that manage them. The bio is the smallest unit, describing the memory location of the I/O. The request aggregates one or more bio objects and is the unit scheduled by the elevator.
Queue hierarchy
All bio objects are submitted via submit_bio and pass through several queues:
Process‑private plug list – a lock‑free buffer for short‑term merging.
Scheduler queue (elevator q) – implements policies such as noop, deadline, CFQ for single‑queue, and mq‑deadline, BFQ, Kyber for multi‑queue.
Device dispatch queue – software queue from which the driver pulls requests.
Hardware queue (HW q) – the actual device‑side queue (e.g., NVMe supports up to 64K queues).
Common I/O schedulers
Single‑queue schedulers: noop, deadline, CFQ. Multi‑queue schedulers: none (noop‑like), mq‑deadline, BFQ, Kyber.
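On a running system, the active scheduler can be inspected and switched through sysfs. The device name below is an example; substitute your own block device. The scheduler shown in brackets is the one currently in use:

```shell
# List available schedulers for the device; the active one is bracketed.
cat /sys/block/nvme0n1/queue/scheduler
# e.g.: [none] mq-deadline kyber bfq

# Switch to mq-deadline (requires root).
echo mq-deadline | sudo tee /sys/block/nvme0n1/queue/scheduler
```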
Noop simply FIFO‑queues requests, performing only forward/backward sector merging.
CFQ (Completely Fair Queuing) gives each process a virtual time slice proportional to its priority, using a red‑black tree to select the next request.
BFQ (Budget Fair Queuing) allocates a budget (in sectors) rather than time, ensuring fairness based on data volume and improving interactive‑process responsiveness.
Conclusion
The block layer has evolved from a single‑queue to a multi‑queue architecture to keep pace with faster storage devices. Modern schedulers balance throughput and latency, providing better user experience on both interactive and batch workloads.
OPPO Kernel Craftsman
Sharing Linux kernel-related cutting-edge technology, technical articles, technical news, and curated tutorials