
Why Disk I/O Spikes Freeze Your System: A Deep Dive with iostat, iotop, blktrace, fio & bpftrace

When iowait jumps to 80% and commands stall, this guide walks through the Linux I/O stack, explains key iostat metrics, and shows how to pinpoint offending processes with iotop, then uses blktrace, btt, fio, and bpftrace to diagnose and tune storage performance.

Raymond Ops

Overview

High iowait on a Linux server often indicates a disk I/O bottleneck. The article walks through a complete, layered troubleshooting workflow – from discovery with iostat and iotop, through low‑level tracing with blktrace and btt, to performance baseline testing with fio and real‑time latency histograms with bpftrace. It also covers practical tuning of the I/O scheduler, readahead, dirty‑page settings and cgroup v2 limits.

Linux I/O Stack

An I/O request traverses the following layers:

Application → VFS (Page Cache) → Filesystem (ext4/xfs/btrfs) → Block Layer (merge, scheduler, multi‑queue) → Device driver (NVMe/SCSI/virtio‑blk) → Physical device (NVMe SSD / SATA SSD / HDD)

Each layer can become a bottleneck, so the analysis proceeds layer by layer.

VFS & Page Cache

If free -h shows a tiny buff/cache, or sar -B reports a sustained high pgpgin/s, reads are missing the page cache and going to disk.

Writes land in dirty pages first and are flushed asynchronously by the kernel's writeback (flusher) threads; pdflush existed only on kernels before 2.6.32.

Filesystem Layer

Ext4 journal writes and XFS log writes add extra I/O.

Severe fragmentation turns sequential reads into random reads; use filefrag to inspect.

Block Layer

Visible to iostat. Requests are merged, sorted and scheduled.

Modern kernels use the multi-queue block layer (blk-mq), with one software queue per CPU core.

Device Driver & Physical Device

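NVMe devices expose many hardware queues: the specification allows up to 64 K I/O queues, each up to 64 K commands deep, and the driver typically creates about one queue pair per CPU core.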

SATA SSDs have a single NCQ queue (depth 32).

I/O Scheduler Selection

Linux 6.x provides four schedulers. Choose according to device type:

none – NVMe SSDs. No software scheduling; the hardware queue handles ordering.

mq-deadline – General-purpose SSDs, HDDs, virtual cloud disks. Deadline-based; favors reads over writes.

bfq – Desktop or mixed‑load workloads, especially multi‑tenant HDD scenarios. Provides per‑process fair bandwidth allocation.

kyber – Low‑latency SATA SSDs; target‑latency based auto‑adjusts queue depth (rarely used in production).

Persist the choice with a udev rule, for example:

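# /etc/udev/rules.d/60-io-scheduler.rules
# ENV{DEVTYPE}=="disk" skips partitions, which have no queue/scheduler attribute
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="none"
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"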

Reload the rules with udevadm control --reload-rules && udevadm trigger.
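
The selection table above can be expressed as a small helper for auditing a fleet before writing udev rules. This is only a sketch: the function name pick_scheduler and its third argument (a flag marking a shared, multi-tenant disk) are assumptions made for illustration, not part of any existing tool.

```shell
#!/usr/bin/env bash
# pick_scheduler NAME ROTATIONAL [SHARED]
# NAME       kernel device name, e.g. nvme0n1 or sda
# ROTATIONAL content of /sys/block/NAME/queue/rotational (0 or 1)
# SHARED     hypothetical flag: 1 = multi-tenant disk (defaults to 0)
pick_scheduler() {
    local name=$1 rotational=$2 shared=${3:-0}
    case $name in
        nvme*) echo none; return ;;      # hardware queues do the ordering
    esac
    if [ "$rotational" = 1 ] && [ "$shared" = 1 ]; then
        echo bfq                         # per-process fairness on shared HDDs
    else
        echo mq-deadline                 # general-purpose default
    fi
}

# Usage on a live system (reads sysfs, prints a suggestion, changes nothing):
# for q in /sys/block/*/queue; do
#     dev=${q%/queue}; dev=${dev##*/}
#     echo "$dev: $(pick_scheduler "$dev" "$(cat "$q/rotational")")"
# done
```

Comparing the suggestion with the active value in /sys/block/DEV/queue/scheduler (the entry in square brackets) shows which devices the udev rule would actually change.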

iostat Deep Interpretation

Typical usage:

# Refresh every second, extended mode, MB units
iostat -xmt 1
# Show only specific devices
iostat -xmt -d nvme0n1 sda 1

Key fields (excerpt):

r/s / w/s – Read and write IOPS. A single HDD tops out at roughly 100-200 random IOPS, so sustained values above ~150 mean it is at its limit; NVMe can sustain >100 k IOPS.

rkB/s / wkB/s – Throughput (displayed as rMB/s / wMB/s when -m is used). Compare with the device's rated bandwidth.

r_await / w_await – Average latency in ms (queue time + service time). The critical metric. Rough healthy ranges: HDD 5-15 ms, SATA SSD 0.05-0.5 ms, NVMe 0.01-0.03 ms.

aqu-sz – Average queue length; large values indicate the device cannot keep up.

%util – Share of time the device had work in flight. Meaningful for HDD, misleading for SSD/NVMe.

rareq-sz / wareq-sz – Average request size in KB. 4-8 KB suggests random I/O; ≥128 KB suggests sequential I/O.

The svctm field is deprecated and has been dropped from recent sysstat 12.x releases. Ignore it where it still appears: it is derived from %util and becomes inaccurate on devices that serve requests in parallel.

Practical decision matrix (copy‑pasteable):

# 1. await high + aqu-sz high + %util high → Disk truly busy (common on HDD)
# 2. await high + aqu-sz low + %util low → Single I/O slow, possible hardware issue or RAID degradation
# 3. await normal + w/s very high + wkB/s low → Many tiny writes; consider I/O merging or raise dirty_ratio
# 4. rrqm/s ≈ 0 + rareq-sz very small → Pure random reads, Page Cache ineffective
# 5. w_await >> r_await → Write bottleneck; check fsync frequency and journal mode
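
The first two rows of the matrix can be sketched as a script-friendly check. The function name classify_io and every threshold in it (20 ms, queue length 4 and 1, 90 % and 50 % utilization) are illustrative assumptions aimed at HDDs; they are not values defined by iostat and should be adjusted per device class.

```shell
#!/usr/bin/env bash
# classify_io AWAIT_MS AQU_SZ UTIL_PCT
# Applies rows 1 and 2 of the decision matrix with assumed, HDD-oriented thresholds.
classify_io() {
    awk -v await="$1" -v aqu="$2" -v util="$3" 'BEGIN {
        high_await = (await > 20)                       # assumed latency threshold
        if (high_await && aqu > 4 && util > 90)
            print "device saturated"
        else if (high_await && aqu < 1 && util < 50)
            print "slow single I/O, check hardware/RAID"
        else if (high_await)
            print "elevated latency, inspect further"
        else
            print "latency normal"
    }'
}

classify_io 45 12 99     # prints: device saturated
classify_io 60 0.4 20    # prints: slow single I/O, check hardware/RAID
```

The three arguments correspond to the await, aqu-sz and %util columns of one iostat -x sample for a single device.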

Per‑Process I/O with iotop

Identify the processes responsible for the observed disk load:

# Requires root (taskstats interface)
sudo iotop -oP
# Faster C implementation (recommended)
sudo iotop-c -oP

Sample output (truncated):

Total DISK READ: 125.50 M/s | Total DISK WRITE: 42.30 M/s
Actual DISK READ: 125.50 M/s | Actual DISK WRITE: 8.75 M/s
 TID  PRIO  USER   DISK READ  DISK WRITE  IO>  COMMAND
12847 be/4  mysql  98.20 M/s   5.60 M/s   82.35% mysqld …
3021  be/4  root   22.10 M/s   0.00 B/s   15.20% tar czf …
891   be/4  elastic 5.20 M/s  36.70 M/s   8.50% java …

Key columns:

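Total DISK READ/WRITE – Bandwidth between processes (and kernel threads) and the kernel block subsystem.

Actual DISK READ/WRITE – Bandwidth between the block subsystem and the hardware. The two can differ at any instant because of caching and write reordering; here writes are lower because they are still buffered.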

IO> – Percentage of time the process spent waiting for I/O.

PRIO – I/O priority (e.g., be/4 = best‑effort class, level 4).

Adjusting Process I/O Priority with ionice

# Reduce priority of a noisy process (honored by bfq; mq-deadline honors I/O priorities only on recent kernels)
sudo ionice -c 3 -p 3021   # class 3 = idle, runs only when disk is idle
# Start a low‑priority backup directly
sudo ionice -c 3 nice -n 19 tar czf /backup/db-$(date +%Y%m%d).tar.gz /var/lib/mysql
# Note: ionice has no effect with the "none" scheduler used by NVMe.

Scriptable Per‑Process Statistics with pidstat

# Show I/O every second for 10 samples
pidstat -d 1 10
# Focus on a specific PID
pidstat -d -p 12847 1
# Example fields: kB_rd/s, kB_wr/s, kB_ccwr/s (canceled writes), iodelay (ticks spent waiting)

Block‑Layer Tracing with blktrace + btt

iostat and iotop resolve roughly 80 % of cases; for the rest, blktrace provides a full lifecycle view of every request.

# Capture block‑layer events for 10 seconds on /dev/sda
sudo blktrace -d /dev/sda -w 10 -o trace
# Convert to human‑readable format
blkparse -i trace -o trace.txt
# Example line (read of 8 sectors = 4 KB)
#  8,0  1  1  0.000000000 12847 Q R 123456 + 8 [mysqld]
# Q→D queue time 15 µs, D→C device service 835 µs, total 850 µs
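
To get the Q→C total for every request rather than one hand-picked line, the blkparse text can be paired up with awk. This is a sketch under a simplifying assumption: a request is matched by device and start sector, so requests that were merged or split between Q and C will not pair up. The function name q2c_usec is made up for this example.

```shell
#!/usr/bin/env bash
# q2c_usec: read blkparse text on stdin and print "<sector> <Q-to-C latency in µs>".
# blkparse columns: $1 dev, $2 cpu, $3 seq, $4 time (s), $5 pid, $6 action, $7 RWBS, $8 sector
q2c_usec() {
    awk '
        $6 == "Q" { queued[$1, $8] = $4 }            # remember when the request was queued
        $6 == "C" && (($1, $8) in queued) {          # completion of a request we saw queued
            printf "%s %.0f\n", $8, ($4 - queued[$1, $8]) * 1000000
            delete queued[$1, $8]
        }'
}

# Usage: blkparse -i trace | q2c_usec | sort -k2,2nr | head
```

Sorting by the second column, as in the usage line, lists the slowest requests first.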

Summarize latency distribution with btt:

# Merge the per-CPU trace files into one binary file first (-d); -O suppresses the text output
blkparse -i trace -d trace.bin -O
btt -i trace.bin
# Sample output (values in seconds)
# Q2C  MIN=0.000085 AVG=0.001250 MAX=0.025000 N=12847
# Q2D  MIN=0.000005 AVG=0.000018 MAX=0.000350 N=12847
# D2C  MIN=0.000080 AVG=0.001232 MAX=0.024800 N=12847
# Interpretation: most latency is in the hardware stage (D2C), so the bottleneck is the device.
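
The interpretation can be backed by simple arithmetic on the averages: divide each stage by Q2C to see where the time goes. The helper name stage_share is an assumption for this sketch; the inputs are the AVG values from the sample output above.

```shell
#!/usr/bin/env bash
# stage_share PART TOTAL -> percentage of TOTAL spent in PART, one decimal place
stage_share() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * part / total }'
}

stage_share 0.001232 0.001250   # D2C share of Q2C: prints 98.6
stage_share 0.000018 0.001250   # Q2D share of Q2C: prints 1.4
```

With about 98.6 % of the time spent between dispatch and completion, scheduler or queue tuning has little left to gain on this device.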

Visualize with iowatcher (optional):

sudo apt install iowatcher
iowatcher -t trace -o io-pattern.svg

Disk Performance Benchmarking with fio

Establish the raw capability of the storage before tuning.

# Install fio
sudo apt install fio   # Debian/Ubuntu
sudo dnf install fio   # RHEL/Fedora

Core parameters (most important):

--rw – I/O pattern (read, write, randread, randwrite, randrw).

--bs – Block size (4k for DB random, 128k/1m for sequential).

--iodepth – Queue depth (HDD 1-4, SATA SSD 32, NVMe 64-128).

--ioengine – io_uring (recommended on kernel 6.x) or libaio.

--numjobs – Number of concurrent jobs (1-4, increase for multi-queue testing).

--size – Test file size (≥2 × RAM to avoid cache effects).

--direct=1 – Bypass page cache.

--runtime – 60-120 s (shorter runs are unstable).

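Standard test scenarios. The first job only reads from the raw device, which is non-destructive; never point a write job at a raw device that holds data. For safety, replace /dev/nvme0n1 with a file path on a scratch filesystem: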

# Random read 4 KB (DB query simulation) on NVMe
fio --name=rand-read --ioengine=io_uring --rw=randread \
    --bs=4k --iodepth=64 --numjobs=4 --size=4G \
    --direct=1 --runtime=60 --group_reporting \
    --filename=/dev/nvme0n1

# Random write (DB writes)
fio --name=rand-write --ioengine=io_uring --rw=randwrite \
    --bs=4k --iodepth=64 --numjobs=4 --size=4G \
    --direct=1 --runtime=60 --group_reporting \
    --directory=/mnt/test

# Sequential read (large file scan)
fio --name=seq-read --ioengine=io_uring --rw=read \
    --bs=1m --iodepth=16 --numjobs=1 --size=8G \
    --direct=1 --runtime=60 --group_reporting \
    --directory=/mnt/test

# Mixed 70% read / 30% write (OLTP)
fio --name=mixed-rw --ioengine=io_uring --rw=randrw --rwmixread=70 \
    --bs=4k --iodepth=32 --numjobs=4 --size=4G \
    --direct=1 --runtime=60 --group_reporting \
    --directory=/mnt/test

Sample output (random read):

rand-read: (groupid=0, jobs=4): err=0: pid=5678
  read: IOPS=185.2k, BW=723MiB/s (758MB/s)
    slat (nsec): min=1200 max=85000 avg=2850
    clat (usec): min=45 max=12500 avg=1350
    lat (usec): min=48 max=12520 avg=1353
    clat percentiles (usec):
      1.00th=[ 120] 5.00th=[ 245] 10.00th=[ 400]
      50.00th=[1150] 90.00th=[2350] 95.00th=[2900]
      99.00th=[4500] 99.50th=[5800] 99.99th=[10800]
    bw (KiB/s): min=680000 max=760000 avg=740800
    iops      : min=170000 max=190000 avg=185200

Important fields:

IOPS – 185 k random-read IOPS at 4 KB, which is NVMe-class performance.

slat – Submission latency (reported in ns in the sample above).

clat – Completion latency (the metric that matters most).

lat – Total latency = slat + clat.

clat percentiles – P99 and P99.99 reveal long-tail latency hidden by the average.
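
A quick sanity check on any fio result is Little's law: average requests in flight = IOPS × average latency. If the result is close to numjobs × iodepth, the queue was kept full and the latency figure mostly reflects queueing at that depth. The helper name in_flight is an assumption for this sketch; the numbers come from the sample output above.

```shell
#!/usr/bin/env bash
# in_flight IOPS AVG_LAT_USEC -> average number of outstanding requests (Little's law)
in_flight() {
    awk -v iops="$1" -v lat="$2" 'BEGIN { printf "%.1f\n", iops * lat / 1000000 }'
}

in_flight 185200 1353   # prints 250.6, close to 4 jobs x iodepth 64 = 256
```

Because the sample run kept about 250 requests outstanding, its 1.35 ms average says more about the chosen queue depth than about the drive's per-request service time; rerunning with --iodepth=1 would show the latter.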

io_uring vs libaio: on kernel 6.x, io_uring reduces system-call overhead and can be 10-30 % faster in high-IOPS workloads, depending on the workload and configuration. Some modern data stores already support it (RocksDB, and PostgreSQL from version 18 via io_method=io_uring).

Filesystem Selection

Depending on the workload, the choice of filesystem can change performance by 2-3×.

ext4 – Very stable, excellent small‑file performance, default on Ubuntu.

xfs – Excellent concurrent write handling (allocation groups allow parallel allocation, plus delayed allocation), default on RHEL, recommended for database servers.

btrfs – Native snapshots, transparent compression (zstd/lzo), useful for log storage or containers.

Mount‑option tuning (example for ext4 and xfs):

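# ext4 high-performance mount. barrier=0 and data=writeback risk corruption on power loss;
# use them only with a battery-backed write cache or disposable data.
mount -o noatime,nodiratime,barrier=0,data=writeback /dev/sda1 /data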
# xfs high‑performance mount
mount -o noatime,logbufs=8,logbsize=256k /dev/sda1 /data
# Persist in /etc/fstab
/dev/nvme0n1p1  /data  xfs  defaults,noatime,logbufs=8,logbsize=256k  0 2

I/O Performance Tuning

Readahead

Prefetch size influences sequential workloads.

# Show current readahead (units of 512‑byte sectors, default 256 = 128 KB)
blockdev --getra /dev/sda
# Increase for sequential workloads (Kafka, HDFS)
sudo blockdev --setra 2048 /dev/sda   # 1 MB
# Decrease for pure random workloads (OLTP)
sudo blockdev --setra 64 /dev/sda     # 32 KB
# Persist via udev rule (run as root; 1024 KB matches --setra 2048)
cat > /etc/udev/rules.d/61-readahead.rules <<'EOF'
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{bdi/read_ahead_kb}="1024"
EOF

Dirty‑Page Parameters

Control how much dirty data can accumulate before being flushed.

# Current values
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_expire_centisecs vm.dirty_writeback_centisecs
# Database‑focused tuning (low latency)
sysctl -w vm.dirty_ratio=5
sysctl -w vm.dirty_background_ratio=2
sysctl -w vm.dirty_expire_centisecs=1000   # 10 s
sysctl -w vm.dirty_writeback_centisecs=100   # 1 s
# Log/streaming workload (high throughput)
sysctl -w vm.dirty_ratio=40
sysctl -w vm.dirty_background_ratio=20
sysctl -w vm.dirty_expire_centisecs=6000
sysctl -w vm.dirty_writeback_centisecs=500
# Persist
cat > /etc/sysctl.d/60-io-tuning.conf <<'EOF'
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2
vm.dirty_expire_centisecs = 1000
vm.dirty_writeback_centisecs = 100
EOF
sysctl --system
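
The ratios are percentages of memory, so the same value means very different amounts of dirty data on different machines. The sketch below converts a ratio into an approximate size. It assumes the percentage applies to total RAM; the kernel actually applies it to dirtyable memory (roughly free plus reclaimable pages), so the real threshold is somewhat lower. The function name dirty_limit_mib is made up for this example.

```shell
#!/usr/bin/env bash
# dirty_limit_mib MEM_GIB RATIO_PCT -> approximate dirty-data threshold in MiB
# Assumption: the ratio is applied to total RAM (an upper bound on the real threshold).
dirty_limit_mib() {
    awk -v gib="$1" -v pct="$2" 'BEGIN { printf "%d\n", gib * 1024 * pct / 100 }'
}

dirty_limit_mib 64 5    # prints 3276  (about 3.2 GiB before writers are throttled)
dirty_limit_mib 64 40   # prints 26214 (about 25.6 GiB)
```

On a 64 GiB host, dirty_ratio=40 lets roughly 25 GiB of unwritten data accumulate, which is why the high-throughput profile can produce long flush stalls on slow disks.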

Queue Depth and Merge Settings

# Current request queue size (the default depends on device and scheduler)
cat /sys/block/sda/queue/nr_requests
# Increase for high-concurrency workloads (run as root)
echo 1024 > /sys/block/sda/queue/nr_requests
# Merge policy (0 = all merges, 1 = simple merges only, 2 = no merges)
cat /sys/block/sda/queue/nomerges
# For pure random I/O, disable merging
echo 2 > /sys/block/sda/queue/nomerges

cgroup v2 I/O Limiting (multi‑tenant environments)

# Find device major:minor (e.g., 8:0 for /dev/sda)
ls -l /dev/sda
# Enable the io controller for children of the root cgroup, then create the cgroup.
# (The controller must be enabled in the parent's cgroup.subtree_control, not the child's.)
echo "+io" > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/backup-jobs
# Limit read/write bandwidth to 50 MiB/s and IOPS to 1000
cat > /sys/fs/cgroup/backup-jobs/io.max <<'EOF'
8:0 rbps=52428800 wbps=52428800 riops=1000 wiops=1000
EOF
# Add the backup process to the cgroup
echo $BACKUP_PID > /sys/fs/cgroup/backup-jobs/cgroup.procs
# Optional latency target (5000 µs = 5 ms). io.latency protects the cgroup it is set on,
# so it belongs on the latency-sensitive workload's cgroup ("database" is a placeholder name)
echo "8:0 target=5000" > /sys/fs/cgroup/database/io.latency
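
io.max takes bytes per second, which is easy to mistype. A small helper can build the line from MiB/s. The function name io_max_line is an assumption for this sketch; the key names (rbps, wbps, riops, wiops) are the ones used in the example above.

```shell
#!/usr/bin/env bash
# io_max_line MAJ:MIN MIB_PER_SEC IOPS -> one line suitable for a cgroup v2 io.max file
# Applies the same limit to reads and writes.
io_max_line() {
    local dev=$1 mib=$2 iops=$3
    local bps=$(( mib * 1024 * 1024 ))     # MiB/s -> bytes/s
    printf '%s rbps=%d wbps=%d riops=%d wiops=%d\n' "$dev" "$bps" "$bps" "$iops" "$iops"
}

io_max_line 8:0 50 1000
# prints: 8:0 rbps=52428800 wbps=52428800 riops=1000 wiops=1000
```

As root, the output can be redirected straight into the cgroup, for example io_max_line 8:0 50 1000 > /sys/fs/cgroup/backup-jobs/io.max.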

Common I/O Problem Cases

Case 1 – Night‑time Backup Causes MySQL Latency Spike

Symptom : Every night 02:00‑02:30 MySQL slow‑query count jumps 10×, await rises from 0.5 ms to 15 ms.

Investigation :

# 1. Verify I/O load
iostat -xmt 1
# 2. Identify heavy readers
sudo iotop -oP
# Output shows mysqld (≈98 MB/s) and rsync (≈22 MB/s) reading the data directory.
# 3. Confirm rsync is a scheduled job
ps aux | grep rsync

Resolution (choose one):

Lower backup priority (effective only when the disk's scheduler honors I/O priorities, such as bfq):

ionice -c 3 nice -n 19 rsync -avz /var/lib/mysql/ backup-server:/backup/mysql/

cgroup bandwidth limit: set io.max to 50 MiB/s for the backup cgroup.

Use rsync --bwlimit=50000 (the unit is KiB/s, so this is ≈49 MiB/s).

Case 2 – ext4 Journal Write Amplification Slows Elasticsearch

Symptom : Write throughput drops from 200 MB/s to 40 MB/s, w/s is high but wareq‑sz stays at 4 KB.

Investigation :

# Observe write pattern
iostat -xmt -d sda 1
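# Trace writes
sudo blktrace -d /dev/sda -w 5 -o journal-trace
# Count queued writes per process ($6 = action, $7 = RWBS flags, last field = [process])
blkparse -i journal-trace | awk '$6 == "Q" && $7 ~ /W/ {print $NF}' | sort | uniq -c | sort -rn | head
# Shows many writes from [jbd2/sda1-8] (ext4 journal thread)
# Check journal mode: the filesystem default and the active mount options
tune2fs -l /dev/sda1 | grep -i "mount options"
grep " /data " /proc/mounts
# Output: default mount option is journal_data (writes data twice).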

Resolution :

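Switch to ordered mode (only metadata journaled). ext4 refuses to change the data mode on a live remount, so change the filesystem default with sudo tune2fs -o journal_data_ordered /dev/sda1 or add data=ordered to /etc/fstab, then unmount and mount /data again.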

For Elasticsearch, which handles its own consistency, writeback can be used if the system has a UPS/BBU.

Case 3 – NVMe %util Shows 100 % but Device Is Not Saturated

Symptom : Monitoring alerts fire on %util=100 for an NVMe, yet await stays at 0.08 ms.

Investigation :

# Detailed iostat
iostat -xmt -d nvme0n1 1
# Example: r/s=45 k, r_await=0.08 ms, %util=100%.
# %util is the share of wall-clock time during which at least one request was in flight. A device that serves many requests in parallel reaches 100 % as soon as it is never idle, long before its queues are full.

Conclusion : For NVMe devices, base health checks on await and actual IOPS versus the device’s rated capabilities, not on %util. Adjust monitoring rules accordingly.
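
The effect is easy to reproduce on paper. The sketch below computes a %util-style figure from a list of request intervals by measuring the time during which at least one request was outstanding. It is a model of the idea, not the kernel's accounting code, and the function name busy_pct is made up for this example.

```shell
#!/usr/bin/env bash
# busy_pct WINDOW_MS
# stdin: one "start_ms end_ms" pair per request, sorted by start time.
# Prints the percentage of the window during which at least one request was in flight.
busy_pct() {
    awk -v win="$1" '
        NR == 1 { s = $1; e = $2; next }                       # open the first busy period
        $1 > e  { busy += e - s; s = $1; e = $2; next }        # gap: close period, open a new one
        $2 > e  { e = $2 }                                     # overlap: extend current period
        END     { if (NR) busy += e - s; printf "%.0f\n", 100 * busy / win }'
}

printf '0 6\n4 10\n' | busy_pct 10   # two overlapping requests: prints 100
printf '0 2\n5 7\n'  | busy_pct 10   # two separate requests:    prints 40
```

In the first case the device reports 100 % although it never held more than two requests at once, which is far below what an NVMe queue can absorb.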

Real‑Time I/O Latency with bpftrace

biolatency – Per-disk latency histogram (a BCC tool; the -bpfcc suffix is the Debian/Ubuntu packaging from bpfcc-tools)

# Simple per‑disk latency histogram, 1‑second intervals
sudo biolatency-bpfcc -D 1

Typical output shows most I/O completing in 8‑31 µs with a few outliers in the ms range, highlighting long‑tail latency.

Custom per‑process latency script

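# Runs inline; to keep it, save the quoted program as io-latency-by-process.bt and run: sudo bpftrace io-latency-by-process.bt
# Note: comm is the task that issued the request, which can be a kernel worker for buffered writes.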
sudo bpftrace -e '
tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; @comm[args->dev, args->sector] = comm; }
tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
    $lat = (nsecs - @start[args->dev, args->sector]) / 1000; // µs
    @latency[@comm[args->dev, args->sector]] = hist($lat);
    delete(@start[args->dev, args->sector]);
    delete(@comm[args->dev, args->sector]);
}
END { clear(@start); clear(@comm); }'

The output groups latency histograms by process name (e.g., mysqld, rsync).

biosnoop – Per‑request trace

# Show each I/O request with process name, latency, sector, size
sudo biosnoop-bpfcc -d nvme0n1
# Filter for latency > 1 ms
sudo biosnoop-bpfcc -d nvme0n1 -Q | awk '$NF > 1.0'

ext4slower / xfsslower – Filesystem‑level slow operations

# Show operations taking >1 ms on ext4
sudo ext4slower-bpfcc 1
# Sample output:
# TIME   COMM   PID  T BYTES OFF_KB LAT(ms) FILENAME
# 15:30:01 mysqld 12847 R 16384 1024 2.35 ibdata1
# 15:30:01 mysqld 12847 S 0 0 5.80 ib_logfile0

These tools directly associate latency with the affected file, making it easy to pinpoint problematic journal or fsync operations.

Step‑by‑Step Troubleshooting Process

System stall / business timeout
  |
  v
top → %wa (iowait)
  Low → not I/O, check CPU/memory/network
  High → iostat -xmt 1
    await normal → possible application block, not disk
    await high →
      Device type?
        NVMe → look at await + IOPS, ignore %util
        HDD  → consider %util + await + aqu‑sz together
      iotop -oP → identify heavy‑I/O processes
      Examine I/O pattern (wareq‑sz / rareq‑sz)
      Deep dive:
        blktrace + btt → stage latency breakdown
        bpftrace → latency distribution, long tail
        fio → benchmark baseline
      Tune / resolve:
        Scheduler, readahead, dirty‑ratio, cgroup limits, hardware upgrade

Tool Quick‑Reference

top – Check iowait (negligible overhead).

iostat – Overall IOPS, latency, throughput (negligible overhead).

sar – Historical I/O trends (reads log files).

iotop – Per-process I/O (low overhead).

pidstat – Scriptable per-process I/O.

ionice – Adjust process I/O priority.

blktrace – Block-layer request lifecycle (medium overhead, large logs).

btt – Offline analysis of blktrace data.

fio – Disk benchmark (high overhead, stress test).

biolatency – I/O latency histogram (low overhead).

biosnoop – Per-request trace (medium overhead).

ext4slower / xfsslower – Filesystem-level slow ops (low overhead).

Key Technical Insights

Layered diagnosis is essential – locate the bottleneck at the correct stack layer before changing kernel parameters.

%util is meaningful for HDDs but not for SSD/NVMe; rely on await and IOPS for solid-state devices.

await is the primary latency indicator: HDD normal 5-15 ms, SATA SSD 0.05-0.5 ms, NVMe 0.01-0.03 ms.

The svctm field is deprecated and should be ignored.

Use the tool chain in order: iostat → iotop → blktrace → bpftrace, with fio for the baseline.

Scheduler choice matters: none for NVMe, mq-deadline for HDD, bfq for shared HDD workloads.

Database servers benefit from low dirty_ratio (5) and low dirty_background_ratio (2) to avoid sudden flush‑induced latency spikes.

Match readahead to workload: large for sequential (Kafka, HDFS), small for random (OLTP).

The cgroup v2 io controller provides bandwidth and IOPS limits plus latency targets for multi-tenant environments.

Further Reading

Linux Block I/O Layer – kernel documentation.

iostat(1) man page – sysstat.

BPF Performance Tools (Brendan Gregg) – Chapter 9.

fio Documentation – official site.

Linux Storage Stack Diagram – community visual guide.

io_uring kernel documentation.

liburing GitHub repository.

bcc/libbpf‑tools source.

NVMe CLI – management tool.

Systems Performance, 2nd Edition – Brendan Gregg.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
