Why Linux Servers Freeze: Deep Dive into iostat, iotop, blktrace, fio & bpftrace for Disk IO Troubleshooting
This comprehensive guide walks you through the Linux IO stack, explains key metrics from iostat and iotop, demonstrates advanced tracing with blktrace and bpftrace, shows how to benchmark with fio, and provides practical tuning steps to resolve high‑IO latency and system hangs.
1. Overview
When iowait spikes to 80% and simple commands like ls take seconds, the system appears frozen. This is common on machines running heavy‑IO workloads such as MySQL, Elasticsearch, or Kafka. CPU and memory may look fine, but the disk subsystem is the bottleneck.
2. Linux IO Stack Full View
Understanding where an IO request travels helps avoid blind guesses. The stack consists of:
Application (read/write/io_uring)
|
v
VFS (Virtual File System) – Page Cache
|
v
Filesystem (ext4 / xfs / btrfs)
|
v
Block Layer (merge, scheduler, multi‑queue blk‑mq)
|
v
Device driver (NVMe / SCSI / virtio‑blk)
|
v
Physical device (NVMe SSD / SATA SSD / HDD / cloud disk)
Each layer can become a bottleneck, so troubleshooting proceeds layer by layer.
2.1 VFS + Page Cache
If free -h shows a tiny buff/cache or sar -B reports high pgpgin, many reads bypass the cache and hit the disk. Writes first go to dirty pages and are flushed asynchronously by the kernel writeback (flusher) threads.
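To get a feel for how much unflushed data those dirty pages can represent, note that the hard limit is vm.dirty_ratio percent of available memory. A back-of-the-envelope sketch with illustrative numbers (64 GiB RAM, the common default dirty_ratio=20):

```shell
# Writers start blocking once dirty pages reach roughly
# vm.dirty_ratio percent of available memory. With 64 GiB of RAM and
# dirty_ratio=20, that is ~12.8 GiB of unflushed data.
mem_gib=64        # illustrative machine size
dirty_ratio=20    # kernel default on many distributions
awk -v m="$mem_gib" -v r="$dirty_ratio" \
    'BEGIN { printf "dirty limit = %.1f GiB\n", m * r / 100 }'
```

Section 9.2 below shows how to tune these ratios for latency-sensitive workloads.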
2.2 Filesystem Layer
Journaling (ext4) or logging (xfs) adds extra IO. Use filefrag to inspect fragmentation, which can turn sequential reads into random reads.
2.3 Block Layer
This is the layer visible to iostat. Requests are merged, sorted, and scheduled. Modern kernels use the multi‑queue blk‑mq architecture, giving each CPU its own software queue.
2.4 Device Driver & Physical Device
NVMe devices expose many hardware queues (the spec allows up to 64K queues with up to 64K entries each; real devices typically offer one queue per CPU), while SATA SSDs have a single queue (AHCI, depth 32). This directly impacts concurrent IO performance.
3. IO Scheduler Details
Linux 6.x provides four schedulers. Choose based on device type:
none : No scheduling, direct to hardware – best for NVMe.
mq‑deadline : General purpose, deadline‑based, read‑preferring – good for SATA SSDs and HDDs.
bfq : Fair bandwidth allocation per process – ideal for multi‑tenant HDD workloads.
kyber : Light‑weight, latency‑targeted – suited for low‑latency SSDs.
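The active scheduler is visible in sysfs, marked with brackets. A minimal sketch that extracts it; the sample value below is hardcoded for illustration (on a live box read it from /sys/block/<dev>/queue/scheduler):

```shell
# /sys/block/<dev>/queue/scheduler lists the available schedulers and
# marks the active one in brackets, e.g. "mq-deadline kyber [none]".
sched_line="mq-deadline kyber [none]"   # sample; live systems would use:
#   sched_line=$(cat /sys/block/nvme0n1/queue/scheduler)
active=$(printf '%s\n' "$sched_line" | sed -n 's/.*\[\(.*\)\].*/\1/p')
printf '%s\n' "$active"
```

Writing a scheduler name into the same file switches it at runtime, but that change is lost on reboot, hence the udev rule below.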
Persist the choice with a udev rule, e.g.:
cat > /etc/udev/rules.d/60-io-scheduler.rules <<'EOF'
ACTION=="add|change", KERNEL=="nvme[0-9]*", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
EOF
4. iostat Deep Dive
iostat is the first stop, but it reports many fields. The most important ones are:
r/s / w/s : IO operations per second (IOPS). A single HDD sustaining >150 IOPS is near saturation; NVMe can handle >100 k IOPS.
rkB/s / wkB/s : Throughput in KB/s.
rrqm/s / wrqm/s : Merged requests per second – high values mean the workload is sequential or nicely aligned.
r_await / w_await : Average latency (ms) including queue wait and device service. Normal ranges: HDD 5‑15 ms, SATA SSD 0.05‑0.5 ms, NVMe 0.01‑0.03 ms.
aqu-sz : Average queue length – large values indicate the device cannot keep up.
%util : Device busy time. Useful for HDD, but misleading for SSD/NVMe (high %util can coexist with low latency).
The legacy svctm field is deprecated and should be ignored.
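Because iostat output is plain columns, it is easy to script alerts around it. A sketch that flags devices with high read latency; the sample line and the column index ($8 = r_await) are illustrative – field positions differ across sysstat versions, so verify them against your iostat header:

```shell
# Flag devices whose r_await exceeds 10 ms. The line mimics one row of
# `iostat -x` output; column 8 is assumed to be r_await here.
line="sda 120.0 80.0 480.0 320.0 0.0 0.0 14.2 2.8 3.5 95.0"
printf '%s\n' "$line" | awk '$8 > 10 { print $1, "r_await high:", $8 }'
```

In a cron job the same awk filter would run over live `iostat -x 1 1` output instead of a hardcoded line.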
4.1 iostat Practical Checklist
# Real‑time view, 1‑second interval, MB units
iostat -xmt 1
# Focus on a specific disk
iostat -xmt -d nvme0n1 1
4.2 Quick Diagnosis Template
# 1. await high + aqu‑sz high + %util high → disk truly saturated (common on HDD)
# 2. await high + aqu‑sz low + %util low → single slow IO, possible hardware fault or RAID downgrade
# 3. await normal + w/s huge + wkB/s low → many tiny writes – consider merging or raise dirty_ratio
# 4. rrqm/s ≈ 0 + rareq‑sz very small → pure random reads, page cache ineffective
# 5. w_await >> r_await → write bottleneck – check fsync frequency or journal mode
5. iotop – Pinpoint IO‑Heavy Processes
iostat tells you the disk is busy; iotop tells you which process is responsible.
5.1 Basic Usage
# Requires root
sudo iotop -oP
# Faster C implementation (optional)
sudo apt install iotop-c && sudo iotop-c -oP
5.2 Interpreting Output
Key columns:
Total DISK READ/WRITE : All processes’ requested IO before page cache.
Actual DISK READ/WRITE : IO that really hit the device (writes are often lower because of caching).
IO> : Percentage of time the process waited for IO.
PRIO : IO priority, e.g. be/4 (best‑effort class 4).
Example shows MySQL reading 98 MB/s and a backup tar reading 22 MB/s, together saturating the disk.
5.3 Adjusting Process Priority
# Set class 3 (idle) for a backup process
sudo ionice -c 3 -p 3021
# Run a command directly with low priority
sudo ionice -c 3 nice -n 19 tar czf /backup/db.tar.gz /var/lib/mysql
Note: ionice only works with schedulers that honor IO priorities (bfq, and mq‑deadline since kernel 5.13); none (NVMe) ignores it.
5.4 pidstat – Scriptable Per‑Process IO
# Show IO every second for 10 samples
pidstat -d 1 10
# Focus on a single PID
pidstat -d -p 12847 1
Fields include kB_rd/s, kB_wr/s, kB_ccwr/s (cancelled writes) and iodelay (clock ticks spent waiting for IO).
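Since the output is columnar, pidstat pairs well with awk for automated checks. A sketch that flags processes with high iodelay; the sample line and the column positions are illustrative – verify them against your pidstat header:

```shell
# Columns mimic `pidstat -d` output:
#   UID  PID  kB_rd/s  kB_wr/s  kB_ccwr/s  iodelay  Command
line="999 12847 1024.00 8192.00 0.00 37 mysqld"
# Flag processes that spent more than 30 clock ticks waiting on IO
printf '%s\n' "$line" | awk '$6 > 30 { print "PID " $2 " (" $7 ") iodelay=" $6 }'
```

The same filter piped from a live `pidstat -d 1 1` would surface IO-delayed processes without an interactive terminal.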
6. blktrace + btt – Block‑Layer Deep Analysis
6.1 How blktrace Works
blktrace inserts tracepoints in the block layer and records the full lifecycle of each request:
Q (Queued) → request enters block queue
G (Get) → request structure allocated
M (Merged) → merged with existing request
I (Inserted)→ inserted into scheduler queue
D (Dispatched)→ sent to device driver
C (Completed) → device finishes IO
Time differences give per‑stage latency (e.g. Q→D = software queue time, D→C = hardware service time).
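Per-stage latency is just timestamp subtraction over matching events. For example, with a hypothetical Q event at 0.000123 s and its matching C event at 0.004523 s in blkparse output:

```shell
# Q→C latency from two blkparse event timestamps (seconds), shown in ms.
# The timestamps are made up for illustration.
q_ts=0.000123
c_ts=0.004523
awk -v q="$q_ts" -v c="$c_ts" 'BEGIN { printf "Q2C = %.3f ms\n", (c - q) * 1000 }'
```

btt (next section) automates exactly this bookkeeping across every request in the trace.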
6.2 Capturing Traces
# Record 10 seconds on /dev/sda
sudo blktrace -d /dev/sda -w 10 -o trace
# Convert to readable format
blkparse -i trace -o trace.txt
6.3 Summarizing with btt
# Generate latency distribution
blkparse -i trace -d trace.bin
btt -i trace.bin
Typical btt output shows Q2C (total latency), Q2D (software latency) and D2C (hardware latency). If Q2D dominates, the bottleneck is in the scheduler or queue; if D2C dominates, the device itself is the limit.
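Whether hardware or software dominates can be read straight off the btt averages. With the hypothetical numbers below, the device accounts for 85 % of total latency, so scheduler tuning would buy little:

```shell
# Share of total latency spent inside the device (values are illustrative)
q2c=4.8   # avg Q2C total latency, ms
d2c=4.1   # avg D2C hardware service time, ms
awk -v t="$q2c" -v h="$d2c" 'BEGIN { printf "hardware share: %.0f%%\n", h / t * 100 }'
```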
6.4 Visualisation with iowatcher
# Install and generate SVG chart
sudo apt install iowatcher
iowatcher -t trace -o io-pattern.svg
The chart displays IOPS, throughput, latency distribution, and offset distribution (sequential vs random).
7. fio – Disk Benchmarking
Before tuning, know the raw capability of the disk.
7.1 Core Parameters
--rw : IO pattern (read, write, randread, randwrite, randrw).
--bs : Block size (4 k for random DB workloads, 1 M for sequential).
--iodepth : Queue depth (HDD 1‑4, SATA SSD 32, NVMe 64‑128).
--ioengine : libaio or io_uring (recommended on kernel 6.x).
--numjobs : Parallel jobs.
--size : Test file size (≥ 2× RAM to avoid cache effects).
--direct=1 : Bypass the page cache.
--runtime : 60‑120 s for stable results.
7.2 Example Scenarios
# Random read 4 k on NVMe (DB‑like)
fio --name=rand-read --ioengine=io_uring --rw=randread \
--bs=4k --iodepth=64 --numjobs=4 --size=4G \
--direct=1 --runtime=60 --group_reporting \
--filename=/dev/nvme0n1
# Sequential read 1 M (large file scan)
fio --name=seq-read --ioengine=io_uring --rw=read \
--bs=1m --iodepth=16 --size=8G \
--direct=1 --runtime=60 --group_reporting \
--directory=/mnt/test
7.3 Interpreting fio Output
Key fields:
IOPS : Operations per second.
BW : Bandwidth (MiB/s).
slat : Submission latency (user‑to‑kernel).
clat : Completion latency (from submission until the device completes the IO).
lat : Total latency = slat + clat.
clat percentiles : Tail latency (P99, P99.9).
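Little's law ties these numbers together: sustained IOPS ≈ queue depth / mean completion latency. A sanity-check sketch with illustrative values (iodepth 64, clat 0.5 ms):

```shell
# IOPS ceiling implied by queue depth and completion latency
# (Little's law; the inputs are illustrative, not measured)
iodepth=64
clat_ms=0.5
awk -v d="$iodepth" -v l="$clat_ms" 'BEGIN { printf "%.0f IOPS\n", d / (l / 1000) }'
```

If fio reports far fewer IOPS than this estimate, the bottleneck is likely submission overhead or CPU rather than the device queue.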
7.4 io_uring vs libaio
# libaio baseline
fio --name=aio-test --ioengine=libaio --rw=randread \
--bs=4k --iodepth=128 --size=4G --direct=1 --runtime=30
# io_uring version
fio --name=uring-test --ioengine=io_uring --rw=randread \
--bs=4k --iodepth=128 --size=4G --direct=1 --runtime=30
io_uring typically yields 10‑30 % higher IOPS because it reduces system‑call overhead.
8. Filesystem Choice
The filesystem directly impacts IO performance.
8.1 ext4 / xfs / btrfs Comparison
ext4 : Very stable, good for general Linux servers, excellent small‑file performance.
xfs : Excellent large sequential writes, good for databases and Kubernetes nodes.
btrfs : Native snapshots, transparent compression (zstd/lzo), useful for backup or container storage.
Recommendation:
Database servers (MySQL/PostgreSQL): xfs – low‑latency journaling.
General servers: ext4 – mature and well‑documented.
Snapshot/compression needs: btrfs .
Kubernetes nodes: xfs – works best with overlayfs.
8.2 Mount Options for Performance
# ext4 high‑performance mount – barrier=0 and data=writeback trade crash safety for speed; use only with power‑loss protection
mount -o noatime,nodiratime,barrier=0,data=writeback /dev/sda1 /data
# xfs high‑performance mount
mount -o noatime,logbufs=8,logbsize=256k /dev/sda1 /data
# Persist in /etc/fstab
/dev/nvme0n1p1 /data xfs defaults,noatime,logbufs=8,logbsize=256k 0 2
9. IO Tuning Parameters
9.1 Readahead
# Show current value (unit: 512‑byte sectors; default 256 = 128 KB)
blockdev --getra /dev/sda
# Increase for sequential workloads (e.g. Kafka)
sudo blockdev --setra 2048 /dev/sda # 2048 × 512 B = 1 MB
# Decrease for random DB workloads
sudo blockdev --setra 64 /dev/sda # 32 KB
9.2 Dirty Page Ratios
# Current settings
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_expire_centisecs vm.dirty_writeback_centisecs
# Database‑focused (low latency)
sysctl -w vm.dirty_ratio=5
sysctl -w vm.dirty_background_ratio=2
sysctl -w vm.dirty_expire_centisecs=1000
sysctl -w vm.dirty_writeback_centisecs=100
# Log‑heavy (throughput)
sysctl -w vm.dirty_ratio=40
sysctl -w vm.dirty_background_ratio=20
sysctl -w vm.dirty_expire_centisecs=6000
sysctl -w vm.dirty_writeback_centisecs=500
# Persist in /etc/sysctl.d/60-io-tuning.conf
cat > /etc/sysctl.d/60-io-tuning.conf <<'EOF'
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2
vm.dirty_expire_centisecs = 1000
vm.dirty_writeback_centisecs = 100
EOF
sysctl --system
9.3 Queue Depth & nr_requests
# View current depth
cat /sys/block/sda/queue/nr_requests
# Increase for high‑concurrency workloads
echo 1024 > /sys/block/sda/queue/nr_requests
# Control request merging (0 = allow, 2 = disable)
cat /sys/block/sda/queue/nomerges
# For pure random IO, set to 2 to skip merge checks
echo 2 > /sys/block/sda/queue/nomerges
9.4 cgroup v2 IO Limiting
# Identify device major:minor (e.g. 8:0 for /dev/sda)
ls -l /dev/sda
# Create a cgroup for backup jobs
mkdir -p /sys/fs/cgroup/backup-jobs
echo "+io" > /sys/fs/cgroup/backup-jobs/cgroup.subtree_control
# Limit to 50 MB/s (52428800 = 50 × 1024 × 1024 bytes) and 1000 IOPS
echo "8:0 rbps=52428800 wbps=52428800 riops=1000 wiops=1000" > /sys/fs/cgroup/backup-jobs/io.max
# Add the backup process to the group
echo $BACKUP_PID > /sys/fs/cgroup/backup-jobs/cgroup.procs
# Optional latency target (5 ms)
echo "8:0 target=5000" > /sys/fs/cgroup/backup-jobs/io.latency10. Case Studies
10.1 Nightly Backup Slowing MySQL
Symptoms: MySQL slow‑query count spikes at 02:00, await rises from 0.5 ms to 15 ms.
Investigation:
iostat shows w_await jump.
iotop reveals MySQL and a tar backup both reading heavily.
Conclusion: Backup consumes most bandwidth, starving MySQL.
Solutions:
# Lower backup priority
ionice -c 3 nice -n 19 rsync -avz /var/lib/mysql/ backup:/backup/mysql
# Or cgroup bandwidth limit
echo "8:0 rbps=52428800 wbps=52428800" > /sys/fs/cgroup/backup/io.max
# Or rsync bandwidth limit
rsync -avz --bwlimit=50000 /var/lib/mysql/ backup:/backup/mysql
10.2 ext4 Journal Causing Write Amplification
Symptoms: Elasticsearch write throughput drops from 200 MB/s to 40 MB/s; w/s high but wareq‑sz only 4 KB.
Investigation:
iostat shows high write IOPS with tiny request size.
blktrace identifies most writes coming from jbd2 (ext4 journal).
tune2fs reveals the filesystem uses data=journal mode (listed as journal_data), which journals both data and metadata, doubling write volume.
Fix:
# Remount with ordered mode (only metadata journaled)
sudo mount -o remount,data=ordered /data
# Or set in /etc/fstab
/dev/sda1 /data ext4 defaults,noatime,data=ordered 0 2
10.3 NVMe %util 100 % but No Latency Impact
Alert: %util constantly 100 % on an NVMe, yet r_await stays at 0.08 ms.
Explanation:
NVMe can handle >500 k IOPS; the observed 45 k IOPS is far below capacity. %util measures the fraction of wall‑clock time during which at least one request was outstanding. With many concurrent hardware queues, the metric pins at 100 % even when each individual request completes quickly.
Action: Stop using %util for NVMe alerts; monitor await and IOPS instead.
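The arithmetic makes the distortion obvious: 45 k IOPS at 0.08 ms each represents 3.6 s of device service time per wall-clock second – only achievable through overlapping queues, so the "busy" accounting saturates while every request stays fast:

```shell
# Aggregate service time per second at 45k IOPS × 0.08 ms per IO.
# Anything above 1.0 s/s implies concurrency inside the device,
# which is exactly when %util stops being meaningful.
awk 'BEGIN { printf "%.1f s of device time per wall-clock second\n", 45000 * 0.00008 }'
```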
11. Troubleshooting Workflow Summary
System slowdown → top %wa high → iostat await high →
• NVMe: check await & IOPS (ignore %util)
• HDD: check await + %util + aqu‑sz
→ iotop to find offending process →
Analyze IO pattern (rareq‑sz, wareq‑sz) →
If needed, blktrace + btt for stage latency →
bpftrace for latency distribution →
fio for baseline performance →
Apply tuning (scheduler, readahead, dirty ratios, cgroup limits) →
Verify with iostat / iotop again.
12. Tool Cheat Sheet
top : Quick view of %wa (negligible overhead).
iostat : Overall IOPS, latency, throughput (very low overhead).
sar : Historical trends from /var/log/sysstat/ (no runtime cost).
iotop : Real‑time per‑process IO (low overhead, root required).
pidstat : Scriptable per‑process IO stats.
ionice : Adjust process IO priority.
blktrace : Block‑layer request tracing (moderate data volume).
btt : Offline analysis of blktrace data.
fio : Disk benchmark (high load, use carefully).
biolatency : IO latency histogram (low overhead).
biosnoop : Per‑request trace with process name.
ext4slower / xfsslower : Filesystem‑level slow‑IO tracing.
13. Tuning Reference Table
IO Scheduler : /sys/block/*/queue/scheduler – NVMe none, HDD mq-deadline, multi‑tenant HDD bfq.
Readahead : blockdev --setra – 1 MB for sequential, 32 KB for random.
Dirty Ratios : vm.dirty_ratio (hard) and vm.dirty_background_ratio (soft) – DB: 5 % / 2 %; log‑heavy: 40 % / 20 %.
Dirty Expire / Writeback : vm.dirty_expire_centisecs and vm.dirty_writeback_centisecs – DB: 1000 / 100, logs: 6000 / 500.
Queue Depth : /sys/block/*/queue/nr_requests – increase to 1024 for high concurrency.
IO Merge : /sys/block/*/queue/nomerges – 2 for pure random, 0 for sequential.
Mount Options : noatime always; data=ordered for databases, data=writeback only with power‑loss protection.
14. Conclusions
Key take‑aways:
Master the Linux IO stack and use the layered approach: iostat → iotop → blktrace → bpftrace → fio.
For HDDs, %util is a reliable saturation indicator; for SSD/NVMe focus on await and IOPS.
Choose the right scheduler: none for NVMe, mq-deadline for SATA/HDD, bfq for shared HDD.
Adjust dirty_ratio and dirty_background_ratio to eliminate write‑latency spikes in databases.
Tailor readahead to workload (large for sequential, small for random).
Use cgroup v2 io.max / io.latency / io.weight to enforce fairness in multi‑tenant or container environments.
15. Further Learning
Linux Block IO Layer documentation – deep dive into blk‑mq architecture.
iostat(1) man page – precise field definitions.
Brendan Gregg’s “BPF Performance Tools” – eBPF tracing for storage.
fio official documentation – exhaustive parameter guide.
Linux Storage Stack Diagram – visual overview of each layer.
io_uring kernel docs and liburing GitHub – modern asynchronous IO.
bcc/libbpf‑tools – source of biolatency, biosnoop, etc.
NVMe‑CLI – SMART data, firmware updates, namespace management.
Systems Performance (2nd ed.) by Brendan Gregg – classic performance analysis.
MaGe Linux Operations