
Why Linux Servers Freeze: Deep Dive into iostat, iotop, blktrace, fio & bpftrace for Disk IO Troubleshooting

This comprehensive guide walks you through the Linux IO stack, explains key metrics from iostat and iotop, demonstrates advanced tracing with blktrace and bpftrace, shows how to benchmark with fio, and provides practical tuning steps to resolve high‑IO latency and system hangs.


1. Overview

When iowait spikes to 80% and simple commands like ls take seconds, the system appears frozen. This is common on machines running heavy‑IO workloads such as MySQL, Elasticsearch, or Kafka. CPU and memory may look fine, but the disk subsystem is the bottleneck.
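The first step is simply confirming that the stall really is IO wait rather than CPU or memory pressure. A quick check with standard tools:

# %wa in the CPU line = time CPUs sit idle waiting for IO
top -bn1 | head -5
# 'wa' under cpu, 'b' = processes blocked in uninterruptible sleep (usually IO)
vmstat 1 5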

2. Linux IO Stack Full View

Understanding where an IO request travels helps avoid blind guesses. The stack consists of:

Application (read/write/io_uring)
    |
    v
VFS (Virtual File System) – Page Cache
    |
    v
Filesystem (ext4 / xfs / btrfs)
    |
    v
Block Layer (merge, scheduler, multi‑queue blk‑mq)
    |
    v
Device driver (NVMe / SCSI / virtio‑blk)
    |
    v
Physical device (NVMe SSD / SATA SSD / HDD / cloud disk)

Each layer can become a bottleneck, so troubleshooting proceeds layer by layer.

2.1 VFS + Page Cache

If free -h shows a tiny buff/cache or sar -B reports high pgpgin, many reads bypass the cache and hit the disk. Writes land in dirty pages first and are flushed asynchronously by the kernel's per-device writeback (flusher) threads.
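A quick sketch for checking how much the page cache is helping and how much dirty data is waiting to be written back (commands are generic, not from the original article):

free -h                                       # buff/cache column: memory used as page cache
sar -B 1 5                                    # pgpgin/s / pgpgout/s: pages actually read from / written to disk
grep -E '^(Dirty|Writeback):' /proc/meminfo   # dirty pages queued for asynchronous writeback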

2.2 Filesystem Layer

Journaling (ext4) or logging (xfs) adds extra IO. Use filefrag to inspect fragmentation, which can turn sequential reads into random reads.
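For example, to check whether a large database file has become fragmented (the path is illustrative):

# A large extent count relative to file size indicates fragmentation
sudo filefrag /var/lib/mysql/ibdata1          # summary: "N extents found"
sudo filefrag -v /var/lib/mysql/ibdata1       # per-extent layout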

2.3 Block Layer

This is the layer visible to iostat. Requests are merged, sorted, and scheduled. Modern kernels use the multi‑queue blk‑mq architecture, giving each CPU its own software queue.

2.4 Device Driver & Physical Device

NVMe devices expose many hardware queues (the spec allows up to 64K queues of up to 64K entries each; real devices typically expose a few dozen), while SATA SSDs have a single command queue (NCQ depth 32). This directly impacts concurrent IO performance.
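You can see how many blk‑mq hardware queues a device actually exposes through sysfs (the device name is illustrative; the mq directory layout assumes a blk‑mq kernel):

# One subdirectory per hardware queue
ls -d /sys/block/nvme0n1/mq/*/ | wc -l
# CPUs mapped to hardware queue 0
cat /sys/block/nvme0n1/mq/0/cpu_list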

3. IO Scheduler Details

Linux 6.x provides four schedulers. Choose based on device type:

none : No scheduling, direct to hardware – best for NVMe.

mq‑deadline : General purpose, deadline‑based, read‑preferring – good for SATA SSDs and HDDs.

bfq : Fair bandwidth allocation per process – ideal for multi‑tenant HDD workloads.

kyber : Light‑weight, latency‑targeted – suited for low‑latency SSDs.
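To check or switch the scheduler at runtime before persisting anything (sda is illustrative; a runtime change is lost on reboot):

# The current scheduler is shown in brackets
cat /sys/block/sda/queue/scheduler
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler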

Persist the choice with a udev rule, e.g.:

cat > /etc/udev/rules.d/60-io-scheduler.rules <<'EOF'
ACTION=="add|change", KERNEL=="nvme[0-9]*", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
EOF

4. iostat Deep Dive

iostat is the first stop, but it reports many fields. The most important ones are:

r/s / w/s : IO operations per second (IOPS). Sustained >150 IOPS on an HDD is suspicious; NVMe can handle >100k IOPS.

rkB/s / wkB/s : Throughput in KB/s.

rrqm/s / wrqm/s : Merged requests per second – high values mean the workload is sequential or nicely aligned.

r_await / w_await : Average latency (ms) including queue wait and device service. Typical ranges: HDD 5‑15 ms, SATA SSD 0.05‑0.5 ms, NVMe 0.01‑0.03 ms.

aqu-sz : Average queue length – large values indicate the device cannot keep up.

%util : Device busy time. Useful for HDDs, but misleading for SSD/NVMe (high %util can coexist with low latency).

The legacy svctm field is deprecated and should be ignored.

4.1 iostat Practical Checklist

# Real‑time view, 1‑second interval, MB units
iostat -xmt 1
# Focus on a specific disk
iostat -xmt -d nvme0n1 1

4.2 Quick Diagnosis Template

# 1. await high + aqu‑sz high + %util high → disk truly saturated (common on HDD)
# 2. await high + aqu‑sz low + %util low → isolated slow IOs, possible hardware fault or degraded RAID array
# 3. await normal + w/s huge + wkB/s low → many tiny writes – consider merging or raise dirty_ratio
# 4. rrqm/s ≈ 0 + rareq‑sz very small → pure random reads, page cache ineffective
# 5. w_await >> r_await → write bottleneck – check fsync frequency or journal mode

5. iotop – Pinpoint IO‑Heavy Processes

iostat tells you the disk is busy; iotop tells you which process is responsible.

5.1 Basic Usage

# Requires root
sudo iotop -oP
# Faster C implementation (optional)
sudo apt install iotop-c && sudo iotop-c -oP

5.2 Interpreting Output

Key columns:

Total DISK READ/WRITE : All processes’ requested IO before page cache.

Actual DISK READ/WRITE : IO that really hit the device (writes are often lower because of caching).

IO> : Percentage of time the process waited for IO.

PRIO : IO priority, e.g. be/4 (best‑effort class 4).

A typical picture: MySQL reading at 98 MB/s while a backup tar reads at 22 MB/s, together saturating the disk.
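To capture this kind of evidence unattended (for example across a 02:00 backup window), iotop can run in batch mode and be logged; a minimal sketch:

# -b batch, -o only active processes, -t timestamps, -qqq suppress headers; every 5 s for one hour
sudo iotop -obtqqq -d 5 -n 720 >> /var/log/iotop.log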

5.3 Adjusting Process Priority

# Set class 3 (idle) for a backup process
sudo ionice -c 3 -p 3021
# Run a command directly with low priority
sudo ionice -c 3 nice -n 19 tar czf /backup/db.tar /var/lib/mysql

Note: ionice only works with bfq or mq‑deadline schedulers; none (NVMe) ignores it.

5.4 pidstat – Scriptable Per‑Process IO

# Show IO every second for 10 samples
pidstat -d 1 10
# Focus on a single PID
pidstat -d -p 12847 1

Fields include kB_rd/s, kB_wr/s, kB_ccwr/s (cancelled writes) and iodelay (ticks spent waiting).
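Because the output is plain text, it is easy to post-process; a rough sketch that ranks processes by average write rate (column positions may differ between sysstat versions):

# kB_wr/s is the 5th field of the 'Average:' summary lines in recent sysstat releases
pidstat -d 60 1 | awk '$1=="Average:" && $3 ~ /^[0-9]+$/' | sort -k5 -nr | head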

6. blktrace + btt – Block‑Layer Deep Analysis

6.1 How blktrace Works

blktrace inserts tracepoints in the block layer and records the full lifecycle of each request:

Q (Queued)     → request enters the block queue
G (Get)        → request structure allocated
M (Merged)     → merged with an existing request
I (Inserted)   → inserted into the scheduler queue
D (Dispatched) → sent to the device driver
C (Completed)  → device finishes the IO

Time differences give per‑stage latency (e.g. Q→D = software queue time, D→C = hardware service time).

6.2 Capturing Traces

# Record 10 seconds on /dev/sda
sudo blktrace -d /dev/sda -w 10 -o trace
# Convert to readable format
blkparse -i trace -o trace.txt

6.3 Summarizing with btt

# Generate latency distribution
blkparse -i trace -d trace.bin
btt -i trace.bin

Typical btt output shows Q2C (total latency), Q2D (software latency) and D2C (hardware latency). If Q2D dominates, the bottleneck is in the scheduler or queue; if D2C dominates, the device itself is the limit.
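If the histogram summary is not enough, btt can also dump per‑IO latencies to files for plotting (assuming the -l / -q latency-output options described in btt(1)):

# Writes one D2C latency file and one Q2C latency file per device
btt -i trace.bin -l d2c_lat -q q2c_lat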

6.4 Visualisation with iowatcher

# Install and generate SVG chart
sudo apt install iowatcher
iowatcher -t trace -o io-pattern.svg

The chart displays IOPS, throughput, latency distribution, and offset distribution (sequential vs random).

7. fio – Disk Benchmarking

Before tuning, know the raw capability of the disk.

7.1 Core Parameters

--rw : IO pattern (read, write, randread, randwrite, randrw).

--bs : Block size (4k for random DB workloads, 1M for sequential).

--iodepth : Queue depth (HDD 1‑4, SATA SSD 32, NVMe 64‑128).

--ioengine : libaio or io_uring (recommended on kernel 6.x).

--numjobs : Number of parallel jobs.

--size : Test file size (≥ 2× RAM to avoid cache effects).

--direct=1 : Bypass the page cache.

--runtime : 60‑120 s for stable results.

7.2 Example Scenarios

# Random read 4 k on NVMe (DB‑like)
fio --name=rand-read --ioengine=io_uring --rw=randread \
    --bs=4k --iodepth=64 --numjobs=4 --size=4G \
    --direct=1 --runtime=60 --group_reporting \
    --filename=/dev/nvme0n1

# Sequential read 1 M (large file scan)
fio --name=seq-read --ioengine=io_uring --rw=read \
    --bs=1m --iodepth=16 --size=8G \
    --direct=1 --runtime=60 --group_reporting \
    --directory=/mnt/test

7.3 Interpreting fio Output

Key fields:

IOPS : Operations per second.

BW : Bandwidth (MiB/s).

slat : Submission latency (user‑to‑kernel).

clat : Completion latency (kernel‑to‑device).

lat : Total latency = slat + clat.

clat percentiles : Tail latency (P99, P99.9).
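For automated comparisons it is easier to parse fio's JSON output than the human-readable summary; a sketch using jq (field paths follow fio's current JSON schema and may differ between versions):

fio --name=rand-read --ioengine=io_uring --rw=randread --bs=4k --iodepth=64 \
    --size=4G --direct=1 --runtime=60 --filename=/dev/nvme0n1 \
    --output-format=json --output=result.json
# IOPS, bandwidth (KiB/s) and P99 completion latency (ns) of the read phase
jq '.jobs[0].read | {iops, bw, p99_ns: .clat_ns.percentile["99.000000"]}' result.json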

7.4 io_uring vs libaio

# libaio baseline
fio --name=aio-test --ioengine=libaio --rw=randread \
    --bs=4k --iodepth=128 --size=4G --direct=1 --runtime=30
# io_uring version
fio --name=uring-test --ioengine=io_uring --rw=randread \
    --bs=4k --iodepth=128 --size=4G --direct=1 --runtime=30

io_uring typically yields 10‑30 % higher IOPS because it reduces system calls.

8. Filesystem Choice

The filesystem directly impacts IO performance.

8.1 ext4 / xfs / btrfs Comparison

ext4 : Very stable, good for general Linux servers, excellent small‑file performance.

xfs : Excellent large sequential writes, good for databases and Kubernetes nodes.

btrfs : Native snapshots, transparent compression (zstd/lzo), useful for backup or container storage.

Recommendation:

Database servers (MySQL/PostgreSQL): xfs – low‑latency journaling.

General servers: ext4 – mature and well‑documented.

Snapshot/compression needs: btrfs .

Kubernetes nodes: xfs – works best with overlayfs.

8.2 Mount Options for Performance

# ext4 high‑performance mount (barrier=0 and data=writeback sacrifice crash safety – only use with power‑loss protection)
mount -o noatime,nodiratime,barrier=0,data=writeback /dev/sda1 /data
# xfs high‑performance mount
mount -o noatime,logbufs=8,logbsize=256k /dev/sda1 /data
# Persist in /etc/fstab
/dev/nvme0n1p1 /data xfs defaults,noatime,logbufs=8,logbsize=256k 0 2

9. IO Tuning Parameters

9.1 Readahead

# Show current value (default 256 = 128 KB)
blockdev --getra /dev/sda
# Increase for sequential workloads (e.g. Kafka)
sudo blockdev --setra 2048 /dev/sda   # 1 MB
# Decrease for random DB workloads
sudo blockdev --setra 64 /dev/sda    # 32 KB

9.2 Dirty Page Ratios

# Current settings
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_expire_centisecs vm.dirty_writeback_centisecs
# Database‑focused (low latency)
sysctl -w vm.dirty_ratio=5
sysctl -w vm.dirty_background_ratio=2
sysctl -w vm.dirty_expire_centisecs=1000
sysctl -w vm.dirty_writeback_centisecs=100
# Log‑heavy (throughput)
sysctl -w vm.dirty_ratio=40
sysctl -w vm.dirty_background_ratio=20
sysctl -w vm.dirty_expire_centisecs=6000
sysctl -w vm.dirty_writeback_centisecs=500
# Persist in /etc/sysctl.d/60-io-tuning.conf
cat > /etc/sysctl.d/60-io-tuning.conf <<'EOF'
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2
vm.dirty_expire_centisecs = 1000
vm.dirty_writeback_centisecs = 100
EOF
sysctl --system

9.3 Queue Depth & nr_requests

# View current depth
cat /sys/block/sda/queue/nr_requests
# Increase for high‑concurrency workloads
echo 1024 > /sys/block/sda/queue/nr_requests
# Control request merging (0 = allow, 2 = disable)
cat /sys/block/sda/queue/nomerges
# For pure random IO, set to 2 to skip merge checks
echo 2 > /sys/block/sda/queue/nomerges

9.4 cgroup v2 IO Limiting

# Identify device major:minor (e.g. 8:0 for /dev/sda)
ls -l /dev/sda
# Create a cgroup for backup jobs
mkdir -p /sys/fs/cgroup/backup-jobs
echo "+io" > /sys/fs/cgroup/backup-jobs/cgroup.subtree_control
# Limit to 50 MB/s and 1000 IOPS
echo "8:0 rbps=52428800 wbps=52428800 riops=1000 wiops=1000" > /sys/fs/cgroup/backup-jobs/io.max
# Add the backup process to the group
echo $BACKUP_PID > /sys/fs/cgroup/backup-jobs/cgroup.procs
# Optional latency target (5 ms)
echo "8:0 target=5000" > /sys/fs/cgroup/backup-jobs/io.latency

10. Case Studies

10.1 Nightly Backup Slowing MySQL

Symptoms: MySQL slow‑query count spikes at 02:00, await rises from 0.5 ms to 15 ms.

Investigation:

iostat shows w_await jump.

iotop reveals MySQL and a tar backup both reading heavily.

Conclusion: Backup consumes most bandwidth, starving MySQL.

Solutions:

# Lower backup priority
ionice -c 3 nice -n 19 rsync -avz /var/lib/mysql/ backup:/backup/mysql
# Or cgroup bandwidth limit
echo "8:0 rbps=52428800 wbps=52428800" > /sys/fs/cgroup/backup/io.max
# Or rsync bandwidth limit
rsync -avz --bwlimit=50000 /var/lib/mysql/ backup:/backup/mysql

10.2 ext4 journal causing Write Amplification

Symptoms: Elasticsearch write throughput drops from 200 MB/s to 40 MB/s; w/s high but wareq‑sz only 4 KB.

Investigation:

iostat shows high write IOPS with tiny request size.

blktrace identifies most writes coming from jbd2 (ext4 journal).

tune2fs -l reveals the default mount option journal_data (equivalent to data=journal), which journals data as well as metadata, roughly doubling write volume.

Fix:

# Remount with ordered mode (only metadata journaled)
sudo mount -o remount,data=ordered /data
# Or set in /etc/fstab
/dev/sda1 /data ext4 defaults,noatime,data=ordered 0 2

10.3 NVMe %util 100 % but No Latency Impact

Alert: %util constantly 100 % on an NVMe, yet r_await stays at 0.08 ms.

Explanation:

NVMe can handle >500 k IOPS; the observed 45 k IOPS is far below capacity. %util only measures the fraction of wall-clock time during which the device had at least one request in flight. With many hardware queues serving requests in parallel, some request is almost always outstanding, so the metric pegs at 100 % even though each individual request completes in ~0.08 ms and the device has ample headroom.

Action: Stop using %util for NVMe alerts; monitor await and IOPS instead (for example, alert when r_await/w_await exceeds a device‑appropriate threshold).

11. Troubleshooting Workflow Summary

System slowdown → top %wa high → iostat await high →
   • NVMe: check await & IOPS (ignore %util)
   • HDD: check await + %util + aqu‑sz
→ iotop to find offending process →
   Analyze IO pattern (rareq‑sz, wareq‑sz) →
   If needed, blktrace + btt for stage latency →
   bpftrace for latency distribution →
   fio for baseline performance →
   Apply tuning (scheduler, readahead, dirty ratios, cgroup limits) →
   Verify with iostat / iotop again.
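The bpftrace step in the workflow above can be as simple as a one-liner that histograms block-layer latency per request, similar to bcc's biolatency (a minimal sketch keyed on sector; tracepoint argument names can vary slightly across kernel versions):

sudo bpftrace -e '
tracepoint:block:block_rq_issue    { @start[args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@start[args->sector]/ {
    @usecs = hist((nsecs - @start[args->sector]) / 1000);
    delete(@start[args->sector]);
}'
# Press Ctrl-C to print the @usecs latency histogram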

12. Tool Cheat Sheet

top : Quick view of %wa (negligible overhead).

iostat : Overall IOPS, latency, throughput (very low overhead).

sar : Historical trends from /var/log/sysstat/ (no runtime cost).

iotop : Real‑time per‑process IO (low overhead, root required).

pidstat : Scriptable per‑process IO stats.

ionice : Adjust process IO priority.

blktrace : Block‑layer request tracing (moderate data volume).

btt : Offline analysis of blktrace data.

fio : Disk benchmark (high load, use carefully).

biolatency : IO latency histogram (low overhead).

biosnoop : Per‑request trace with process name.

ext4slower / xfsslower : Filesystem‑level slow‑IO tracing.

13. Tuning Reference Table

IO Scheduler : /sys/block/*/queue/scheduler – NVMe none, HDD mq-deadline, multi‑tenant HDD bfq.

Readahead : blockdev --setra – 1 MB for sequential, 32 KB for random.

Dirty Ratios : vm.dirty_ratio (hard) and vm.dirty_background_ratio (soft) – DB: 5 % / 2 %; log‑heavy: 40 % / 20 %.

Dirty Expire / Writeback : vm.dirty_expire_centisecs and vm.dirty_writeback_centisecs – DB: 1000 / 100, logs: 6000 / 500.

Queue Depth : /sys/block/*/queue/nr_requests – increase to 1024 for high concurrency.

IO Merge : /sys/block/*/queue/nomerges – 2 for pure random, 0 for sequential.

Mount Options : noatime always; data=ordered for databases, data=writeback only with power‑loss protection.

14. Conclusions

Key take‑aways:

Master the Linux IO stack and use the layered approach: iostat → iotop → blktrace → bpftrace → fio.

For HDDs, %util is a reliable saturation indicator; for SSD/NVMe focus on await and IOPS.

Choose the right scheduler: none for NVMe, mq-deadline for SATA/HDD, bfq for shared HDD.

Adjust dirty_ratio and dirty_background_ratio to eliminate write‑latency spikes in databases.

Tailor readahead to workload (large for sequential, small for random).

Use cgroup v2 io.max / io.latency / io.weight to enforce fairness in multi‑tenant or container environments.

15. Further Learning

Linux Block IO Layer documentation – deep dive into the blk‑mq architecture.

iostat(1) man page – precise field definitions.

Brendan Gregg’s “BPF Performance Tools” – eBPF tracing for storage.

fio official documentation – exhaustive parameter guide.

Linux Storage Stack Diagram – visual overview of each layer.

io_uring kernel docs and liburing GitHub – modern asynchronous IO.

bcc/libbpf‑tools – source of biolatency, biosnoop, etc.

NVMe‑CLI – SMART data, firmware updates, namespace management.

Systems Performance (2nd ed.) by Brendan Gregg – classic performance analysis.
