Production‑Grade Linux Disk I/O Tuning: From Theory to Hands‑On Practice
This comprehensive guide walks you through the fundamentals of Linux disk I/O performance, explains how to interpret key metrics such as IOPS, throughput and latency, and provides step‑by‑step instructions, scripts and configuration examples for diagnosing bottlenecks, optimizing filesystems, kernel parameters, application settings and storage layouts in production environments.
Disk I/O Core Concepts
Performance is described by three metrics: IOPS (operations per second), throughput (bytes per second) and latency (time per operation). They are related by IOPS × average I/O size = throughput and, by Little's law, IOPS × average latency ≈ average number of I/Os in flight. Understanding these relationships is the basis for any I/O analysis.
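A quick worked example of these relationships may help. The sketch below uses made-up workload numbers (20,000 IOPS, 4 KiB requests, 0.5 ms latency) and two hypothetical helper functions; the second applies Little's law, where the average number of I/Os in flight is roughly IOPS × latency.

```python
def throughput_bytes_per_s(iops: float, io_size_bytes: int) -> float:
    """Throughput = IOPS x average I/O size."""
    return iops * io_size_bytes


def outstanding_ios(iops: float, latency_s: float) -> float:
    """Little's law: average I/Os in flight = completion rate x time per I/O."""
    return iops * latency_s


if __name__ == "__main__":
    # Hypothetical workload: 20,000 IOPS of 4 KiB requests at 0.5 ms each.
    tput = throughput_bytes_per_s(20_000, 4096)
    print(f"throughput: {tput / 1e6:.1f} MB/s")
    print(f"in flight:  {outstanding_ios(20_000, 0.0005):.1f}")
```

Read the other way round, a device that sustains 10 I/Os in flight at 0.5 ms each cannot deliver more than about 20,000 IOPS, however fast the application submits.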
Storage Media Characteristics
HDD (7200 RPM) : 100‑150 IOPS, 100‑200 MB/s, 5‑15 ms latency.
SATA SSD : 50 k+ IOPS, 500‑600 MB/s, 0.1‑0.5 ms latency.
NVMe SSD : 200 k+ IOPS, >3 GB/s, 0.02‑0.2 ms latency.
Choose the device that matches the workload: random‑intensive workloads need SSD/NVMe; large sequential streams can use HDD or high‑capacity SSD.
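The HDD figure above can be sanity-checked with simple arithmetic: each random I/O costs roughly one average seek plus half a rotation. The sketch below is a rough model only; the seek times passed in are assumptions, not values from any specific drive, and the function name is hypothetical.

```python
def hdd_random_iops(rpm: int, avg_seek_ms: float) -> float:
    """Estimate random IOPS as 1 / (average seek + half a rotation)."""
    rotational_latency_ms = 0.5 * 60_000 / rpm  # half a revolution, in ms
    service_time_ms = avg_seek_ms + rotational_latency_ms
    return 1000.0 / service_time_ms


if __name__ == "__main__":
    # Assumed seek times; real drives publish their own figures.
    for seek_ms in (4.0, 8.5):
        print(f"seek {seek_ms} ms -> {hdd_random_iops(7200, seek_ms):.0f} IOPS")
```

With an assumed 4 ms average seek this gives roughly 122 IOPS for a 7200 RPM drive, inside the 100-150 range quoted above; a longer 8.5 ms seek gives roughly 79, which is why random-intensive workloads outgrow spinning disks quickly.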
Filesystem Impact
EXT4 : default on many distros, good for general use, moderate random‑write performance, supports data=writeback for maximum speed (risk of data loss).
XFS : excellent parallel I/O, better for databases and high‑concurrency workloads.
Btrfs : copy‑on‑write, snapshots, compression; still maturing for production.
tmpfs/ramfs : in‑memory, ideal for temporary data when persistence is not required.
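Common mount options that reduce metadata overhead are noatime and nodiratime (on Linux, noatime already implies nodiratime). For performance-critical paths you may disable barriers (nobarrier) and use data=writeback on EXT4, but only when a reliable power-loss protection mechanism exists. Recent kernels no longer accept nobarrier on XFS, so check the mount documentation for your kernel version before relying on it.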
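One way to audit which filesystems still lack a given mount option is to scan text in the /proc/mounts format (device, mount point, type, comma-separated options). The sketch below is a minimal example operating on a string; the function name and the sample mount table are hypothetical.

```python
from __future__ import annotations


def mounts_missing_option(mounts_text: str, option: str = "noatime") -> list[str]:
    """Return mount points of ext4/xfs filesystems that lack `option`."""
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type in ("ext4", "xfs") and option not in options.split(","):
            missing.append(mount_point)
    return missing


# Hypothetical sample in /proc/mounts format.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime,errors=remount-ro 0 0
/dev/sdb1 /var/lib/mysql xfs rw,noatime,attr2,inode64 0 0
tmpfs /run tmpfs rw,nosuid,nodev 0 0
"""

if __name__ == "__main__":
    print(mounts_missing_option(SAMPLE))
```

On a live system the same function could be fed the contents of /proc/mounts; filesystems mounted with the default relatime will be reported as missing noatime.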
I/O Scheduler Selection
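noop : simple FIFO, best for SSD/NVMe (called none on multi-queue kernels).
deadline : deadline-driven, good for latency-sensitive databases on HDD or SSD.
cfq : fair-queueing, generally not optimal for SSD.
mq-deadline / kyber : multi-queue schedulers; mq-deadline is the multi-queue counterpart of deadline, and kyber targets fast devices such as NVMe.
The legacy single-queue schedulers (noop, deadline, cfq) were removed in Linux 5.0, so on current kernels the choice is among none, mq-deadline, kyber and bfq. Check what a device offers with cat /sys/block/<dev>/queue/scheduler.
Set the scheduler per device with a udev rule, e.g. ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="noop" for non-rotational disks (use "none" instead of "noop" on multi-queue kernels).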
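The kernel reports schedulers in /sys/block/<dev>/queue/scheduler as a space-separated list with the active one in square brackets. The sketch below parses that text and picks a scheduler following the guidance in this section; both function names are hypothetical, and the preference order is an assumption derived from the list above rather than a kernel recommendation.

```python
from __future__ import annotations


def parse_scheduler(sysfs_text: str) -> tuple[str, list[str]]:
    """Return (active, available) from text like 'mq-deadline kyber [none]'."""
    active = ""
    available = []
    for name in sysfs_text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    if not active and len(available) == 1:
        active = available[0]  # some kernels print a bare 'none'
    return active, available


def preferred_scheduler(rotational: bool, available: list[str]) -> str | None:
    """Pick the first matching scheduler: FIFO-style for SSD, deadline for HDD."""
    wanted = ("mq-deadline", "deadline") if rotational else ("none", "noop")
    for name in wanted:
        if name in available:
            return name
    return None


if __name__ == "__main__":
    active, available = parse_scheduler("[mq-deadline] kyber bfq none\n")
    print(active, available, preferred_scheduler(False, available))
```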
Performance Metrics and Interpretation
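%util : device busy time; >80 % (HDD) or >90 % (SSD) indicates saturation. On SSD, NVMe and RAID devices that serve requests in parallel, %util can reach 100 % well before the device is saturated, so confirm with latency and queue depth.
await : average request latency (split into r_await and w_await in newer sysstat); >100 ms (HDD) or >10 ms (SSD) is a warning.
avgqu-sz : average queue depth (named aqu-sz in newer sysstat); >4 suggests moderate load, >8 signals severe queuing.
avgrq-sz : average request size, reported in 512-byte sectors (newer sysstat reports rareq-sz and wareq-sz in KB instead); small values (<8 KB, i.e. under 16 sectors) mean many tiny random I/Os.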
Combine these numbers to identify the bottleneck type (device saturation, queue buildup, metadata bottleneck, or latency jitter).
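As an illustration of combining the metrics, the sketch below applies the thresholds listed above to one iostat sample. The DiskSample type, the classify function and the finding labels are all hypothetical; the thresholds are the ones from this section and should be tuned per environment.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DiskSample:
    util_pct: float     # iostat %util
    await_ms: float     # iostat await
    queue_depth: float  # iostat avgqu-sz (aqu-sz in newer sysstat)
    req_size_kb: float  # average request size, converted to KB
    rotational: bool    # True for HDD, False for SSD/NVMe


def classify(sample: DiskSample) -> list[str]:
    """Map one sample to findings using the thresholds from this section."""
    findings = []
    util_limit = 80 if sample.rotational else 90
    await_limit = 100 if sample.rotational else 10
    if sample.util_pct > util_limit:
        findings.append("device saturation")
    if sample.await_ms > await_limit:
        findings.append("high latency")
    if sample.queue_depth > 8:
        findings.append("severe queuing")
    elif sample.queue_depth > 4:
        findings.append("moderate queuing")
    if sample.req_size_kb < 8:
        findings.append("small random I/O")
    return findings


if __name__ == "__main__":
    print(classify(DiskSample(98, 120, 12, 4, rotational=False)))
```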
Bottleneck Diagnosis Flow
1. Observe symptom (high latency, low TPS, high iowait)
2. Verify iowait with top
3. Identify the busy device with iostat -xcz 1
4. Locate the offending process using iotop or pidstat
5. Classify the I/O pattern (read/write ratio, request size, queue depth)
6. Apply targeted fixes (scheduler, mount options, kernel params, app tuning)
7. Re-measure and confirm improvement
Optimization Techniques
Filesystem Level
Use XFS for high‑concurrency databases; EXT4 for general purpose.
Mount with noatime,nodiratime at minimum.
Consider nobarrier,data=writeback only on systems with battery‑backed caches.
Run fsck (xfs_repair on XFS) against the unmounted filesystem if corruption is suspected.
Kernel Parameters
Adjust dirty‑page settings for write‑heavy workloads:
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10
vm.dirty_expire_centisecs = 3000
vm.dirty_writeback_centisecs = 500
Set vm.swappiness = 10 to reduce swap pressure.
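Increase the block-layer request queue if the device supports it: echo 256 > /sys/block/sda/queue/nr_requests.
Tune read-ahead based on workload (e.g., 128 KB for random reads, 4 MB for large sequential streams) through /sys/block/<dev>/queue/read_ahead_kb or blockdev --setra, which takes its value in 512-byte sectors.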
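It can help to translate these settings into concrete sizes before applying them. The sketch below is a rough calculator with hypothetical function names: it converts a read-ahead size into the units used by read_ahead_kb and by blockdev --setra (512-byte sectors), and estimates the dirty-page thresholds in bytes. Note the assumption that the kernel applies the dirty ratios to available (free plus reclaimable) memory rather than total RAM, so the input should reflect that.

```python
from __future__ import annotations

SECTOR_BYTES = 512  # unit used by blockdev --setra


def readahead_settings(size_kib: int) -> dict[str, int]:
    """Express a read-ahead size in sysfs KB and in 512-byte sectors."""
    return {
        "read_ahead_kb": size_kib,
        "setra_sectors": size_kib * 1024 // SECTOR_BYTES,
    }


def dirty_limits_bytes(available_bytes: int,
                       dirty_ratio: int = 40,
                       background_ratio: int = 10) -> tuple[int, int]:
    """Return (background flush threshold, writer throttling threshold)."""
    background = available_bytes * background_ratio // 100
    hard = available_bytes * dirty_ratio // 100
    return background, hard


if __name__ == "__main__":
    print(readahead_settings(128))   # random-read example from the text
    print(readahead_settings(4096))  # 4 MB sequential example from the text
    bg, hard = dirty_limits_bytes(48 * 2**30)  # assumed 48 GiB available
    print(f"background flush at {bg / 2**30:.1f} GiB, throttle at {hard / 2**30:.1f} GiB")
```

A large dirty_ratio on a machine with a lot of memory means many gigabytes can be waiting for writeback, which is worth weighing against how long the device would take to flush them.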
Application Tuning
Batch small I/Os into larger requests.
Use asynchronous I/O where possible.
Enable in‑memory caches (e.g., MySQL innodb_buffer_pool_size = 60‑80 % of RAM).
MySQL specific settings for I/O‑intensive workloads:
[mysqld]
innodb_flush_log_at_trx_commit = 2 # trade‑off: up to 1 s data loss
innodb_io_capacity = 2000
innodb_io_capacity_max = 4000
innodb_flush_method = O_DIRECT
innodb_buffer_pool_size = 12G
innodb_buffer_pool_instances = 4
Redis persistence options: appendonly yes with appendfsync everysec for a good balance of durability and latency (up to about one second of writes can be lost on a crash).
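The advice to batch small I/Os into larger requests can be sketched at application level as a writer that accumulates records and hands them to the underlying file in larger chunks. BatchingWriter below is a hypothetical helper, not part of any library, and the 64 KiB default is an arbitrary assumption; it trades a little memory and some durability (unflushed records are lost on a crash) for fewer write calls.

```python
import io


class BatchingWriter:
    """Collect small records and write them out in larger chunks."""

    def __init__(self, sink, batch_bytes: int = 64 * 1024):
        self._sink = sink                # any object with a write(bytes) method
        self._batch_bytes = batch_bytes  # flush once this much data is pending
        self._pending = []
        self._pending_bytes = 0
        self.write_calls = 0             # number of writes issued to the sink

    def write(self, record: bytes) -> None:
        self._pending.append(record)
        self._pending_bytes += len(record)
        if self._pending_bytes >= self._batch_bytes:
            self.flush()

    def flush(self) -> None:
        if not self._pending:
            return
        self._sink.write(b"".join(self._pending))
        self.write_calls += 1
        self._pending.clear()
        self._pending_bytes = 0


if __name__ == "__main__":
    sink = io.BytesIO()
    writer = BatchingWriter(sink, batch_bytes=1024)
    for _ in range(100):
        writer.write(b"x" * 100)
    writer.flush()  # always flush the tail before closing
    print(writer.write_calls, len(sink.getvalue()))
```

In this run, 100 records of 100 bytes reach the sink in 10 writes instead of 100. Python's own buffered file objects already do something similar, so the class is mainly useful for showing the principle.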
Common Diagnostic Tools
iostat – device‑level I/O statistics.
# Show extended stats every second
iostat -x 1
iotop – per-process I/O usage.
# Interactive view
iotop
# One-shot, only active processes
iotop -b -o -n 1
pidstat – I/O per PID.
# Continuous per-process I/O
pidstat -d 1
fio – synthetic benchmark for random, sequential, and mixed workloads.
# 4 KB random read test (read-only; never point a write test at a device that holds data)
fio --name=randread --ioengine=libaio --direct=1 --iodepth=4 \
--rw=randread --bs=4k --size=1G --runtime=60 --time_based \
--filename=/dev/sdb1
smartctl – SMART health checks.
# Quick health summary
smartctl -H /dev/sda
Monitoring and Alerting
Deploy node_exporter (listens on 9100) to expose node_disk_* metrics.
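On Linux the node_disk_* counters come from /proc/diskstats, so it can be useful to see how utilization and average read latency fall out of two samples of that file. The sketch below is a minimal illustration with hypothetical function names and invented sample lines; it reads only the fields it needs (reads completed, milliseconds spent reading, milliseconds spent doing I/O).

```python
from __future__ import annotations


def parse_diskstats_line(line: str) -> dict:
    """Extract the counters used below from one /proc/diskstats line."""
    f = line.split()
    return {
        "device": f[2],
        "reads_completed": int(f[3]),
        "read_time_ms": int(f[6]),
        "io_time_ms": int(f[12]),  # time the device had I/O in progress
    }


def util_and_read_latency(before: str, after: str,
                          interval_s: float) -> tuple[float, float]:
    """Return (%util, average read latency in ms) between two samples."""
    a = parse_diskstats_line(before)
    b = parse_diskstats_line(after)
    util_pct = (b["io_time_ms"] - a["io_time_ms"]) / (interval_s * 1000) * 100
    reads = b["reads_completed"] - a["reads_completed"]
    latency_ms = (b["read_time_ms"] - a["read_time_ms"]) / reads if reads else 0.0
    return util_pct, latency_ms


# Invented samples taken one second apart.
BEFORE = "   8       0 sda 1000 0 80000 2000 500 0 40000 1500 0 3000 3500"
AFTER = "   8       0 sda 1500 0 120000 4500 700 0 56000 2100 0 3800 6600"

if __name__ == "__main__":
    util, latency = util_and_read_latency(BEFORE, AFTER, interval_s=1.0)
    print(f"util {util:.0f} %, read latency {latency:.1f} ms")
```

The Prometheus rate() expressions in this section perform the same delta-over-interval calculation, only on counters expressed in seconds and over a 5-minute window.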
Key Prometheus queries:
# Device utilization (>80 %)
rate(node_disk_io_time_seconds_total[5m]) * 100
# Average read latency (ms)
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]) * 1000
# Read throughput (MiB/s)
rate(node_disk_read_bytes_total[5m]) / 1024 / 1024
Sample Prometheus alerting rules (warning thresholds):
- alert: DiskIOHighUtilization
  expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Disk I/O utilization high"
    description: "Device {{ $labels.device }} util is {{ $value }}%"
- alert: DiskIOHighLatency
  expr: (rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]) * 1000) > 50
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Disk I/O latency high"
    description: "Device {{ $labels.device }} read latency {{ $value }} ms"
Risk Management and Rollback
Classify changes:
Low risk – read‑only queries, monitoring tweaks.
Medium risk – mount‑option changes, sysctl tweaks (apply during low‑traffic windows).
High risk – re‑formatting, filesystem conversion, RAID re‑configuration.
Backup before any high‑risk operation: configuration files, database dumps, raw data snapshots.
Rollback steps:
Restore the original /etc/fstab and remount the affected filesystems with their original options.
Re-apply the previous sysctl values.
Restart affected services and verify functionality.
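Re-applying previous sysctl values is easier when they were captured before the change. The sketch below assumes two text snapshots in "key = value" form (as printed by sysctl for the keys of interest) and generates sysctl -w commands that restore only the values that changed. The function names are hypothetical, and values containing spaces would need quoting that this sketch does not handle.

```python
from __future__ import annotations


def parse_sysctl(text: str) -> dict[str, str]:
    """Parse 'key = value' lines, ignoring blanks and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def rollback_commands(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Build commands that restore every key whose value changed."""
    commands = []
    for key in sorted(after):
        if key in before and before[key] != after[key]:
            commands.append(f"sysctl -w {key}={before[key]}")
    return commands


if __name__ == "__main__":
    # Hypothetical snapshots taken before and after tuning.
    before = parse_sysctl("vm.dirty_ratio = 20\nvm.swappiness = 60\n")
    after = parse_sysctl("vm.dirty_ratio = 40\nvm.swappiness = 10\n")
    for cmd in rollback_commands(before, after):
        print(cmd)
```

Keeping the "before" snapshot next to the change record means the rollback script can be generated rather than reconstructed from memory.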
Case Study Highlights
MySQL latency spike : %util 98 % on data SSD, await 120 ms, innodb_flush_log_at_trx_commit=1. Fix – change to 2, enlarge redo log, and batch writes. Result – %util dropped below 70 %, latency <10 ms.
Log server write bottleneck : %util 95 % on HDD, many small log files, no compression. Fix – enable logrotate compression, move logs to SSD, configure rsyslog async queue ($ActionFileEnableSync off). Result – %util ~45 %, no back-pressure.
Small-file storage overload : inode exhaustion, single directory with millions of 4 KB files on EXT4. Fix – reformat with a smaller bytes-per-inode ratio (mkfs.ext4 -i 4096, versus the usual default of 16384) so that more inodes are created, or migrate to XFS (dynamic inode allocation), and restructure directories into a three-level hash. Result – inode usage <20 %, directory listings fast.
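The three-level hash layout mentioned in the last case can be sketched as a function that derives the directory path from a hash of the file name. The function name, the choice of SHA-256 and the two-hex-character width per level are assumptions for illustration; the source does not say which scheme was used.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_path(root: str, filename: str,
                levels: int = 3, width: int = 2) -> PurePosixPath:
    """Place `filename` under `levels` directories named from its hash."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(root, *parts, filename)


if __name__ == "__main__":
    # Hypothetical root and file name.
    print(hashed_path("/data", "invoice-000123.pdf"))
```

With two hex characters per level, each directory holds at most 256 subdirectories and files spread across up to 256³ (about 16.7 million) leaf directories, which keeps any single directory small enough to list quickly.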
Quick Checklist
Verify device type and match scheduler (noop or none for SSD, deadline or mq-deadline for HDD).
Mount with noatime,nodiratime; add nobarrier only if power‑loss protection exists.
Set kernel dirty‑page parameters appropriate for workload.
Monitor %util, await, avgqu‑sz, avgrq‑sz regularly.
Establish alert thresholds based on storage class (e.g., HDD await >30 ms, SSD await >5 ms).
Run baseline iostat and fio tests before changes.
Document all configuration changes and keep rollback scripts ready.
Ops Community
A leading IT operations community where professionals share and grow together.
