Production‑Grade Linux Disk I/O Tuning: From Theory to Hands‑On Practice

This comprehensive guide walks you through the fundamentals of Linux disk I/O performance, explains how to interpret key metrics such as IOPS, throughput and latency, and provides step‑by‑step instructions, scripts and configuration examples for diagnosing bottlenecks, optimizing filesystems, kernel parameters, application settings and storage layouts in production environments.

Ops Community

Disk I/O Core Concepts

Performance is described by three metrics: IOPS (operations per second), throughput (bytes per second) and latency (time per operation). They are related by IOPS × average I/O size = throughput and, via Little's Law, outstanding I/Os ≈ IOPS × latency, so at a fixed latency higher IOPS requires more concurrency. Understanding these relationships is the basis for any I/O analysis.
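The arithmetic behind these relationships can be checked with a short Python sketch. The function names are illustrative only, and the second helper relies on Little's Law (outstanding I/Os = IOPS × latency), rearranged to give the IOPS ceiling at a given concurrency.

```python
def throughput_mb_s(iops, io_size_kb):
    """Throughput in MB/s from IOPS and average I/O size (1 MB = 1024 KB here)."""
    return iops * io_size_kb / 1024


def iops_ceiling(outstanding_ios, latency_ms):
    """Little's Law rearranged: IOPS = outstanding I/Os / latency in seconds."""
    return outstanding_ios * 1000 / latency_ms


# 50,000 IOPS at 4 KB each is roughly 195 MB/s.
print(throughput_mb_s(50_000, 4))
# 32 requests in flight at 0.5 ms each allow at most 64,000 IOPS.
print(iops_ceiling(32, 0.5))
```

The same numbers explain why a device with excellent latency can still show low IOPS when the application issues one request at a time.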

Storage Media Characteristics

HDD (7200 RPM) : 100‑150 IOPS, 100‑200 MB/s, 5‑15 ms latency.

SATA SSD : 50 k+ IOPS, 500‑600 MB/s, 0.1‑0.5 ms latency.

NVMe SSD : 200 k+ IOPS, >3 GB/s, 0.02‑0.2 ms latency.

Choose the device that matches the workload: random‑intensive workloads need SSD/NVMe; large sequential streams can use HDD or high‑capacity SSD.
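The HDD figures follow from mechanics: each random access pays a seek plus, on average, half a platter revolution. A rough Python model is sketched below. The average seek time is an assumed input, not a value from this article, and real drives vary with model, queueing and how much of the platter is used.

```python
def rotational_latency_ms(rpm):
    """Average rotational latency: half of one revolution, in milliseconds."""
    return 0.5 * 60_000 / rpm


def hdd_random_iops(rpm, avg_seek_ms):
    """Approximate random IOPS for one spindle with no queue reordering."""
    return 1000 / (avg_seek_ms + rotational_latency_ms(rpm))


# Assumed 4 ms average seek on a 7200 RPM drive; longer seeks give lower results.
print(round(rotational_latency_ms(7200), 2), round(hdd_random_iops(7200, 4.0)))
```

SSDs have no such mechanical term, which is why their random and sequential figures are so much closer together.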

Filesystem Impact

EXT4 : default on many distros, good for general use, moderate random‑write performance, supports data=writeback for maximum speed (after a crash, recently written files may contain stale data).

XFS : excellent parallel I/O, better for databases and high‑concurrency workloads.

Btrfs : copy‑on‑write, snapshots, compression; still maturing for production.

tmpfs/ramfs : in‑memory, ideal for temporary data when persistence is not required.

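Common mount options that reduce metadata overhead are noatime and nodiratime (noatime already implies nodiratime, so listing both is harmless but redundant). For performance‑critical paths you may disable barriers (nobarrier) and use data=writeback on EXT4, but only when a reliable power‑loss protection mechanism exists. XFS deprecated nobarrier and no longer accepts it on recent kernels (it was removed in 4.19).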

I/O Scheduler Selection

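noop (none on blk‑mq kernels) : simple FIFO with no reordering, best for SSD/NVMe.

deadline : deadline‑driven, good for latency‑sensitive databases on HDD or SSD.

cfq : fair‑queueing, generally not optimal for SSD.

mq‑deadline / kyber / bfq : multi‑queue (blk‑mq) schedulers. The legacy single‑queue schedulers above (noop, deadline, cfq) were removed in Linux 5.0, so on current kernels the choice is among none, mq‑deadline, kyber and bfq.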

Set the scheduler per device with a udev rule, e.g. ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="noop" for non‑rotational disks (write "none" instead of "noop" on blk‑mq kernels).
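Before writing a rule it helps to see which schedulers a kernel actually offers. The file /sys/block/<device>/queue/scheduler lists them on one line with the active one in square brackets. The Python sketch below parses that line and applies the recommendations from this section; the function names and the preference order are illustrative assumptions, not a fixed policy.

```python
def parse_schedulers(text):
    """Return (active, available) from the contents of queue/scheduler."""
    names = text.split()
    active = next((n.strip("[]") for n in names if n.startswith("[")), "")
    return active, [n.strip("[]") for n in names]


def recommend(rotational, available):
    """Pick the first preferred scheduler that this kernel offers."""
    preferred = ["mq-deadline", "deadline"] if rotational else ["none", "noop"]
    for name in preferred:
        if name in available:
            return name
    return available[0] if available else ""


# Sample file contents as they might appear on a blk-mq kernel.
active, available = parse_schedulers("[mq-deadline] kyber bfq none\n")
print(active, recommend(False, available))
```

On a live system the input would come from reading the sysfs file, and the rotational flag from /sys/block/<device>/queue/rotational.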

Performance Metrics and Interpretation

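%util : share of time the device had at least one request in flight; >80 % on an HDD indicates saturation. SSD and NVMe devices serve many requests in parallel, so >90 % there is only a hint and should be confirmed with await and queue depth.

await : average request latency including time spent queued; >100 ms (HDD) or >10 ms (SSD) is a warning.

avgqu‑sz (aqu‑sz in newer sysstat releases) : average queue depth; >4 suggests moderate load, >8 signals severe queuing.

avgrq‑sz : average request size, reported in 512‑byte sectors (newer sysstat releases report rareq‑sz and wareq‑sz in KB instead); small values (<8 KB, i.e. fewer than 16 sectors) mean many tiny random I/Os.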

Combine these numbers to identify the bottleneck type (device saturation, queue buildup, metadata bottleneck, or latency jitter).
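One way to combine the numbers is a small rule table. The Python sketch below encodes the thresholds listed above; the dictionary keys (util, await_ms, queue, req_kb) and the finding labels are hypothetical names chosen for this example, and the thresholds are starting points rather than universal limits.

```python
def classify(sample, rotational):
    """Return a list of findings for one iostat-style sample."""
    util_limit = 80 if rotational else 90
    await_limit_ms = 100 if rotational else 10
    findings = []
    if sample["util"] > util_limit:
        findings.append("device saturation")
    if sample["await_ms"] > await_limit_ms:
        findings.append("high latency")
    if sample["queue"] > 8:
        findings.append("severe queuing")
    elif sample["queue"] > 4:
        findings.append("moderate queuing")
    if sample["req_kb"] < 8:
        findings.append("small random I/O")
    return findings


# Made-up SSD sample: busy, slow, deep queue, tiny requests.
print(classify({"util": 97, "await_ms": 25, "queue": 12, "req_kb": 4}, rotational=False))
```

A sample that triggers several findings at once usually points at the workload pattern rather than at a single tunable.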

Bottleneck Diagnosis Flow

1. Observe symptom (high latency, low TPS, high iowait)
2. Verify iowait with top
3. Identify the busy device with iostat -xcz 1
4. Locate the offending process using iotop or pidstat
5. Classify the I/O pattern (read/write ratio, request size, queue depth)
6. Apply targeted fixes (scheduler, mount options, kernel params, app tuning)
7. Re‑measure and confirm improvement
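Step 4 can also be approximated without extra tools, because the kernel exposes per-process counters in /proc/<pid>/io (reading other users' processes requires root). The Python sketch below parses that format and ranks processes by bytes moved between two snapshots. The function names are illustrative and the sample text is made up.

```python
def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


def rank_by_disk_io(before, after):
    """Rank PIDs by read_bytes + write_bytes growth between two snapshots.

    `before` and `after` map pid -> parsed counters. PIDs missing from either
    snapshot (started or exited in between) are skipped.
    """
    totals = []
    for pid in before.keys() & after.keys():
        delta = sum(after[pid][k] - before[pid][k] for k in ("read_bytes", "write_bytes"))
        totals.append((pid, delta))
    return sorted(totals, key=lambda item: item[1], reverse=True)


sample = "rchar: 100\nwchar: 200\nread_bytes: 4096\nwrite_bytes: 8192\n"
print(parse_proc_io(sample))
```

read_bytes and write_bytes count traffic that reached the storage layer, so they are a better guide than rchar and wchar, which also include page-cache hits.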

Optimization Techniques

Filesystem Level

Use XFS for high‑concurrency databases; EXT4 for general purpose.

Mount with noatime,nodiratime at minimum.

Consider nobarrier,data=writeback only on systems with battery‑backed caches.

Run fsck (xfs_repair on XFS) on the unmounted filesystem if corruption is suspected.

Kernel Parameters

Adjust dirty‑page settings for write‑heavy workloads:

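# Block writers once dirty pages reach 40 % of available memory
vm.dirty_ratio = 40
# Start background writeback at 10 %
vm.dirty_background_ratio = 10
# Dirty data older than 30 s becomes eligible for writeback
vm.dirty_expire_centisecs = 3000
# Flusher threads wake up every 5 s
vm.dirty_writeback_centisecs = 500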
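Percentages hide how much data can be at risk. The Python sketch below converts the two ratios into absolute thresholds. It assumes the caller supplies the kernel's "available" memory (free plus reclaimable pages), which is what the ratios apply to; total RAM is only an upper bound. The function name is illustrative.

```python
def dirty_thresholds_mb(available_mb, dirty_ratio, dirty_background_ratio):
    """Return the background and blocking writeback thresholds in MB."""
    return {
        "background_mb": available_mb * dirty_background_ratio / 100,
        "blocking_mb": available_mb * dirty_ratio / 100,
    }


# With about 64,000 MB available, the settings above let roughly 25 GB of
# dirty pages accumulate before writers are throttled.
print(dirty_thresholds_mb(64_000, 40, 10))
```

Large thresholds smooth out write bursts but lengthen the flush that eventually follows and increase the amount of unwritten data lost in a crash.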

Set vm.swappiness = 10 so the kernel prefers reclaiming page cache over swapping out application memory.

Increase queue depth if the device supports it: echo 256 > /sys/block/sda/queue/nr_requests.

Tune read‑ahead based on workload (e.g., 128 KB for random reads, 4 MB for large sequential streams).
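Read-ahead is set either in KB through /sys/block/<device>/queue/read_ahead_kb or in 512-byte sectors through blockdev --setra, and the unit mismatch is a common source of mistakes. The Python sketch below does the conversion for the two sizes mentioned above. The helper names are illustrative and /dev/sda is only an example device.

```python
SECTOR_BYTES = 512  # blockdev --setra counts 512-byte sectors


def readahead_sectors(read_ahead_kb):
    """Convert a read-ahead size in KB to 512-byte sectors."""
    return read_ahead_kb * 1024 // SECTOR_BYTES


def setra_command(device, read_ahead_kb):
    """Build the blockdev command line for the given read-ahead size."""
    return f"blockdev --setra {readahead_sectors(read_ahead_kb)} {device}"


print(setra_command("/dev/sda", 128))   # random-read profile
print(setra_command("/dev/sda", 4096))  # large sequential profile
```

Neither method survives a reboot on its own, so the chosen value usually ends up in a udev rule or a boot-time script.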

Application Tuning

Batch small I/Os into larger requests.

Use asynchronous I/O where possible.

Enable in‑memory caches (e.g., MySQL innodb_buffer_pool_size = 60‑80 % of RAM).

MySQL specific settings for I/O‑intensive workloads:

[mysqld]
innodb_flush_log_at_trx_commit = 2   # trade‑off: up to 1 s data loss
innodb_io_capacity = 2000
innodb_io_capacity_max = 4000
innodb_flush_method = O_DIRECT
innodb_buffer_pool_size = 12G
innodb_buffer_pool_instances = 4
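The 60‑80 % guideline for innodb_buffer_pool_size can be turned into concrete bounds for a given host. The Python sketch below is illustrative only; it assumes a server dedicated to MySQL, and the function name is made up for this example.

```python
def buffer_pool_range_mb(ram_mb, low_pct=60, high_pct=80):
    """Return (low, high) buffer pool sizes in MB for a dedicated DB host."""
    return ram_mb * low_pct // 100, ram_mb * high_pct // 100


low, high = buffer_pool_range_mb(16 * 1024)
print(low, high)
# The 12G value in the sample config falls inside this range for a 16 GB host.
print(low <= 12 * 1024 <= high)
```

Hosts that share memory with other services need a lower percentage, since the buffer pool is not the only consumer of RAM inside mysqld either.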

Redis logging options: appendonly yes with appendfsync everysec for a good balance of durability and latency.

Common Diagnostic Tools

iostat – device‑level I/O statistics.

# Show extended stats every second
iostat -x 1
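iostat derives its columns from the cumulative counters in /proc/diskstats, sampled twice. The Python sketch below shows that calculation for IOPS, %util and await under the standard field layout (reads completed, time reading, writes completed, time writing, time doing I/O). The function names and the sample lines are made up for illustration.

```python
def parse_diskstats(text):
    """Parse /proc/diskstats text into {device: counters}."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue  # skip blank or unexpected lines
        stats[f[2]] = {
            "reads": int(f[3]), "read_ms": int(f[6]),
            "writes": int(f[7]), "write_ms": int(f[10]),
            "io_ms": int(f[12]),
        }
    return stats


def derive(before, after, interval_s):
    """Compute iostat-style figures for one device from two snapshots."""
    ios = (after["reads"] - before["reads"]) + (after["writes"] - before["writes"])
    wait_ms = (after["read_ms"] - before["read_ms"]) + (after["write_ms"] - before["write_ms"])
    busy_ms = after["io_ms"] - before["io_ms"]
    return {
        "iops": ios / interval_s,
        "util_pct": 100 * busy_ms / (interval_s * 1000),
        "await_ms": wait_ms / ios if ios else 0.0,
    }


t0 = parse_diskstats("8 0 sda 1000 0 8000 500 2000 0 16000 1500 0 800 2000")
t1 = parse_diskstats("8 0 sda 1100 0 8800 600 2300 0 18400 1900 0 1300 2500")
print(derive(t0["sda"], t1["sda"], interval_s=1))
```

Because the counters are cumulative since boot, the first report printed by iostat shows averages since boot and only later reports reflect the sampling interval.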

iotop – per‑process I/O usage.

# Interactive view
iotop
# One‑shot batch output, only processes doing I/O (requires root)
iotop -b -o -n 1

pidstat – I/O per PID.

# Continuous per‑process I/O
pidstat -d 1

fio – synthetic benchmark for random, sequential, and mixed workloads.

# 4 KB random read test (read‑only; --direct=1 bypasses the page cache)
fio --name=randread --ioengine=libaio --direct=1 --iodepth=4 \
    --rw=randread --bs=4k --size=1G --runtime=60 --time_based \
    --filename=/dev/sdb1
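For baselines that need to be compared later, fio can emit machine-readable results with --output-format=json. The Python sketch below pulls the headline figures out of such a report. It assumes the fio 3.x layout (a jobs list with per-direction iops, bw in KiB/s and clat_ns.mean); the function name and the sample report are made up.

```python
import json


def summarize_fio(json_text, direction="read"):
    """Extract IOPS, bandwidth and mean completion latency for the first job."""
    job = json.loads(json_text)["jobs"][0]
    stats = job[direction]
    return {
        "job": job["jobname"],
        "iops": stats["iops"],
        "bw_mib_s": stats["bw"] / 1024,                  # fio reports KiB/s
        "mean_clat_ms": stats["clat_ns"]["mean"] / 1e6,  # ns -> ms
    }


# Made-up report fragment, not a measured result.
report = json.dumps({"jobs": [{"jobname": "randread", "read": {
    "iops": 48000.0, "bw": 192000, "clat_ns": {"mean": 82000.0}}}]})
print(summarize_fio(report))
```

Storing these summaries alongside the fio command line makes before and after comparisons straightforward.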

smartctl – SMART health checks.

# Quick health summary
smartctl -H /dev/sda

Monitoring and Alerting

Deploy node_exporter (listens on 9100) to expose node_disk_* metrics.

Key Prometheus queries:

# Device utilization (>80 %)
rate(node_disk_io_time_seconds_total[5m]) * 100
# Average read latency (ms)
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]) * 1000
# Throughput
rate(node_disk_read_bytes_total[5m]) / 1024 / 1024
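These expressions can also be run from scripts through the Prometheus HTTP API, whose instant-query endpoint is /api/v1/query. The Python sketch below builds the request URL and reads per-device values from a response body. The base URL, function names and sample response are assumptions for illustration, and no network call is made here.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


def vector_by_device(response):
    """Map the device label to its float value in an instant-vector response."""
    if response.get("status") != "success":
        raise ValueError("query failed")
    return {item["metric"].get("device", ""): float(item["value"][1])
            for item in response["data"]["result"]}


print(instant_query_url("http://localhost:9090",
                        "rate(node_disk_io_time_seconds_total[5m]) * 100"))
# Made-up response body in the documented shape.
sample = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"device": "sda"}, "value": [1700000000, "42.5"]}]}}
print(vector_by_device(sample))
```

Sample values arrive as strings in the API response, which is why the helper converts them to floats.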

Sample Prometheus alerting rules, routed through Alertmanager (warning thresholds):

- alert: DiskIOHighUtilization
  expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Disk I/O utilization high"
    description: "Device {{ $labels.device }} util is {{ $value }}%"

- alert: DiskIOHighLatency
  expr: (rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]) * 1000) > 50
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Disk I/O latency high"
    description: "Device {{ $labels.device }} read latency {{ $value }} ms"
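The for: 5m clause means the expression must stay true across every evaluation for five minutes before the alert fires; until then it is pending, and a single false evaluation resets the timer. The Python sketch below is a simplified model of that behavior, not Prometheus code. The function name and sample series are made up.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state after each (timestamp, value) evaluation."""
    states = []
    active_since = None
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition just became true
            held = ts - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


# Utilization sampled every 60 s; the dip at t=120 delays firing until t=480.
series = [(0, 85), (60, 90), (120, 70), (180, 85), (240, 88),
          (300, 90), (360, 91), (420, 95), (480, 99)]
print(alert_states(series, threshold=80, for_seconds=300))
```

This is why short spikes above 80 % do not page anyone, while a sustained plateau does.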

Risk Management and Rollback

Classify changes:

Low risk – read‑only queries, monitoring tweaks.

Medium risk – mount‑option changes, sysctl tweaks (apply during low‑traffic windows).

High risk – re‑formatting, filesystem conversion, RAID re‑configuration.

Backup before any high‑risk operation: configuration files, database dumps, raw data snapshots.

Rollback steps:

Restore the original /etc/fstab and remount the affected filesystems with their original options.

Re‑apply previous sysctl values.

Restart affected services and verify functionality.
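Rollback of kernel parameters is easier when the previous values were captured before the change. The Python sketch below compares a saved snapshot with the current values and emits sysctl -w commands only for the keys that differ. The function name and the sample values are illustrative; capturing the snapshot (for example from sysctl output) is left out.

```python
def rollback_plan(saved, current):
    """Return sysctl commands that restore every key whose value changed."""
    return [f"sysctl -w {key}={saved[key]}"
            for key in sorted(saved)
            if current.get(key) != saved[key]]


# Made-up snapshot taken before tuning, and the values in effect afterwards.
saved = {"vm.dirty_ratio": "20", "vm.swappiness": "60"}
current = {"vm.dirty_ratio": "40", "vm.swappiness": "60"}
for command in rollback_plan(saved, current):
    print(command)
```

Printing the commands instead of running them keeps the rollback reviewable, which suits the medium-risk change window described above.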

Case Study Highlights

MySQL latency spike : %util 98 % on data SSD, await 120 ms, innodb_flush_log_at_trx_commit=1. Fix – change to 2, enlarge redo log, and batch writes. Result – %util dropped below 70 %, latency <10 ms.

Log server write bottleneck : %util 95 % on HDD, many small log files, no compression. Fix – enable logrotate compression, move logs to SSD, configure rsyslog async queue ($ActionFileEnableSync off). Result – %util ~45 %, no back‑pressure.

Small‑file storage overload : inode exhaustion, single directory with millions of 4 KB files on EXT4. Fix – reformat with a smaller bytes‑per‑inode ratio (mkfs.ext4 -i 4096, which creates more inodes) or migrate to XFS (dynamic inode allocation) and restructure directories into a three‑level hash. Result – inode usage <20 %, directory listings fast.
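The small-file fix has two parts that are easy to sketch in Python: spreading files over a three-level hashed directory tree, and estimating how many inodes a given bytes-per-inode ratio yields. The function names, the choice of SHA-256 and the two-hex-character shard width are assumptions for illustration; 16384 is used as the common mke2fs default ratio.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_path(root, filename, levels=3):
    """Place a file under root/xx/yy/zz/ using the leading hex of its name hash."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    shards = [digest[2 * i:2 * i + 2] for i in range(levels)]
    return str(PurePosixPath(root, *shards, filename))


def inode_count(capacity_bytes, bytes_per_inode):
    """Approximate number of inodes mkfs.ext4 creates for a given -i ratio."""
    return capacity_bytes // bytes_per_inode


print(hashed_path("/data", "invoice-0001.pdf"))
# On 1 TiB, -i 4096 yields four times the inodes of the 16384 default.
print(inode_count(2**40, 16384), inode_count(2**40, 4096))
```

Two hex characters per level give 256 subdirectories at each level, so no single directory grows large even with millions of files.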

Quick Checklist

Verify device type and match scheduler (noop or none for SSD, deadline or mq‑deadline for HDD).

Mount with noatime,nodiratime; add nobarrier only if power‑loss protection exists.

Set kernel dirty‑page parameters appropriate for workload.

Monitor %util, await, avgqu‑sz, avgrq‑sz regularly.

Establish alert thresholds based on storage class (e.g., HDD await >30 ms, SSD await >5 ms).

Run baseline iostat and fio tests before changes.

Document all configuration changes and keep rollback scripts ready.
