
How to Evaluate and Optimize System I/O Performance: Models, Tools, and Best Practices

This article explains how to assess I/O capabilities, choose appropriate evaluation tools, monitor key performance metrics, and apply targeted optimization techniques for both disk and network I/O to improve system throughput and latency.

Efficient Ops

1. Prerequisites for Evaluating I/O Capability

Before assessing a system's I/O capability, you must understand its I/O model. An I/O model describes the mix of read/write ratios, I/O sizes, and other characteristics specific to a workload, which is essential for capacity planning and problem analysis.

(1) I/O Model

In real workloads, I/O characteristics such as read/write ratio and I/O size fluctuate. Therefore, a model is usually built for a particular scenario to guide capacity planning and troubleshooting.

The basic model includes:

IOPS

Bandwidth

I/O size

For disk I/O, additional factors are needed:

Which disks are used

Read‑to‑write ratio

Whether reads are sequential or random

Whether writes are sequential or random
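These basic quantities are tied together: sustained bandwidth is roughly IOPS multiplied by I/O size. A minimal sketch of that arithmetic (the IOPS and size figures below are made-up examples, not vendor numbers):

```python
# Relationship between the three basic model parameters:
#   bandwidth = IOPS x I/O size
# The numbers are illustrative only; real values come from measurement.

def bandwidth_mb_s(iops: float, io_size_kb: float) -> float:
    """Approximate sustained bandwidth in MB/s for a given IOPS and I/O size."""
    return iops * io_size_kb / 1024

# Small random I/O: high IOPS, modest bandwidth.
print(bandwidth_mb_s(iops=16000, io_size_kb=8))    # 125.0 (MB/s)

# Large sequential I/O: low IOPS, high bandwidth.
print(bandwidth_mb_s(iops=500, io_size_kb=1024))   # 500.0 (MB/s)
```

The same array can therefore look "fast" or "slow" depending on which of the three numbers the workload stresses, which is why the model must be stated before capacity figures mean anything.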

(2) Why Refine an I/O Model?

Different models yield different maximum values for IOPS, bandwidth (MB/s), and response time on the same storage or LUN.

When storage vendors quote maximum IOPS, they typically measure small random I/O, which delivers high IOPS while consuming little bandwidth, though per-I/O latency is relatively high. Small sequential I/O pushes IOPS higher still, while large sequential I/O saturates bandwidth but yields low IOPS.

Thus, capacity planning and performance tuning must start with an analysis of the workload’s I/O model.

2. Evaluation Tools

(1) Disk I/O Evaluation Tools

Common tools include Orion, Iometer, dd, xdd, iorate, iozone, and Postmark. Their OS support and use cases differ.

Some tools simulate application workloads; for example, Orion (from Oracle) mimics Oracle database I/O patterns, allowing you to specify read/write ratios, I/O size, and sequential or random behavior.

Orion can also run in an automatic mode that discovers the workload’s I/O model and reports peak IOPS, MB/s, and corresponding response times.

In contrast, dd simply reads/writes files without simulating any application behavior.

Postmark performs file creation, deletion, and read/write operations, suitable for testing small‑file workloads.
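Postmark itself is a standalone benchmark binary; its core idea, timing large numbers of small-file create/write/read/delete operations, can be sketched in a few lines (file counts and sizes here are arbitrary examples, not Postmark defaults):

```python
import os
import tempfile
import time

def small_file_churn(n_files: int = 200, size: int = 4096) -> float:
    """Create, write, read back, and delete n small files; return elapsed seconds."""
    payload = b"x" * size
    start = time.perf_counter()
    with tempfile.TemporaryDirectory() as d:
        paths = [os.path.join(d, f"f{i}") for i in range(n_files)]
        for p in paths:                 # create + write phase
            with open(p, "wb") as f:
                f.write(payload)
        for p in paths:                 # read-back phase
            with open(p, "rb") as f:
                f.read()
        for p in paths:                 # delete phase
            os.remove(p)
    return time.perf_counter() - start

print(f"{small_file_churn():.3f}s for 200 small files")
```

A real small-file benchmark would also randomize operation order and control for the page cache; this sketch only shows the shape of the workload.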

(2) Network I/O Evaluation Tools

Ping – basic, can set packet size.

Iperf, ttcp – measure maximum TCP/UDP bandwidth; in UDP mode, iperf also reports jitter and packet loss.

Windows‑specific tools: NTttcp, LANBench, pcattcp, LAN Speed Test (Lite), NETIO, NetStress.

3. Key Monitoring Metrics and Common Tools

(1) Disk I/O

On Unix/Linux, Nmon and iostat are popular.

Nmon is useful for post‑analysis; iostat provides real‑time data and can be scripted for later analysis.
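As a sketch of that scripting idea, here is a minimal parser for captured Linux-style `iostat -d` output (the sample lines are fabricated for illustration; AIX column names differ):

```python
# Parse device lines from captured `iostat -d` output (Linux sysstat layout:
# Device, tps, kB_read/s, kB_wrtn/s, kB_read, kB_wrtn). The sample text is
# fabricated for illustration.
SAMPLE = """\
Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             120.50      1024.00      2048.00     512000    1024000
sdb              30.25       256.00        64.00     128000      32000
"""

def parse_iostat(text: str) -> dict:
    """Return {device: {"tps": ..., "read_kb_s": ..., "write_kb_s": ...}}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and not fields[0].startswith("Device"):
            stats[fields[0]] = {
                "tps": float(fields[1]),
                "read_kb_s": float(fields[2]),
                "write_kb_s": float(fields[3]),
            }
    return stats

print(parse_iostat(SAMPLE)["sda"]["tps"])  # 120.5
```

Collecting such samples on an interval and storing the parsed values gives the "for later analysis" history that Nmon provides natively.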

IOPS

Total IOPS – Nmon DISK_SUMM Sheet: IO/Sec

Read IOPS per disk – Nmon DISKRIO Sheet

Write IOPS per disk – Nmon DISKWIO Sheet

Command line (iostat): tps (total), rps (read), wps (write)

Bandwidth

Total bandwidth – Nmon DISK_SUMM Sheet: Disk Read KB/s, Disk Write KB/s

Read bandwidth per disk – Nmon DISKREAD Sheet

Write bandwidth per disk – Nmon DISKWRITE Sheet

Command line (iostat): bps (total), bread (read), bwrtn (write)

Response Time

Read latency – iostat -Dl: read avgserv, maxserv

Write latency – iostat -Dl: write avgserv, maxserv

Other

Disk utilization, queue depth, queue full counts, etc.

(2) Network I/O

Bandwidth

Prefer checking traffic directly on network devices; if not possible, monitor on the server.

Nmon – NET Sheet

topas – Network: BPS, B‑In, B‑Out

Response Time

Simple check: ping latency and packet loss.

For more accurate measurement, capture the SYN‑SYNACK handshake timing.

Best practice: use dedicated network probes at switch ports for precise analysis.
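Without a capture device, a rough approximation is the time a TCP connect() takes, since it spans the SYN/SYN-ACK exchange (the demo below connects to a local listener; in practice you would point it at the service under test):

```python
import socket
import time

def tcp_connect_ms(host: str, port: int) -> float:
    """Time a TCP connect(): roughly the SYN/SYN-ACK round trip plus local overhead."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=5):
        pass
    return (time.perf_counter() - start) * 1000

# Demo against a throwaway local listener (loopback, so near-zero latency).
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
host, port = srv.getsockname()
print(f"connect time: {tcp_connect_ms(host, port):.2f} ms")
srv.close()
```

This measures from the client side only; a packet capture at the switch port remains the accurate way to split client, network, and server contributions.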

4. Performance Diagnosis and Optimization

(1) How to Tune Disk I/O Contention?

Typical problem: When I/O contention dominates, what are the main tuning ideas and techniques?

Step 1 – Identify the source of contention. Determine whether excessive I/O originates from the application layer or the system layer.

If the application generates unnecessary reads/writes, address it first. Examples:

Increase the database sort buffer size so sorts complete in memory rather than spilling to disk.

Lower the log level or disable non-essential logging; where supported, apply no-logging options (e.g., Oracle's NOLOGGING) for data that can be regenerated.

Step 2 – Analyze storage side. I/O traverses a long chain: application → memory cache → block device → HBA → driver → network switch → storage front‑end → storage cache → RAID group → disks.

Break the chain into three segments and investigate each:

Host side (application, cache, block layer, HBA, driver)

Network side (switches, links)

Storage side (front‑end, cache, RAID, disks)

Typical host‑side checks include queue depth, driver limits, DMA buffer size, and HBA concurrency.

If the host side is clear, move on to network and storage analysis.

Network side issues may involve bandwidth saturation, mis‑configured or faulty switches, multipathing errors, electromagnetic interference, or faulty fiber/cables.

Storage side problems include disk performance limits, RAID bandwidth caps, insufficient stripe width, thin vs. thick LUN provisioning, cache size/type, QoS limits, snapshot/clone overhead, controller CPU overload, or LUN formatting delays.

(2) Tuning Advice for Low‑Latency Transactions and High‑Speed Trading

Consider business requirements for log persistence and durability to choose an appropriate I/O strategy.

If logging is unnecessary, eliminate write I/O.

For low‑importance logs, write a single copy; for high‑importance logs, use double‑write or multiple copies.
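The durability/latency trade-off shows up directly in whether each log write is forced to stable storage. A sketch of the two extremes (paths and record counts are arbitrary examples):

```python
import os
import tempfile
import time

def write_log(path: str, records: list, durable: bool) -> float:
    """Append records to path; fsync after each one when durable=True.

    Returns elapsed seconds. durable=False leaves data in the page cache,
    which is fast but can be lost on power failure.
    """
    start = time.perf_counter()
    with open(path, "ab") as f:
        for rec in records:
            f.write(rec)
            if durable:
                f.flush()
                os.fsync(f.fileno())  # force to stable storage before continuing
    return time.perf_counter() - start

records = [b"tx-record\n"] * 100
with tempfile.TemporaryDirectory() as d:
    t_buf = write_log(os.path.join(d, "a.log"), records, durable=False)
    t_dur = write_log(os.path.join(d, "b.log"), records, durable=True)
    print(f"buffered={t_buf:.4f}s durable={t_dur:.4f}s")
```

The per-record fsync path is what a high-importance log pays for; batching several records per fsync, or double-writing to two devices asynchronously, are the usual middle grounds.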

Storage‑medium recommendations:

Use all‑SSD storage.

Use SSD as a second‑level cache (first level is RAM).

Implement tiered storage: hot data on SSD/SAS, colder data on slower disks.

Deploy RAMDISK for ultra‑low latency.

Increase cache size on the LUN’s storage server.

Configuration tips:

Select an appropriate RAID level (e.g., RAID10) for the workload.

Ensure stripe depth matches I/O size and provides sufficient concurrency.
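One way to sanity-check that match: a full stripe of data equals the stripe depth times the number of data disks, and an I/O no larger than the stripe depth gains no disk parallelism. The geometry below (RAID10, 8 data disks, 64 KB stripe depth) is an illustrative assumption:

```python
def full_stripe_kb(stripe_depth_kb: int, data_disks: int) -> int:
    """Size of one full stripe of data (parity/mirror disks excluded)."""
    return stripe_depth_kb * data_disks

def disks_touched(io_size_kb: int, stripe_depth_kb: int, data_disks: int) -> int:
    """How many data disks a single aligned I/O spreads across."""
    chunks = -(-io_size_kb // stripe_depth_kb)  # ceiling division
    return min(data_disks, max(1, chunks))

print(full_stripe_kb(64, 8))        # 512 -> a 512 KB write fills one stripe
print(disks_touched(8, 64, 8))      # 1   -> small I/O lands on a single disk
print(disks_touched(1024, 64, 8))   # 8   -> large I/O engages every data disk
```

So a workload of small random I/Os gets its concurrency from many outstanding requests across disks, while large sequential I/O wants a stripe wide enough to engage all spindles per request.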

I/O path considerations:

Prefer high‑speed networking (e.g., Fibre Channel) over slower protocols like iSCSI.

(3) Network I/O Diagnosis Methodology

Similar to disk I/O, network I/O requires segment‑by‑segment analysis. Use packet capture tools to locate latency or loss, then drill down to the problematic segment.

(4) Cases Misidentified as I/O Problems

Case 1 – Oracle buffer‑busy waits dominate total wait time. The AWR report showed “buffer busy waits” as the top event, which can be caused by various underlying waits such as log‑sync. Further analysis revealed index contention from heavy concurrent INSERTs. Partitioning the index eliminated both buffer‑busy waits and index contention.

Case 2 – Intermittent ping latency spikes. A client-server pair showed 100-200 ms ping spikes during peak load; even when idle, latency was 30-40 ms. After exhaustive hardware swaps ruled out the network, packet captures showed the delay originated on the server side, which was a heavily partitioned LPAR. Reducing the number of active LPARs, or assigning dedicated CPUs to critical LPARs, removed the latency spikes.

These examples illustrate that apparent I/O issues may actually stem from database design or OS‑level resource scheduling.

Tags: storage optimization, capacity planning, monitoring tools, network latency, I/O performance
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
