
How Alibaba Cloud’s One‑Click IO Diagnosis Tackles High‑Volume Storage Bottlenecks

The article explains how Alibaba Cloud OS Console’s one‑click IO diagnosis automatically monitors key IO metrics, computes dynamic thresholds, detects anomalies such as high latency or iowait, and provides root‑cause analysis and remediation suggestions to improve cloud storage performance in multi‑tenant environments.

Alibaba Cloud Developer

Background

As cloud workloads grow, AI training data, logs, and multimedia drive exponential growth in IO volume and a corresponding surge in storage requests. In multi-tenant clouds, shared storage resources create IO contention, and hybrid/multi-cloud architectures add further complexity to troubleshooting.

Business Pain Points

Pain Point 1: Lack of diagnostic capability – Users cannot distinguish IO delay from saturation, requiring manual operator intervention.

Pain Point 2: Lost evidence – Traditional monitoring may miss the window when anomalies occur, making it hard to capture detailed diagnostic data.

Pain Point 3: Weak correlation between metrics and diagnosis – Isolated metrics (await, util, tps, bps) do not map directly to specific IO problems, forcing users to manually combine them.

Solution Overview

The Alibaba Cloud OS Console provides a SysOM component that implements a one‑click IO diagnosis workflow: discover → diagnose → root‑cause analysis. It periodically samples IO metrics, identifies anomalies with dynamic thresholds, and triggers specialized sub‑diagnostic tools.

Architecture

The system reads metrics such as await, util, tps, iops, qu-size (IO queue depth), and iowait at a configurable sampling interval (cycle, in milliseconds). When an anomaly is detected, it launches the appropriate diagnostic tool while limiting how frequently each tool can be invoked.
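
As a rough illustration, the loop below sketches this sampling-and-dispatch cycle in Python. The names (read_metrics, dispatch_diagnosis, CYCLE_MS) are placeholders, not the SysOM component's actual API, and the interval value is an example.

```python
import time

CYCLE_MS = 1000   # sampling interval ("cycle ms" in the article); example value
METRICS = ("await", "util", "tps", "iops", "qu-size", "iowait")

def monitor(read_metrics, thresholds, dispatch_diagnosis):
    """read_metrics() -> dict of metric name to latest value;
    thresholds: metric name -> object with current()/update();
    dispatch_diagnosis(name, sample): launches the matching sub-tool."""
    while True:
        sample = read_metrics()
        for name in METRICS:
            if sample[name] > thresholds[name].current():
                # The dispatcher is expected to rate-limit tool invocations
                # (see "Intelligent Diagnosis Controls" below).
                dispatch_diagnosis(name, sample)
            thresholds[name].update(sample[name])   # thresholds stay adaptive
        time.sleep(CYCLE_MS / 1000)
```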

Architecture diagram

IO Metric Monitoring

IO metrics: await, util, tps, iops, qu-size, iowait.

Anomaly detection: an indicator that exceeds its dynamic threshold is flagged as abnormal.

Data cleaning & visualization: diagnostic results are cleaned and visualized together with root-cause findings and remediation suggestions.

Dynamic Threshold Calculation

Basic Threshold

Each metric is treated as a time series that is stable most of the time. A sliding window computes the maximum deviation from the window mean (the instantaneous fluctuation); averaging these fluctuations across windows yields the basic threshold, which adapts continuously as new data arrives.
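
The article describes this computation only qualitatively, so the sketch below is one plausible reading; the window size, the history length, and the exact combination are assumptions.

```python
from collections import deque

class BasicThreshold:
    def __init__(self, window=30, history=100):   # sizes are assumptions
        self.window = deque(maxlen=window)        # sliding window of raw samples
        self.fluctuations = deque(maxlen=history) # one fluctuation per full window

    def update(self, value):
        self.window.append(value)
        if len(self.window) == self.window.maxlen:
            mean = sum(self.window) / len(self.window)
            # "Instantaneous fluctuation": max deviation from the window mean.
            self.fluctuations.append(max(abs(v - mean) for v in self.window))

    def current(self):
        # Basic threshold: average fluctuation over recent windows. Until
        # enough data has accumulated, return +inf so nothing is flagged.
        if not self.fluctuations:
            return float("inf")
        return sum(self.fluctuations) / len(self.fluctuations)
```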

Compensation Threshold

When the basic threshold declines, a compensation value is added to prevent rapid drops and reduce false positives. During stable periods, a “steady‑state compensation” is derived from low‑noise data; otherwise the current maximum basic threshold is used temporarily.
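
Again only the behavior is described; the following sketch captures it under assumptions, with stability detection delegated to the caller and the steady-state compensation precomputed from low-noise data.

```python
def compensated(basic_history, steady_compensation, stable):
    """basic_history: recent basic-threshold values, newest last (assumed input).
    Returns the basic threshold padded against abrupt declines."""
    current = basic_history[-1]
    if len(basic_history) < 2 or current >= basic_history[-2]:
        return current                        # rising or flat: no padding needed
    if stable:
        # Steady-state compensation, derived from low-noise data offline.
        return current + steady_compensation
    # Not yet stable: temporarily hold at the recent maximum basic threshold.
    return max(basic_history)
```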

Minimum Threshold

A static minimum threshold (business‑defined floor) is compared with the dynamic threshold (basic + compensation). The larger of the two becomes the final anomaly threshold, ensuring that only values exceeding both business tolerance and normal fluctuation are flagged.
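
In code form, the final threshold reduces to a single comparison:

```python
def final_threshold(static_min, basic, compensation):
    # The larger of the business-defined floor and the dynamic threshold.
    return max(static_min, basic + compensation)
```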

Basic threshold curve

Compensation threshold

Anomaly Identification

Set a warning line as the maximum of the minimum static threshold and the dynamic threshold.

If a metric exceeds this line and diagnostic conditions are met, trigger immediate diagnosis.

The system continuously learns, updating the dynamic threshold with fresh data.
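
Tying the pieces together, a per-metric check might look like the sketch below. The "diagnostic conditions" are represented by a caller-supplied predicate, since the article does not enumerate them, and threshold_state is assumed to expose the dynamic (basic + compensation) value.

```python
def check(value, threshold_state, static_min, conditions_met):
    # Warning line: max of the static floor and the dynamic threshold.
    warning_line = max(static_min, threshold_state.current())
    anomalous = value > warning_line and conditions_met()
    threshold_state.update(value)   # continuous learning from fresh data
    return anomalous
```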

Intelligent Diagnosis Controls

triggerInterval (diagnosis cool-down period): the minimum interval between two diagnoses.

reportInterval (abnormal-accumulation counter): if zero, a single anomaly after the cool-down triggers a diagnosis; otherwise, a certain number of anomalies must accumulate within a time window.
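
A minimal sketch of these two controls, assuming a seconds-based clock and a fixed accumulation window (the window length is not specified in the article):

```python
import time

class DiagnosisGate:
    def __init__(self, trigger_interval, report_interval, window_s=60):
        self.trigger_interval = trigger_interval  # min seconds between diagnoses
        self.report_interval = report_interval    # anomalies required (0 = one)
        self.window_s = window_s                  # accumulation window (assumed)
        self.last_diagnosis = 0.0
        self.anomaly_times = []

    def should_diagnose(self, now=None):
        now = now if now is not None else time.monotonic()
        if now - self.last_diagnosis < self.trigger_interval:
            return False                          # still in the cool-down period
        # Keep only anomalies inside the accumulation window, then add this one.
        self.anomaly_times = [t for t in self.anomaly_times
                              if now - t <= self.window_s]
        self.anomaly_times.append(now)
        required = 1 if self.report_interval == 0 else self.report_interval
        if len(self.anomaly_times) >= required:
            self.last_diagnosis = now
            self.anomaly_times.clear()
            return True
        return False
```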

Root‑Cause Analysis

IO Burst: reports the process contributing the most IO during the burst, including cases where buffered writes are flushed asynchronously by kworker threads and must be attributed back to the originating process.

High IO latency: shows latency distribution and the path with the highest delay.

High iowait: identifies the waiting process and the cause.

Case Studies

High iowait

The tool pinpointed a process waiting on disk IO, revealing excessive dirty pages. Recommendations included lowering dirty_ratio and dirty_background_ratio to reduce write‑back pressure.
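
Before applying such tuning, the current write-back pressure can be inspected directly; the sketch below reads the live dirty-page count and the relevant vm.* settings on Linux. Target values depend on the workload and are not prescribed by the article.

```python
from pathlib import Path

def proc_int(path):
    # Read a single integer from a /proc file.
    return int(Path(path).read_text().split()[0])

def vmstat_value(key):
    # /proc/vmstat is a list of "name value" pairs.
    for line in Path("/proc/vmstat").read_text().splitlines():
        name, value = line.split()
        if name == key:
            return int(value)
    raise KeyError(key)

print("dirty pages (nr_dirty)    =", vmstat_value("nr_dirty"))
print("vm.dirty_ratio            =", proc_int("/proc/sys/vm/dirty_ratio"))
print("vm.dirty_background_ratio =", proc_int("/proc/sys/vm/dirty_background_ratio"))
# Lowering the two ratios (e.g. via sysctl) makes the kernel start write-back
# earlier and in smaller batches, reducing iowait spikes from bulk flushing.
```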

iowait case

High IO latency

A user observed high write latency; the one‑click diagnosis identified the DiskBlockWrite process as the source, with latency concentrated on dirty‑page flushing. Suggested actions were to reduce buffer writes or adjust dirty_ratio and dirty_bytes.

latency case