How Alibaba Cloud’s One‑Click IO Diagnosis Tackles High‑Volume Storage Bottlenecks
The article explains how Alibaba Cloud OS Console’s one‑click IO diagnosis automatically monitors key IO metrics, computes dynamic thresholds, detects anomalies such as high latency or iowait, and provides root‑cause analysis and remediation suggestions to improve cloud storage performance in multi‑tenant environments.
Background
As cloud workloads grow, AI training data, logs, and multimedia content drive exponential growth in IO volume and a matching surge in storage requests. In multi‑tenant clouds, tenants contend for shared storage resources, and hybrid or multi‑cloud architectures make troubleshooting more complex.
Business Pain Points
Pain Point 1: Lack of diagnostic capability – Users cannot distinguish IO delay from saturation, requiring manual operator intervention.
Pain Point 2: Lost evidence – Traditional monitoring may miss the window when anomalies occur, making it hard to capture detailed diagnostic data.
Pain Point 3: Weak correlation between metrics and diagnosis – Isolated metrics (await, util, tps, bps) do not map directly to specific IO problems, forcing users to manually combine them.
Solution Overview
The Alibaba Cloud OS Console provides a SysOM component that implements a one‑click IO diagnosis workflow: discover → diagnose → root‑cause analysis. It periodically samples IO metrics, identifies anomalies with dynamic thresholds, and triggers specialized sub‑diagnostic tools.
Architecture
The system reads metrics such as await, util, tps, iops, qu‑size, and iowait at a configurable interval (cycle, in milliseconds). When an anomaly is detected, it launches the matching diagnostic tool and limits how often that tool can be invoked.
IO Metric Monitoring
IO metrics: await, util, tps, iops, qu‑size, iowait.
Anomaly detection: An indicator exceeding its dynamic threshold is flagged as abnormal.
Data cleaning & visualization: Diagnostic results are visualized with root‑cause and remediation suggestions.
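To make the metrics above concrete, here is a minimal sketch of how await, util, tps, and qu‑size can be derived from two samples of Linux block‑device counters. It assumes the counters come from /proc/diskstats, as tools like iostat do; the article does not say which source SysOM reads, and the function names are hypothetical.

```python
def parse_diskstats_line(line):
    """Parse one /proc/diskstats line into the cumulative counters used below."""
    f = line.split()
    return {
        "name": f[2],
        "reads": int(f[3]),         # reads completed
        "read_ms": int(f[6]),       # time spent reading (ms)
        "writes": int(f[7]),        # writes completed
        "write_ms": int(f[10]),     # time spent writing (ms)
        "io_ms": int(f[12]),        # time the device was busy (ms)
        "weighted_ms": int(f[13]),  # weighted time spent on IO (ms)
    }


def derive_metrics(prev, curr, interval_ms):
    """Turn two cumulative samples taken interval_ms apart into rate metrics."""
    ios = (curr["reads"] - prev["reads"]) + (curr["writes"] - prev["writes"])
    wait_ms = (curr["read_ms"] - prev["read_ms"]) + (curr["write_ms"] - prev["write_ms"])
    busy_ms = curr["io_ms"] - prev["io_ms"]
    queued_ms = curr["weighted_ms"] - prev["weighted_ms"]
    return {
        "await": wait_ms / ios if ios else 0.0,  # average ms per completed IO
        "util": 100.0 * busy_ms / interval_ms,   # percent of the interval the device was busy
        "tps": ios * 1000.0 / interval_ms,       # completed IOs per second
        "qu_size": queued_ms / interval_ms,      # average queue length
    }
```

Because the counters are cumulative, each metric is computed from the difference between two consecutive samples, which is why the sampling interval matters.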
Dynamic Threshold Calculation
Basic Threshold
Each metric is treated as a time series that is stable most of the time. Within each sliding window, the maximum deviation from the window mean is taken as the instantaneous fluctuation. The basic threshold is the average of these fluctuations across windows, so it adapts continuously as new data arrives.
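The basic threshold described above can be sketched as follows. This is one reading of the description, not the SysOM implementation: the window size, the use of absolute deviation, and the function names are assumptions made for illustration.

```python
from statistics import mean


def window_fluctuation(window):
    """Instantaneous fluctuation: largest absolute deviation from the window mean."""
    m = mean(window)
    return max(abs(x - m) for x in window)


def basic_threshold(samples, window_size=5):
    """Average the per-window fluctuation over every full sliding window."""
    if len(samples) < window_size:
        raise ValueError("need at least one full window of samples")
    fluctuations = [
        window_fluctuation(samples[i:i + window_size])
        for i in range(len(samples) - window_size + 1)
    ]
    return mean(fluctuations)
```

For a flat series such as [10, 10, 10, 10, 10] the result is 0, while a single spike in [0, 0, 0, 0, 10] gives 8 (the window mean is 2, and the spike deviates from it by 8).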
Compensation Threshold
When the basic threshold declines, a compensation value is added to prevent rapid drops and reduce false positives. During stable periods, a “steady‑state compensation” is derived from low‑noise data; otherwise the current maximum basic threshold is used temporarily.
Minimum Threshold
A static minimum threshold (business‑defined floor) is compared with the dynamic threshold (basic + compensation). The larger of the two becomes the final anomaly threshold, ensuring that only values exceeding both business tolerance and normal fluctuation are flagged.
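As a small worked sketch of how the three pieces combine, the function below takes the basic threshold, the compensation value, and the static floor as plain numbers. How SysOM actually computes the compensation value is not specified here, so it is simply passed in; the function names are hypothetical.

```python
def final_threshold(basic, compensation, static_min):
    """Final anomaly threshold: the larger of the static floor and the dynamic threshold."""
    dynamic = basic + compensation
    return max(static_min, dynamic)


def is_anomalous(value, basic, compensation, static_min):
    """A value is flagged only if it exceeds both the floor and the dynamic threshold."""
    return value > final_threshold(basic, compensation, static_min)
```

With a static floor of 10, a basic threshold of 3, and a compensation of 2, the dynamic threshold is 5 and the floor wins, so the final threshold is 10. If the basic threshold rises to 12, the dynamic threshold of 14 takes over.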
Anomaly Identification
Set the warning line to the larger of the static minimum threshold and the dynamic threshold.
If a metric exceeds this line and diagnostic conditions are met, trigger immediate diagnosis.
The system continuously learns, updating the dynamic threshold with fresh data.
Intelligent Diagnosis Controls
Diagnosis cool‑down period (triggerInterval): the minimum interval between two consecutive diagnoses.
Anomaly accumulation counter (reportInterval): if zero, a single anomaly after the cool‑down triggers diagnosis; otherwise, a certain number of anomalies must accumulate within a time window before diagnosis is triggered.
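The two controls above can be sketched as a small gate that decides whether an anomaly should start a diagnosis. This is an illustrative sketch, not the SysOM code: it assumes reportInterval is the length of the accumulation window, and the required anomaly count (min_anomalies) is a hypothetical parameter because the article does not give the number. Time is passed in explicitly, in seconds, to keep the logic easy to test.

```python
from collections import deque


class DiagnosisGate:
    """Decide whether an anomaly should trigger a diagnosis."""

    def __init__(self, trigger_interval, report_interval=0, min_anomalies=3):
        self.trigger_interval = trigger_interval  # cool-down between diagnoses
        self.report_interval = report_interval    # accumulation window; 0 disables it
        self.min_anomalies = min_anomalies        # hypothetical: anomalies needed in the window
        self.last_diagnosis = None
        self.recent = deque()                     # timestamps of recent anomalies

    def on_anomaly(self, now):
        """Return True if a diagnosis should start for an anomaly seen at `now`."""
        in_cooldown = (
            self.last_diagnosis is not None
            and now - self.last_diagnosis < self.trigger_interval
        )
        if self.report_interval > 0:
            # Keep only anomalies that fall inside the accumulation window.
            self.recent.append(now)
            while self.recent and now - self.recent[0] > self.report_interval:
                self.recent.popleft()
            enough = len(self.recent) >= self.min_anomalies
        else:
            enough = True
        if in_cooldown or not enough:
            return False
        self.last_diagnosis = now
        self.recent.clear()
        return True
```

With trigger_interval=60 and report_interval=0, anomalies at t=0, 30, and 60 yield True, False, True: the second falls inside the cool‑down. Requiring several anomalies within a window trades a slightly later diagnosis for fewer diagnoses triggered by one‑off spikes.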
Root‑Cause Analysis
IO Burst: reports the process contributing the most IO during the burst, including cases where buffered writes are flushed to disk by kworker threads.
High IO latency: shows latency distribution and the path with the highest delay.
High iowait: identifies the waiting process and the cause.
Case Studies
High iowait
The tool pinpointed a process waiting on disk IO, revealing excessive dirty pages. Recommendations included lowering dirty_ratio and dirty_background_ratio to reduce write‑back pressure.
High IO latency
A user observed high write latency; the one‑click diagnosis identified the DiskBlockWrite process as the source, with latency concentrated on dirty‑page flushing. Suggested actions were to reduce buffered writes or to tune dirty_ratio or dirty_bytes (the kernel applies only one of the two at a time; writing one resets the other to 0).
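Before changing the write‑back settings mentioned in these cases, it can help to record their current values. The sketch below reads them from /proc/sys/vm, which is where Linux exposes these sysctls; the helper name is hypothetical, and the directory is a parameter so the function can be pointed at a test directory.

```python
from pathlib import Path

# Write-back knobs referenced in the case studies, plus their byte-based counterparts.
DIRTY_KNOBS = (
    "dirty_ratio",
    "dirty_background_ratio",
    "dirty_bytes",
    "dirty_background_bytes",
)


def read_dirty_settings(vm_dir="/proc/sys/vm"):
    """Return the current value of each write-back knob found under vm_dir."""
    settings = {}
    for knob in DIRTY_KNOBS:
        path = Path(vm_dir) / knob
        if path.exists():
            settings[knob] = int(path.read_text().strip())
    return settings
```

A value of 0 for dirty_bytes means the ratio‑based setting is in effect, and the reverse holds for dirty_ratio.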
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
