How Alibaba Cloud’s One‑Click IO Diagnosis Solves Multi‑Tenant Performance Bottlenecks
The article explains how Alibaba Cloud’s OS console implements a one‑click IO diagnostic that automatically detects, classifies, and resolves high‑latency, burst, and iowait IO issues in multi‑tenant cloud environments by using dynamic thresholds, periodic metric collection, and targeted root‑cause analysis.
Background
As cloud workloads grow, AI training data, logs, and media drive rapid growth in storage I/O and a sharp rise in I/O request volume. In multi‑tenant clouds, shared storage resources create contention between tenants, and hybrid‑cloud architectures make I/O troubleshooting harder still. Alibaba Cloud’s OS console therefore introduced a one‑click I/O diagnostic covering detection, root‑cause analysis, and remediation.
Key Pain Points
Pain Point 1 – Lack of Issue Type Identification
Users cannot tell whether a problem is an I/O latency spike, saturated disk bandwidth, or high iowait, so they fall back on operations staff, which increases labor costs. The one‑click tool targets the most frequent issue types, such as high latency, traffic anomalies, and high iowait, and classifies them automatically.
Pain Point 2 – Lost Evidence and Hard-to‑Capture Traces
Traditional monitoring only alerts after a metric spikes, by which time the window for capturing detailed diagnostics may already have closed, making fine‑grained evidence hard to obtain. Rapid detection and an immediate response are needed to seize that diagnostic window.
Pain Point 3 – Fragmented Metrics and Weak Correlation
Metrics live in silos: correlating util, await, iops, and bps requires manual mapping and prior knowledge of the diagnostic tooling. The one‑click solution abstracts this complexity away and delivers a direct analysis report.
Solution Overview
The OS console’s SysOM component already supports diagnosing high latency, traffic anomalies, and high iowait. The one‑click diagnostic periodically reads I/O metrics, detects anomalies via dynamic thresholds, and triggers specialized sub‑diagnostic tools, forming a closed loop of “detect → diagnose → root‑cause”.
Architecture
Dynamic thresholds adapt to changing business patterns, avoiding the false positives of static thresholds. The architecture diagram illustrates the pipeline from metric collection through threshold calculation and anomaly detection to reporting.
Implementation Details
Metric Collection
When triggered, the system samples await, util, tps, iops, bps, qusize, and iowait at a configurable interval (the cycle parameter, in milliseconds).
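A minimal sketch, not part of SysOM, of how such a collection cycle could be built on Linux by diffing /proc/diskstats over one cycle; the device name vda and the cycle_ms default are illustrative assumptions.

```python
# Illustrative sampler: diff /proc/diskstats over one cycle to derive iops, bps, await, util.
# Field positions follow the kernel's Documentation/admin-guide/iostats.rst.
import time

SECTOR_BYTES = 512

def read_diskstats(device):
    """Return (completed IOs, sectors transferred, read+write ms, busy ms) for one device."""
    with open("/proc/diskstats") as f:
        for line in f:
            p = line.split()
            if p[2] == device:
                ios = int(p[3]) + int(p[7])         # reads + writes completed
                sectors = int(p[5]) + int(p[9])     # sectors read + written
                rw_ticks = int(p[6]) + int(p[10])   # ms spent on reads + writes
                io_ticks = int(p[12])               # ms the device was busy
                return ios, sectors, rw_ticks, io_ticks
    raise ValueError(f"device {device} not found")

def sample(device, cycle_ms=1000):
    """Sample one collection cycle and return the derived metrics."""
    a = read_diskstats(device)
    time.sleep(cycle_ms / 1000.0)
    b = read_diskstats(device)
    d_ios = b[0] - a[0]
    secs = cycle_ms / 1000.0
    return {
        "iops": d_ios / secs,
        "bps": (b[1] - a[1]) * SECTOR_BYTES / secs,
        "await": (b[2] - a[2]) / d_ios if d_ios else 0.0,    # avg ms per completed IO
        "util": min(100.0, 100.0 * (b[3] - a[3]) / cycle_ms),
    }

if __name__ == "__main__":
    print(sample("vda"))
```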
Dynamic Threshold Calculation
Three steps produce the final threshold (a sketch follows the list):
Basic threshold: within a sliding window, take the maximum deviation from the window mean as the instantaneous fluctuation, then average recent fluctuations to obtain a self‑adjusting baseline.
Compensation threshold: adds a buffer on top of the baseline, sized from the observed stable range, to keep the threshold from dropping too fast and raising false alarms.
Minimum threshold: a static lower bound; the final anomaly threshold is the greater of this static minimum and the dynamic (basic + compensation) value.
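A minimal sketch of this three‑step calculation, assuming a fixed‑size sliding window and a compensation buffer proportional to the observed range (the comp_ratio factor is an assumption, not a documented parameter):

```python
# Sketch of the dynamic threshold: basic (mean + averaged fluctuation),
# compensation (buffer from the stable range), and a static minimum floor.
from collections import deque

class DynamicThreshold:
    def __init__(self, window=30, min_threshold=0.0, comp_ratio=0.2):
        self.samples = deque(maxlen=window)        # sliding window of recent values
        self.fluctuations = deque(maxlen=window)   # recent instantaneous fluctuations
        self.min_threshold = min_threshold         # static lower bound
        self.comp_ratio = comp_ratio               # assumed size of the compensation buffer

    def update(self, value):
        self.samples.append(value)
        mean = sum(self.samples) / len(self.samples)
        # instantaneous fluctuation: largest deviation from the window mean
        self.fluctuations.append(max(abs(v - mean) for v in self.samples))
        # basic threshold: baseline that self-adjusts with the averaged fluctuations
        basic = mean + sum(self.fluctuations) / len(self.fluctuations)
        # compensation threshold: buffer derived from the observed stable range
        compensation = self.comp_ratio * (max(self.samples) - min(self.samples))
        # final threshold: never below the static minimum
        return max(self.min_threshold, basic + compensation)
```

Feeding each new await or bps sample through update() yields the warning line used in the next step.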
Anomaly Detection
The warning line for each metric is the maximum of the static minimum threshold and the dynamic (basic + compensation) threshold; when a sampled value crosses this line, the metric is flagged as anomalous and diagnosis is triggered.
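A small self‑contained check of that rule, with made‑up numbers for illustration:

```python
# Warning line per metric = max(static minimum, dynamic threshold); values above it are anomalies.
def detect_anomalies(samples, dynamic_thresholds, static_minimums):
    anomalies = {}
    for metric, value in samples.items():
        warning_line = max(static_minimums.get(metric, 0.0),
                           dynamic_thresholds.get(metric, float("inf")))
        if value > warning_line:
            anomalies[metric] = (value, warning_line)
    return anomalies

# Example: an 80 ms await against a 20 ms static floor and a 55 ms dynamic threshold is flagged,
# while util stays below its warning line.
print(detect_anomalies({"await": 80.0, "util": 40.0},
                       {"await": 55.0, "util": 90.0},
                       {"await": 20.0, "util": 50.0}))
```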
Smart Diagnosis
To avoid excessive diagnosis runs, two parameters control the trigger frequency (see the sketch after this list):
triggerInterval: the minimum interval between consecutive diagnoses (a cool‑down period).
reportInterval: when non‑zero, the number of anomalies that must accumulate within a window before a diagnosis is triggered.
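A sketch of how these two parameters could gate diagnosis runs; only the parameter names come from the article, the class and the default values are assumptions:

```python
# Cool-down (triggerInterval) plus anomaly counting (reportInterval) before running a diagnosis.
import time

class DiagnosisTrigger:
    def __init__(self, trigger_interval_s=300, report_interval=3):
        self.trigger_interval_s = trigger_interval_s  # minimum gap between diagnoses
        self.report_interval = report_interval        # anomalies required when non-zero
        self.anomaly_count = 0
        self.last_run = float("-inf")

    def on_anomaly(self):
        """Return True when a full diagnosis should run for this anomaly."""
        now = time.monotonic()
        if now - self.last_run < self.trigger_interval_s:
            return False                              # still cooling down
        self.anomaly_count += 1
        if self.report_interval and self.anomaly_count < self.report_interval:
            return False                              # not enough anomalies accumulated yet
        self.anomaly_count = 0
        self.last_run = now
        return True
```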
Root‑Cause Analysis
The diagnostic tool extracts the relevant logs and correlates them with the offending process (a sketch of the per‑process attribution follows the list). Example outputs include:
For an IO burst, the process contributing the most IO and any kworker buffer‑flush (writeback) activity.
For high latency, the latency distribution and the stage of the I/O path with the greatest delay.
For high iowait, the waiting process and the cause of its wait.
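As a rough illustration of the attribution step (not SysOM's actual implementation), a process‑level view of I/O can be taken from /proc/<pid>/io; the counters are cumulative since process start, so a real tool would diff two samples across the anomaly window:

```python
# Rank processes by cumulative bytes actually read from / written to storage.
import os

def top_io_processes(n=5):
    usage = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/io") as f:
                stats = dict(line.split(": ") for line in f.read().splitlines())
            with open(f"/proc/{pid}/comm") as f:
                comm = f.read().strip()
            usage.append((int(stats["read_bytes"]) + int(stats["write_bytes"]), pid, comm))
        except (FileNotFoundError, PermissionError, ProcessLookupError, KeyError):
            continue   # process exited, is inaccessible, or exposes no IO counters
    return sorted(usage, reverse=True)[:n]

if __name__ == "__main__":
    for total, pid, comm in top_io_processes():
        print(f"{comm}({pid}): {total} bytes")
```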
Case Studies
High iowait
The tool identified a task_server process waiting on disk IO due to excessive dirty pages. Recommendations included lowering dirty_ratio and dirty_bytes to reduce write‑back pressure.
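For context on that recommendation, the current writeback settings can be read from /proc/sys/vm before any change; the helper below is only an illustration, and suitable target values depend on the workload:

```python
# Inspect the dirty-page writeback knobs mentioned above before tuning them.
def read_vm_setting(name):
    with open(f"/proc/sys/vm/{name}") as f:
        return int(f.read().strip())

for name in ("dirty_ratio", "dirty_background_ratio", "dirty_bytes", "dirty_background_bytes"):
    print(name, "=", read_vm_setting(name))

# Applying a lower value is a sysctl change, e.g. sysctl -w vm.dirty_ratio=10,
# made only after confirming the workload tolerates more frequent writeback.
```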
High IO latency
Monitoring revealed elevated write‑traffic latency. The diagnostic pinpointed the DiskBlockWrite process and suggested reducing buffered I/O or tuning dirty_ratio and dirty_background_ratio to alleviate the delay.