How Alibaba Cloud’s One‑Click I/O Diagnosis Tackles Cloud‑Native I/O Bottlenecks
This article explains how Alibaba Cloud CloudMonitor 2.0 integrates SysOM intelligent diagnosis to automatically detect, analyze, and remediate I/O anomalies in multi‑tenant cloud environments, detailing the architecture, dynamic threshold algorithm, anomaly‑trigger logic, and real‑world case studies.
Background
The rapid growth of AI training data, logs, and media in cloud environments drives a sharp increase in I/O request rates. In multi‑tenant, hybrid‑cloud, or multi‑cloud deployments, concurrent access to shared storage creates I/O contention and performance bottlenecks, while the diversity of storage stacks makes fault localization difficult.
Key Technical Challenges
Ambiguous I/O anomaly types – Users cannot easily differentiate latency spikes, throughput saturation or other I/O issues, requiring expert intervention.
Insufficient real‑time evidence – Traditional monitoring captures only generic OS metrics; by the time an alarm fires, the transient condition that caused it may already have passed.
Disconnected metric‑to‑diagnosis mapping – Isolated metrics must be manually correlated with diagnostic tools, increasing effort and error.
Solution Overview
Alibaba Cloud CloudMonitor 2.0 together with the SysOM intelligent diagnosis module implements a “detect‑analyze‑remediate” workflow for common I/O abnormal scenarios. The system follows a “monitor‑first, on‑demand capture” model: during a user‑specified time window it periodically reads I/O metrics, triggers a sub‑diagnostic tool when a metric exceeds a dynamic threshold, and generates a structured diagnostic report.
Architecture
The workflow consists of four stages:
Metric collection – At a configurable cycle (in milliseconds) the system reads key I/O metrics such as await, util, tps, iops, queue size (qu‑size) and iowait; a minimal collection sketch follows this list.
Dynamic‑threshold anomaly detection – Collected values are compared against a three‑layer threshold (base, compensation, static minimum). An anomaly is flagged when the value exceeds the larger of the dynamic (base + compensation) and static thresholds.
Automatic diagnostic trigger – The system selects the appropriate sub‑diagnostic tool based on the metric type, applies frequency‑control parameters, and executes the analysis.
Result aggregation – Diagnostic output is summarized, visualized and presented with root‑cause insights and remediation suggestions.
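To make the metric‑collection stage concrete, the sketch below derives iops, util, and await for a single block device from two /proc/diskstats samples. It is only an illustration of how such metrics can be obtained on Linux, not CloudMonitor's actual collector, and the device name vda is a placeholder.
# Sketch: per-device I/O metrics from /proc/diskstats (illustrative, not the product's collector)
import time

def read_diskstats(device):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return {
                    "ios": int(fields[3]) + int(fields[7]),      # completed reads + writes
                    "ticks": int(fields[6]) + int(fields[10]),   # ms spent reading + writing
                    "io_ticks": int(fields[12]),                 # ms the device was busy
                }
    raise ValueError(f"device {device!r} not found")

def sample_io_metrics(device, interval=1.0):
    before = read_diskstats(device)
    time.sleep(interval)
    after = read_diskstats(device)
    ios = after["ios"] - before["ios"]
    return {
        "iops": ios / interval,
        "util": 100.0 * (after["io_ticks"] - before["io_ticks"]) / (interval * 1000),
        "await": (after["ticks"] - before["ticks"]) / ios if ios else 0.0,  # ms per I/O
    }

print(sample_io_metrics("vda"))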
Dynamic Threshold Mechanism
The threshold consists of three components:
Base threshold – A sliding‑window algorithm computes the maximum deviation of each data point from the window’s average (instantaneous fluctuation). The average of these fluctuations over consecutive windows forms an adaptive baseline.
Compensation threshold – Added to the base threshold to smooth rapid declines during quiet periods, preventing false alarms caused by normal noise.
Minimum static threshold – A business‑defined lower bound. The final alarm threshold is the greater of (base + compensation) and this static value.
This three‑layer design enables detection of short‑lived spikes while keeping false‑positive rates low.
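A minimal sketch of how such a threshold could be computed is shown below. It assumes the base threshold is the recent window mean plus the average per‑window peak deviation, and models the compensation term as a fixed fraction of the base; the article does not spell out these details, so treat the structure and numbers as illustrative rather than the actual algorithm.
# Sketch: three-layer dynamic threshold (illustrative assumptions, see lead-in above)
from collections import deque
from statistics import mean

class DynamicThreshold:
    def __init__(self, window=10, n_windows=6, static_min=0.0, compensation_ratio=0.2):
        self.points = deque(maxlen=window)            # sliding window of recent samples
        self.window = window
        self.fluctuations = deque(maxlen=n_windows)   # per-window max deviations
        self.static_min = static_min                  # business-defined lower bound
        self.compensation_ratio = compensation_ratio

    def update(self, value):
        self.points.append(value)
        if len(self.points) == self.window:
            avg = mean(self.points)
            # instantaneous fluctuation: largest deviation from the window average
            self.fluctuations.append(max(abs(p - avg) for p in self.points))

    def threshold(self):
        if not self.points or not self.fluctuations:
            return self.static_min
        base = mean(self.points) + mean(self.fluctuations)   # adaptive baseline
        compensation = self.compensation_ratio * base        # smooths quiet-period dips
        return max(base + compensation, self.static_min)     # final alarm threshold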
Implementation Details
During each cycle the system performs:
# Pseudocode: main sampling loop (helper routines are sketched elsewhere in this article)
import time

while within_diagnosis_window():                 # user-specified diagnosis window
    metrics = read_io_metrics()                  # await, util, tps, iops, qu-size, iowait
    for m in metrics:
        if m.value > compute_dynamic_threshold(m):   # three-layer threshold
            if can_trigger_diagnosis(m):             # frequency control, see below
                run_subdiagnostic(m)                 # launch the matching sub-diagnostic tool
    time.sleep(cycle_ms / 1000)                  # cycle is configured in milliseconds
Frequency Control
triggerInterval – Minimum interval (seconds) between two diagnoses of the same type to avoid repeated scans.
reportInterval – Number of anomaly occurrences required after the cool‑down period before a diagnosis is launched. When set to 0, any anomaly after the cool‑down triggers immediate diagnosis.
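The sketch below shows one way can_trigger_diagnosis() from the loop above could enforce these two parameters. The state layout, defaults, and the m.diagnosis_type attribute are assumptions for illustration, not the product's implementation.
# Sketch: frequency control per diagnosis type (illustrative)
import time

last_run = {}        # diagnosis type -> timestamp of the last launched diagnosis
anomaly_count = {}   # diagnosis type -> anomalies observed since the cool-down expired

def can_trigger_diagnosis(m, trigger_interval=300, report_interval=3):
    kind = m.diagnosis_type
    now = time.time()
    if now - last_run.get(kind, 0) < trigger_interval:     # still within triggerInterval
        return False
    anomaly_count[kind] = anomaly_count.get(kind, 0) + 1
    if anomaly_count[kind] >= max(report_interval, 1):     # reportInterval = 0 => first anomaly fires
        last_run[kind] = now
        anomaly_count[kind] = 0
        return True
    return False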
Root‑Cause Analysis
After data capture, the system automatically extracts structured insights:
Identify processes that contribute the most I/O (I/O burst contributors); see the sketch after this list.
Highlight paths or devices with the highest latency.
Pinpoint processes and reasons for elevated iowait (e.g., disk saturation, slow dirty‑page flushing).
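As an illustration of the first insight above, the snippet below samples /proc/<pid>/io twice and ranks processes by bytes actually read from or written to storage. It requires sufficient privileges and is a rough approximation, not SysOM's analysis.
# Sketch: rank processes by storage bytes over a short interval (illustrative)
import os, time

def storage_bytes(pid):
    total = 0
    try:
        with open(f"/proc/{pid}/io") as f:
            for line in f:
                key, _, value = line.partition(":")
                if key in ("read_bytes", "write_bytes"):   # bytes that hit the storage layer
                    total += int(value)
    except (FileNotFoundError, PermissionError, ProcessLookupError):
        pass                                                # process exited or not readable
    return total

def top_io_contributors(interval=1.0, top_n=5):
    pids = [p for p in os.listdir("/proc") if p.isdigit()]
    before = {p: storage_bytes(p) for p in pids}
    time.sleep(interval)
    deltas = {p: storage_bytes(p) - before[p] for p in pids}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(top_io_contributors())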
Case Studies
High iowait
A customer observed overall response slowdown. The diagnostic report identified the task_server process waiting on disk I/O and recommended lowering dirty_ratio and dirty_bytes to reduce write‑back pressure.
High I/O latency
Another case showed sustained write‑latency spikes. The analysis pinpointed DiskBlockWrite as the dominant load and suggested adjusting dirty_ratio and dirty_background_ratio to control dirty‑page flushing, thereby reducing latency.
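Both cases point at the standard Linux dirty‑page writeback knobs. As a starting point for your own investigation rather than a prescription, the snippet below reads the current values; changing them (for example via sysctl) requires root privileges and should be validated against your workload.
# Read current dirty-page writeback settings (inspection only)
def read_vm_setting(name):
    with open(f"/proc/sys/vm/{name}") as f:
        return int(f.read())

for knob in ("dirty_ratio", "dirty_background_ratio", "dirty_bytes"):
    print(knob, "=", read_vm_setting(knob))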
References
IO One‑Click Diagnosis: https://help.aliyun.com/zh/cms/cloudmonitor-2-0/io-key-diagnosis
SysOM System Diagnosis: https://cmsnext.console.aliyun.com/next/region/cn-shanghai/workspace/default-cms-1808078950770264-cn-shanghai/app/host/host-sysom