
How Alibaba Cloud’s One‑Click I/O Diagnosis Detects and Resolves Storage Anomalies

This article explains how Alibaba Cloud CloudMonitor 2.0 integrates SysOM intelligent diagnostics to automatically detect, analyze, and remediate I/O performance issues in multi‑tenant, hybrid‑cloud environments by using dynamic thresholds, a monitor‑first on‑demand capture architecture, and automated root‑cause reporting.


Background

Rapid growth of AI training data, logs and media in cloud environments leads to a massive increase in I/O request rates. In shared multi‑tenant storage, high concurrency causes resource contention, and heterogeneous hybrid‑cloud deployments make fault localization difficult due to inconsistent storage policies.

Pain Points

Unclear I/O anomaly type: Users cannot easily tell whether they are facing high latency, throughput saturation, or another class of I/O issue, forcing reliance on experts.

Insufficient evidence at failure time: Traditional monitoring only captures OS‑level metrics; by the time an alarm fires, the root cause may have already passed.

Disconnected metric‑diagnosis mapping: Dashboards show isolated metrics without direct correlation to specific I/O fault categories, requiring manual cross‑referencing.

Solution Overview

Alibaba Cloud CloudMonitor 2.0 combined with SysOM intelligent diagnosis implements a “detect‑analyze‑remediate” workflow for common I/O anomaly scenarios. The service continuously captures key I/O metrics, applies dynamic thresholds, triggers targeted sub‑diagnostic tools, and returns a consolidated report with root‑cause analysis and actionable recommendations.

Architecture

The system follows a “monitor‑first, on‑demand capture” model. During a user‑specified diagnostic window, CloudMonitor periodically reads I/O indicators (await, util, tps, iops, qu‑size, iowait). When a metric exceeds its dynamic alarm line, a corresponding diagnostic tool is invoked, forming a closed loop from detection to root‑cause identification.
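
The loop below sketches this monitor‑first, on‑demand capture behavior. It is an illustrative assumption, not CloudMonitor's actual implementation; collect_io_metrics, dynamic_alarm_line, and run_diagnosis are hypothetical stand‑ins for the service's internal components.

```python
import random
import time

METRICS = ("await", "util", "iops", "iowait")

def collect_io_metrics(device):
    """Stand-in for the agent's metric reader; returns one sample per metric."""
    return {m: random.uniform(0, 100) for m in METRICS}

def dynamic_alarm_line(device, metric):
    """Stand-in for the adaptive threshold described below; fixed here for brevity."""
    return 80.0

def run_diagnosis(device, metric, sample):
    """Stand-in for invoking the matching sub-diagnostic tool."""
    print(f"[{device}] {metric}={sample[metric]:.1f} exceeded alarm line -> start sub-diagnosis")

def monitor(device, cycle_ms, duration_s):
    """Monitor-first, on-demand capture: sample every cycle, diagnose only on breach."""
    deadline = time.time() + duration_s          # user-specified diagnostic window
    while time.time() < deadline:
        sample = collect_io_metrics(device)
        for metric in METRICS:
            if sample[metric] > dynamic_alarm_line(device, metric):
                run_diagnosis(device, metric, sample)
        time.sleep(cycle_ms / 1000.0)

monitor("vdb", cycle_ms=1000, duration_s=5)
```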

Dynamic Threshold Mechanism

Metric Collection

At each configured cycle (in milliseconds) the service gathers I/O indicators such as iowait, iops, bps, qusize, await and util.
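
The article does not specify how these indicators are read. On Linux, one common source is /proc/diskstats; the sketch below derives iops, bps, await, and util from two snapshots, purely as an illustration of the kind of collection an agent might perform.

```python
import time

def read_diskstats(device):
    """Parse one /proc/diskstats line for the given device (Linux-specific)."""
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == device:
                v = list(map(int, parts[3:]))
                return {
                    "ios": v[0] + v[4],       # completed reads + writes
                    "sectors": v[2] + v[6],   # 512-byte sectors read + written
                    "ticks_rw": v[3] + v[7],  # ms spent servicing reads + writes
                    "io_ticks": v[9],         # ms the device had I/O in flight
                }
    raise ValueError(f"device {device} not found")

def sample_io(device, cycle_ms=1000):
    """Derive iops, bps, await, and util from two snapshots one cycle apart."""
    a = read_diskstats(device)
    time.sleep(cycle_ms / 1000.0)
    b = read_diskstats(device)
    dt = cycle_ms / 1000.0
    d_ios = b["ios"] - a["ios"]
    return {
        "iops": d_ios / dt,
        "bps": (b["sectors"] - a["sectors"]) * 512 / dt,
        "await": (b["ticks_rw"] - a["ticks_rw"]) / d_ios if d_ios else 0.0,
        "util": min(1.0, (b["io_ticks"] - a["io_ticks"]) / cycle_ms),
    }

# Example (device name varies by machine):
print(sample_io("vda"))
```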

Base Threshold

A sliding time window computes the maximum deviation of each point from the window’s average, producing an “instant fluctuation value”. Averaging these values across consecutive windows yields a continuously adapting base threshold that reflects recent normal variance.
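
The exact formula is not published; the following sketch shows one plausible reading of the description above, with the window size and step chosen arbitrarily for illustration.

```python
def instant_fluctuation(window):
    """Max deviation of any point in the window from the window's average."""
    avg = sum(window) / len(window)
    return max(abs(x - avg) for x in window)

def base_threshold(samples, window_size=10):
    """Average the instant fluctuation values of consecutive sliding windows.

    The result tracks recent 'normal' variance and adapts as new data arrives.
    A step of 1 is assumed here; the real windowing scheme is not documented.
    """
    windows = [samples[i:i + window_size]
               for i in range(len(samples) - window_size + 1)]
    flucts = [instant_fluctuation(w) for w in windows]
    return sum(flucts) / len(flucts)

# Example: a mostly flat await series with small noise yields a small base threshold.
history = [5.1, 5.0, 5.3, 4.9, 5.2, 5.0, 5.1, 5.4, 4.8, 5.0] * 3
print(base_threshold(history))
```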

Compensation Threshold

When the base threshold declines rapidly during stable periods, a compensation value is added to avoid classifying normal micro‑fluctuations as anomalies.

Minimum Static Threshold

A predefined static floor (the business‑acceptable lower bound) is compared with the dynamic threshold; the larger of the two becomes the final alarm line, ensuring only truly abnormal spikes trigger alerts.
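
Putting the pieces together, a hedged sketch of how the compensation term and the static floor could combine into the final alarm line; the weighting factor and formula are assumptions for illustration, not the published behavior.

```python
def final_alarm_line(base, prev_line, static_floor, comp_factor=0.5):
    """Combine the base threshold, a compensation term, and the static floor."""
    # Compensation: if the base dropped sharply versus the previous alarm line
    # (e.g. during a very stable period), add back part of the drop so that
    # ordinary micro-fluctuations are not flagged as anomalies.
    compensation = comp_factor * max(0.0, prev_line - base)
    dynamic = base + compensation
    # The business-acceptable static floor wins if the dynamic value is lower.
    return max(dynamic, static_floor)

# Example: a quiet period shrinks the base threshold from 20 to 4;
# compensation and the floor keep the alarm line from collapsing to 4.
print(final_alarm_line(base=4.0, prev_line=20.0, static_floor=10.0))  # -> 12.0
```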

Anomaly Detection Strategy

Establish alarm baseline: For each metric, set the alarm line to the greater of the static floor and the dynamic threshold.

Trigger diagnosis: If a metric exceeds the baseline and satisfies additional conditions (duration, repeat count), the corresponding diagnostic workflow starts (see the sketch after this list).

Model update: New data continuously refines the dynamic thresholds, adapting to evolving workload patterns.
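
A minimal sketch of this trigger logic, assuming a sustained‑duration check plus a repeat count within a sliding window; the parameter names and defaults are hypothetical, not CloudMonitor configuration keys.

```python
import time
from collections import deque

class AnomalyDetector:
    """Illustrative trigger: a metric must stay above the alarm line for a minimum
    duration and recur a minimum number of times before diagnosis starts."""

    def __init__(self, min_duration_s=5, min_repeats=3, window_s=60):
        self.min_duration_s = min_duration_s
        self.min_repeats = min_repeats
        self.window_s = window_s
        self.breach_started = None
        self.counted = False
        self.breaches = deque()                 # timestamps of sustained breach episodes

    def observe(self, value, alarm_line, now=None):
        now = time.time() if now is None else now
        if value <= alarm_line:
            self.breach_started = None          # breach episode ended
            self.counted = False
            return False
        if self.breach_started is None:
            self.breach_started = now
            self.counted = False
        if now - self.breach_started < self.min_duration_s:
            return False                        # not sustained long enough yet
        if not self.counted:
            self.breaches.append(now)           # count each sustained episode once
            self.counted = True
        while self.breaches and now - self.breaches[0] > self.window_s:
            self.breaches.popleft()             # drop episodes outside the window
        return len(self.breaches) >= self.min_repeats

# Feed det.observe(metric_value, alarm_line) at every sampling cycle;
# a True return value means the corresponding diagnostic workflow should start.
det = AnomalyDetector(min_duration_s=5, min_repeats=3)
```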

Intelligent Diagnosis and Frequency Control

Diagnosis cool‑down (triggerInterval): Minimum interval between two diagnoses, to avoid excessive scanning.

Exception accumulation (reportInterval): Number of anomaly occurrences required within a time window before a diagnosis is launched; a value of 0 means immediate execution after the cool‑down. A minimal gating sketch follows below.
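
In the sketch below, triggerInterval and reportInterval mirror the parameters named above, while the surrounding class and its behavior are illustrative assumptions rather than the service's actual implementation.

```python
import time

class FrequencyControl:
    """Sketch of the cool-down / accumulation gating described above."""

    def __init__(self, trigger_interval_s, report_count):
        self.trigger_interval_s = trigger_interval_s  # triggerInterval: minimum gap between diagnoses
        self.report_count = report_count              # reportInterval: anomalies needed (0 = immediate)
        self.last_diagnosis = 0.0
        self.pending = 0

    def on_anomaly(self, now=None):
        """Return True when a diagnosis should actually be launched."""
        now = time.time() if now is None else now
        if now - self.last_diagnosis < self.trigger_interval_s:
            return False                              # still cooling down
        self.pending += 1
        if self.report_count and self.pending < self.report_count:
            return False                              # keep accumulating occurrences
        self.last_diagnosis = now
        self.pending = 0
        return True

fc = FrequencyControl(trigger_interval_s=300, report_count=0)
print(fc.on_anomaly())  # True: report_count == 0 means run right after the cool-down
```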

Root‑Cause Analysis

IO Burst: Identifies processes contributing most to I/O during the anomaly window, including kernel kworker activity.

IO Latency: Shows latency distribution and highlights the slowest path (device or file).

High iowait: Lists processes causing excessive wait and explains underlying reasons such as disk saturation or slow dirty‑page flushing.

Case Studies

iowait High

A customer reported overall slowness; the one‑click diagnosis pinpointed the task_server process waiting on disk I/O and recommended lowering dirty_ratio and dirty_bytes to reduce dirty‑page pressure.

[Figure: iowait high case diagram]

IO Latency High

Another case showed sustained write latency. Diagnosis revealed DiskBlockWrite as the main load, with most time spent in the dirty‑page flush stage. Recommendations included reducing buffer‑write bursts and tuning dirty_ratio and dirty_background_ratio to lower write latency.

[Figure: IO latency case diagram]

References

IO One‑Click Diagnosis – https://help.aliyun.com/zh/cms/cloudmonitor-2-0/io-key-diagnosis

CloudMonitor‑ECS Insight – SysOM System Diagnosis – https://cmsnext.console.aliyun.com/next/region/cn-shanghai/workspace/default-cms-1808078950770264-cn-shanghai/app/host/host-sysom

OS Console Instance Management – https://help.aliyun.com/zh/alinux/user-guide/system-management
