How Alibaba Cloud’s One‑Click I/O Diagnosis Tackles Cloud‑Native I/O Bottlenecks

This article explains how Alibaba Cloud CloudMonitor 2.0 integrates SysOM intelligent diagnosis to automatically detect, analyze, and remediate I/O anomalies in multi‑tenant cloud environments, detailing the architecture, dynamic threshold algorithm, anomaly‑trigger logic, and real‑world case studies.

Background

The rapid growth of AI training data, logs, and media in cloud environments drives a sharp increase in I/O request rates. In multi‑tenant, hybrid‑cloud, or multi‑cloud deployments, concurrent access to shared storage creates I/O contention and performance bottlenecks, while the diversity of storage stacks makes fault localization difficult.

Key Technical Challenges

Ambiguous I/O anomaly types – Users cannot easily distinguish latency spikes, throughput saturation, and other I/O issues, so expert intervention is often required.

Insufficient real‑time evidence – Traditional monitoring captures generic OS metrics; by the time an alarm fires the root cause may have already passed.

Disconnected metric‑to‑diagnosis mapping – Isolated metrics must be manually correlated with diagnostic tools, increasing effort and error.

Solution Overview

Alibaba Cloud CloudMonitor 2.0, together with the SysOM intelligent diagnosis module, implements a “detect‑analyze‑remediate” workflow for common abnormal I/O scenarios. The system follows a “monitor‑first, on‑demand capture” model: during a user‑specified time window it periodically reads I/O metrics, triggers a sub‑diagnostic tool when a metric exceeds a dynamic threshold, and generates a structured diagnostic report.
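
As a concrete illustration of this model, the sketch below shows what such a diagnosis task's parameters might look like. The field names echo the metrics and frequency‑control parameters described later in this article; they are illustrative only, not the product's actual API.

# Hypothetical diagnosis-task parameters (illustrative names, not the real API)
io_diagnosis_task = {
    "window": ("2024-06-01T10:00", "2024-06-01T12:00"),  # user-specified time window
    "cycle_ms": 1000,                                     # metric-collection cycle
    "metrics": ["await", "util", "tps", "iops", "qu-size", "iowait"],
    "triggerInterval": 300,   # cool-down (seconds) between diagnoses of the same type
    "reportInterval": 3,      # anomalies required after the cool-down before triggering
}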

Architecture

The workflow consists of four stages:

Metric collection – At a configurable cycle (in milliseconds), the system reads key I/O metrics such as await, util, tps, iops, qu‑size, and iowait (a minimal collection sketch follows this list).

Dynamic‑threshold anomaly detection – Collected values are compared against a three‑layer threshold (base, compensation, static minimum). An anomaly is flagged when the value exceeds the larger of the dynamic (base + compensation) and static thresholds.

Automatic diagnostic trigger – The system selects the appropriate sub‑diagnostic tool based on the metric type, applies frequency‑control parameters, and executes the analysis.

Result aggregation – Diagnostic output is summarized, visualized and presented with root‑cause insights and remediation suggestions.
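
As referenced in the metric‑collection stage above, here is a minimal sketch of how such metrics can be derived on Linux by diffing two snapshots of /proc/diskstats, much as iostat does. The actual CloudMonitor/SysOM collector is not public, so treat this purely as an illustration.

# Minimal sketch: derive tps, await, and util for one device from /proc/diskstats
import time

def read_diskstats(device):
    """Return (completed I/Os, total wait ms, busy ms) for a block device."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                reads, writes = int(fields[3]), int(fields[7])
                wait_ms = int(fields[6]) + int(fields[10])  # time spent reading + writing
                busy_ms = int(fields[12])                   # time the device was busy
                return reads + writes, wait_ms, busy_ms
    raise ValueError(f"device {device} not found")

def sample_io_metrics(device, interval_s=1.0):
    ios0, wait0, busy0 = read_diskstats(device)
    time.sleep(interval_s)
    ios1, wait1, busy1 = read_diskstats(device)
    d_ios = ios1 - ios0
    return {
        "tps": d_ios / interval_s,
        "await": (wait1 - wait0) / d_ios if d_ios else 0.0,         # ms per request
        "util": min(100.0, (busy1 - busy0) / (interval_s * 10.0)),  # % of time busy
    }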

Dynamic Threshold Mechanism

The threshold consists of three components:

Base threshold – A sliding‑window algorithm computes the maximum deviation of each data point from the window’s average (instantaneous fluctuation). The average of these fluctuations over consecutive windows forms an adaptive baseline.

Compensation threshold – Added to the base threshold to smooth rapid declines during quiet periods, preventing false alarms caused by normal noise.

Minimum static threshold – A business‑defined lower bound. The final alarm threshold is the greater of (base + compensation) and this static value.

This three‑layer design enables detection of short‑lived spikes while keeping false‑positive rates low.
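
To make the three‑layer design concrete, here is a minimal sketch that follows the description above. The window size, compensation value, and static minimum are illustrative assumptions; the production algorithm's exact tuning is not disclosed.

# Sketch of the three-layer dynamic threshold (parameters are illustrative)
from collections import deque

class DynamicThreshold:
    def __init__(self, window_size=10, compensation=2.0, static_min=20.0):
        self.window = deque(maxlen=window_size)        # sliding window of recent samples
        self.fluctuations = deque(maxlen=window_size)  # max deviation seen per full window
        self.compensation = compensation               # smooths declines in quiet periods
        self.static_min = static_min                   # business-defined lower bound

    def update(self, value):
        self.window.append(value)
        if len(self.window) == self.window.maxlen:
            avg = sum(self.window) / len(self.window)
            # instantaneous fluctuation: largest deviation from the window's average
            self.fluctuations.append(max(abs(v - avg) for v in self.window))

    def threshold(self):
        if not self.fluctuations:
            return self.static_min
        base = sum(self.fluctuations) / len(self.fluctuations)  # adaptive baseline
        # final alarm threshold: the greater of (base + compensation) and the static value
        return max(base + self.compensation, self.static_min)

An anomaly would then be flagged whenever the latest sample exceeds threshold(), which is how short‑lived spikes get caught without tripping on normal background noise.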

Implementation Details

During each cycle the system performs:

# Pseudocode for the per-cycle monitoring loop (helper names are placeholders)
import time

while within_diagnosis_window():
    metrics = read_io_metrics()           # await, util, tps, iops, qu-size, iowait
    for m in metrics:
        # dynamic threshold = max(base + compensation, static minimum)
        if m.value > compute_dynamic_threshold(m):
            if can_trigger_diagnosis(m):  # frequency control (see below)
                run_subdiagnostic(m)
    time.sleep(cycle_ms / 1000.0)         # the cycle is configured in milliseconds

Frequency Control

triggerInterval – Minimum interval (seconds) between two diagnoses of the same type to avoid repeated scans.

reportInterval – Number of anomaly occurrences required after the cool‑down period before a diagnosis is launched. When set to 0, any anomaly after the cool‑down triggers immediate diagnosis.
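
A minimal sketch of how these two parameters could gate diagnosis launches, following the descriptions above; the class and its default values are illustrative, not the product's implementation.

# Sketch of frequency control using triggerInterval and reportInterval (illustrative)
import time

class FrequencyControl:
    def __init__(self, trigger_interval=300, report_interval=3):
        self.trigger_interval = trigger_interval  # seconds between diagnoses of one type
        self.report_interval = report_interval    # anomalies required after the cool-down
        self.last_run = float("-inf")
        self.anomaly_count = 0

    def can_trigger(self, now=None):
        now = time.time() if now is None else now
        if now - self.last_run < self.trigger_interval:
            return False                          # still in the cool-down period
        self.anomaly_count += 1
        # reportInterval == 0: any anomaly after the cool-down triggers immediately
        if self.report_interval == 0 or self.anomaly_count >= self.report_interval:
            self.last_run = now
            self.anomaly_count = 0
            return True
        return False

In the monitoring loop above, can_trigger_diagnosis(m) would consult one such controller per diagnosis type.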

Root‑Cause Analysis

After data capture, the system automatically extracts structured insights:

Identify the processes that contribute the most I/O (I/O burst contributors); a simplified sketch of this step follows the list.

Highlight paths or devices with the highest latency.

Pinpoint processes and reasons for elevated iowait (e.g., disk saturation, slow dirty‑page flushing).
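
As referenced in the first item above, a simplified sketch of ranking processes by storage I/O via procfs accounting is shown below. The product's own collector and attribution logic are not public, so this is only an approximation; it also reports cumulative counters, whereas burst detection would diff two samples.

# Simplified sketch: rank processes by bytes actually read from / written to storage
import os

def top_io_processes(n=5):
    usage = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/io") as f:
                stats = dict(line.split(": ") for line in f.read().splitlines())
            total = int(stats["read_bytes"]) + int(stats["write_bytes"])
            with open(f"/proc/{pid}/comm") as f:
                name = f.read().strip()
            usage.append((total, int(pid), name))
        except (OSError, KeyError, ValueError):
            continue  # process exited, permission denied, or malformed entry
    return sorted(usage, reverse=True)[:n]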

Case Studies

High iowait

A customer observed overall response slowdown. The diagnostic report identified the task_server process waiting on disk I/O and recommended lowering dirty_ratio and dirty_bytes to reduce write‑back pressure.

High I/O latency

Another case showed sustained write‑latency spikes. The analysis pinpointed DiskBlockWrite as the dominant load and suggested adjusting dirty_ratio and dirty_background_ratio to control dirty‑page flushing, thereby reducing latency.
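
For context, the dirty‑page parameters mentioned in both cases correspond to Linux vm.* sysctls. The sketch below shows one way such a change could be applied programmatically; the specific values are placeholders for illustration, not tuning advice from the diagnostic reports, and changing them requires root.

# Illustrative only: apply dirty-page write-back sysctls via procfs (requires root)
def set_sysctl(name, value):
    path = "/proc/sys/" + name.replace(".", "/")
    with open(path, "w") as f:
        f.write(str(value))

# Flush dirty pages earlier and in smaller bursts to ease write-back pressure
set_sysctl("vm.dirty_background_ratio", 5)   # start background flushing at 5% of RAM
set_sysctl("vm.dirty_ratio", 10)             # throttle writers once dirty pages reach 10%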

References

IO One‑Click Diagnosis: https://help.aliyun.com/zh/cms/cloudmonitor-2-0/io-key-diagnosis

SysOM System Diagnosis: https://cmsnext.console.aliyun.com/next/region/cn-shanghai/workspace/default-cms-1808078950770264-cn-shanghai/app/host/host-sysom
