
How Alibaba Cloud’s One‑Click IO Diagnosis Solves Multi‑Tenant Performance Bottlenecks

The article explains how Alibaba Cloud’s OS console implements a one‑click IO diagnostic that automatically detects, classifies, and resolves high‑latency, burst, and iowait IO issues in multi‑tenant cloud environments by using dynamic thresholds, periodic metric collection, and targeted root‑cause analysis.

Alibaba Cloud Infrastructure

Background

As cloud workloads grow, AI training data, logs, and media drive exponential growth in stored data and a corresponding surge in I/O requests. In multi‑tenant clouds, shared storage resources create contention, and hybrid‑cloud architectures make I/O troubleshooting difficult. Alibaba Cloud’s OS console therefore introduced a one‑click I/O diagnostic covering detection, root‑cause analysis, and remediation.

Key Pain Points

Pain Point 1 – Lack of Issue Type Identification

Users cannot tell whether a problem is an I/O latency spike, a full‑capacity issue, or high iowait, forcing reliance on operations staff and raising labor costs. The one‑click tool focuses on the most frequent issues, such as high latency, traffic anomalies, and high iowait, and classifies them automatically.

Pain Point 2 – Lost Evidence and Hard-to‑Capture Traces

Traditional monitoring alerts on metric spikes, but by the time someone responds, the window for capturing detailed diagnostics has often closed, making fine‑grained evidence hard to obtain. Rapid identification and action are essential to seize the diagnostic window.

Pain Point 3 – Fragmented Metrics and Weak Correlation

Existing metrics live in silos; correlating util, await, iops, and bps requires manual mapping and prior knowledge of the underlying diagnostic tools. The one‑click solution abstracts these complexities and delivers a direct analysis report.

Solution Overview

The OS console’s SysOM component already supports diagnosing high latency, traffic anomalies, and high iowait. The one‑click diagnostic periodically reads I/O metrics, detects anomalies via dynamic thresholds, and triggers specialized sub‑diagnostic tools, forming a closed loop of “detect → diagnose → root‑cause”.
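The control flow can be pictured as a small watchdog loop. The sketch below is illustrative only; the function names (read_metrics, compute_thresholds, run_sub_diagnosis) are placeholders standing in for the console's components, not SysOM APIs.

```python
import time

CYCLE_MS = 1000  # hypothetical sampling interval

def run_io_watchdog(read_metrics, compute_thresholds, run_sub_diagnosis):
    """Detect -> diagnose -> root-cause loop (illustrative only)."""
    while True:
        metrics = read_metrics()                  # e.g. await, util, iops, bps, iowait
        thresholds = compute_thresholds(metrics)  # dynamic threshold per metric
        anomalies = {name: value
                     for name, value in metrics.items()
                     if value > thresholds[name]}
        if anomalies:
            # Hand off to the specialized sub-diagnostic (latency / burst / iowait)
            report = run_sub_diagnosis(anomalies)
            print(report)
        time.sleep(CYCLE_MS / 1000)
```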

Architecture

Dynamic thresholds adapt to varying business scenarios, avoiding the false positives of static thresholds. The architecture diagram below illustrates metric collection, threshold calculation, anomaly detection, and reporting.

IO diagnostic architecture diagram

Implementation Details

Metric Collection

When triggered, the system samples await, util, tps, iops, bps, qusize (queue size), and iowait at a configurable sampling interval (the cycle parameter, in milliseconds).
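For readers who want to see where such numbers come from on a stock Linux host (this is not the console's actual collector), the block‑device counters can be derived from /proc/diskstats, and iowait would come from /proc/stat. A minimal sampling sketch:

```python
import time

SECTOR_SIZE = 512  # /proc/diskstats reports 512-byte sectors

def read_diskstats(device):
    """Return (ios, sectors, service_ms, busy_ms) for one block device."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                reads, writes = int(fields[3]), int(fields[7])
                sectors = int(fields[5]) + int(fields[9])
                read_ms, write_ms = int(fields[6]), int(fields[10])
                busy_ms = int(fields[12])  # time the device was busy with I/O
                return reads + writes, sectors, read_ms + write_ms, busy_ms
    raise ValueError(f"device {device} not found")

def sample(device, cycle_ms=1000):
    """Compute iops, bps, await and util over one sampling cycle."""
    ios0, sec0, svc0, busy0 = read_diskstats(device)
    time.sleep(cycle_ms / 1000)
    ios1, sec1, svc1, busy1 = read_diskstats(device)
    dt = cycle_ms / 1000
    d_ios = ios1 - ios0
    return {
        "iops": d_ios / dt,
        "bps": (sec1 - sec0) * SECTOR_SIZE / dt,
        "await": (svc1 - svc0) / d_ios if d_ios else 0.0,   # ms per request
        "util": min(100.0, (busy1 - busy0) / (dt * 1000) * 100),
    }
```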

Dynamic Threshold Calculation

Three steps produce the final threshold; a code sketch of the calculation follows the list:

Basic Threshold: Using a sliding window, compute the maximum deviation from the window mean as the instantaneous fluctuation, then average these fluctuations to obtain a self‑adjusting baseline.

Compensation Threshold: Add a buffer on top of the baseline, derived from the observed stable range, to prevent the threshold from dropping too quickly and to reduce false alarms.

Minimum Threshold: A static lower bound; the final anomaly threshold is the greater of this static minimum and the dynamic value (basic + compensation).
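Here is a minimal sketch of the three‑step calculation for a single metric; the window size and compensation factor are illustrative placeholders, not the console's actual values.

```python
def dynamic_threshold(history, window=10, min_static=0.0, comp_factor=0.5):
    """Combine basic, compensation and minimum thresholds for one metric.

    history: recent samples of the metric, oldest first.
    window and comp_factor are illustrative placeholders.
    """
    if len(history) < window:
        return min_static

    # Basic threshold: in each sliding window, take the maximum deviation
    # from the window mean (the "instantaneous fluctuation"), then average
    # those fluctuations to get a self-adjusting baseline.
    fluctuations = []
    for i in range(len(history) - window + 1):
        win = history[i:i + window]
        mean = sum(win) / window
        fluctuations.append(max(abs(v - mean) for v in win))
    basic = sum(fluctuations) / len(fluctuations)

    # Compensation threshold: a buffer derived from the observed stable
    # range, so the threshold does not collapse (and re-alarm) when the
    # metric briefly quiets down.
    compensation = comp_factor * (max(history) - min(history))

    # Final threshold: the dynamic value, but never below the static minimum.
    return max(min_static, basic + compensation)
```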

Dynamic threshold calculation

Anomaly Detection

For each metric, the system derives a warning line as the greater of the static minimum threshold and the dynamic threshold; whenever a sampled value crosses this line, the metric is flagged as anomalous and diagnosis is triggered.
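Continuing the sketch above, detection then reduces to comparing each metric's latest sample with its warning line (the per‑metric static floors here are hypothetical):

```python
def check_anomalies(latest, histories, min_static):
    """Return the metrics whose latest sample crosses their warning line."""
    anomalies = {}
    for name, value in latest.items():
        warning_line = dynamic_threshold(histories[name],
                                         min_static=min_static.get(name, 0.0))
        if value > warning_line:
            anomalies[name] = (value, warning_line)
    return anomalies
```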

Smart Diagnosis

To avoid excessive runs, two parameters control diagnosis frequency; a sketch of the gating logic follows the list:

triggerInterval: Minimum interval between consecutive diagnoses (cool‑down period).

reportInterval: When non‑zero, the number of anomalies required within a window before diagnosis is triggered.
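A sketch of how these two knobs might gate diagnosis runs; the class and default values are illustrative, and only the two parameters' semantics come from the description above.

```python
import time

class DiagnosisGate:
    """Rate-limit diagnosis runs using triggerInterval and reportInterval."""

    def __init__(self, trigger_interval_s=300, report_interval=3):
        self.trigger_interval_s = trigger_interval_s  # cool-down between runs
        self.report_interval = report_interval        # anomalies required per window
        self.last_run = 0.0
        self.anomaly_count = 0

    def should_diagnose(self, anomaly_detected):
        if not anomaly_detected:
            return False
        self.anomaly_count += 1
        now = time.monotonic()
        # Respect the cool-down period between consecutive diagnoses.
        if now - self.last_run < self.trigger_interval_s:
            return False
        # When reportInterval is non-zero, require that many anomalies first.
        if self.report_interval and self.anomaly_count < self.report_interval:
            return False
        self.last_run = now
        self.anomaly_count = 0
        return True
```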

Root‑Cause Analysis

The diagnostic tool extracts relevant logs and correlates them with the offending process. Example outputs include:

For IO burst, the process contributing most IO and any kworker buffer‑flush activity.

For high latency, the latency distribution and the path with the greatest delay.

For high iowait, the waiting process and its cause.

Case Studies

High iowait

The tool identified a task_server process waiting on disk IO due to excessive dirty pages. Recommendations included lowering dirty_ratio and dirty_bytes to reduce write‑back pressure.
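For context, the settings involved are ordinary Linux write‑back sysctls. The small sketch below snapshots the current values that a tuning decision would look at; it is illustrative and not part of the diagnostic tool.

```python
def read_vm_setting(name):
    """Read a vm.* sysctl value from /proc/sys/vm (Linux only)."""
    with open(f"/proc/sys/vm/{name}") as f:
        return int(f.read().strip())

def dirty_page_snapshot():
    """Collect the write-back settings relevant to dirty-page pressure."""
    settings = {name: read_vm_setting(name)
                for name in ("dirty_ratio", "dirty_background_ratio",
                             "dirty_bytes", "dirty_background_bytes")}
    # Current amount of dirty memory, from /proc/meminfo (value is in kB).
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Dirty:"):
                settings["dirty_kb"] = int(line.split()[1])
                break
    return settings

if __name__ == "__main__":
    print(dirty_page_snapshot())
```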

High IO latency

Monitoring revealed elevated latency on write traffic. The diagnostic pinpointed the DiskBlockWrite process and suggested reducing buffered I/O or adjusting dirty_ratio and dirty_background_ratio to alleviate the delay.

Case study illustration