How AI Detects Cluster-Wide Task Slowdowns in Cloud Systems

A new AI‑driven method for detecting cluster‑wide task slowdowns in cloud platforms improves F1 score by an average of 5.3% over state‑of‑the‑art techniques. It addresses three challenges: composite periodic patterns, training‑data contamination, and the need to flag only slowdown anomalies.

Alibaba Cloud Big Data AI Platform

Opening

Recently, the big‑data engineering team of the Alibaba Cloud Computing Platform, together with Zhejiang University, had the paper "Cluster‑Wide Task Slowdown Detection in Cloud System" accepted at ACM SIGKDD 2024. The work introduces a neural‑network‑based approach for detecting cluster‑wide job slowdown anomalies in cloud clusters, improving F1 score by 5.3% on average over existing state‑of‑the‑art (SOTA) anomaly‑detection algorithms.

Background

Detecting a slowdown across the entire job distribution of a cluster is a branch of time‑series anomaly detection. Supervised methods require extensive manual labeling, which is impractical in many real scenarios. Unsupervised methods avoid labeling but suffer from training‑set contamination, where unlabeled anomalies are mixed into the supposedly normal training data. Recent unsupervised reconstruction‑based methods have progressed from RNN‑based backbones (e.g., OmniAnomaly, MSCRED) to transformer‑based models (e.g., AnomalyTransformer, DCdetector, TranAD). However, attention mechanisms tend to overlook low‑amplitude periodic signals, and these signals matter because cluster‑wide job execution times exhibit composite periodicity.

Challenge

The slowdown detection task faces three main problems: (1) composite periodic signals of varying amplitudes are poorly captured by attention‑based networks; (2) training data in production often contain hidden anomalies, which degrades unsupervised performance; (3) existing unsupervised detectors treat any deviation from the normal distribution as an anomaly, whereas this task cares only about slowdown anomalies.

Breakthrough

To improve how attention handles composite periodic information, we first analyze the weight distribution of standard attention and find that it favors high‑amplitude cycles while ignoring low‑amplitude ones. We therefore propose a skim‑cycle method that iteratively reconstructs the high‑amplitude signal and subtracts it from the original, feeding the residual into the next iteration.
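The reconstruct-and-subtract loop can be illustrated with a small sketch. It is not the paper's implementation: the learned attention-based reconstructor is replaced here by a plain discrete Fourier transform that picks the single strongest sinusoid, and the names `dominant_component` and `skim_cycles` are hypothetical. The only point is to show how a low-amplitude cycle becomes visible once the dominant one has been removed.

```python
import cmath
import math


def dominant_component(x):
    """Return (frequency index, reconstructed sinusoid) for the strongest
    non-constant cycle in x. Assumption: a DFT stands in for the learned
    reconstructor described in the article."""
    n = len(x)
    best_k, best_coef = 0, 0j
    # Scan frequencies strictly between DC and Nyquist.
    for k in range(1, (n + 1) // 2):
        coef = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        if abs(coef) > abs(best_coef):
            best_k, best_coef = k, coef
    # A real sinusoid is split across bins k and n-k, hence the factor 2.
    component = [
        2 * (best_coef * cmath.exp(2j * math.pi * best_k * t / n)).real / n
        for t in range(n)
    ]
    return best_k, component


def skim_cycles(x, rounds):
    """Repeatedly reconstruct the dominant cycle and subtract it,
    passing the residual to the next round."""
    residual = list(x)
    found = []
    for _ in range(rounds):
        k, component = dominant_component(residual)
        found.append(k)
        residual = [r - c for r, c in zip(residual, component)]
    return found, residual


# Toy series: a strong cycle (frequency 2) plus a weak one (frequency 9).
n = 64
series = [
    5.0 * math.sin(2 * math.pi * 2 * t / n) + 0.5 * math.sin(2 * math.pi * 9 * t / n)
    for t in range(n)
]
found, residual = skim_cycles(series, rounds=2)
print(found)  # frequency indices in the order they were skimmed off
```

In this toy setting the weak cycle is only picked up in the second round, after the strong one has been subtracted, which mirrors the motivation given above.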

To mitigate training‑set contamination, we introduce Picky Loss, which adaptively assigns higher weights to normal data and lower weights to anomalous data. It leverages the observation that normal points receive broader, more uniform attention, whereas anomalies attract concentrated attention. A Gaussian‑filtered attention sum quantifies how normal each point is, and this score guides the weighting.
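As a rough illustration of the weighting idea, the sketch below down-weights points whose received attention is tightly concentrated. The article does not give the exact formula, so everything here is an assumption: the score is computed per column of a row-stochastic attention matrix (attention received by a point), the Gaussian window is centered on the point itself, and the function names `received_concentration`, `picky_weights`, and `weighted_reconstruction_loss` are hypothetical.

```python
import math


def received_concentration(attention, sigma=1.0):
    """For each time step j, the Gaussian-windowed share of the attention
    it receives (column j). Values near 1 mean the attention is concentrated
    close to j. Assumption: this is one possible reading of the
    'Gaussian-filtered attention sum' mentioned in the article."""
    n = len(attention)
    scores = []
    for j in range(n):
        received = sum(attention[i][j] for i in range(n))
        if received == 0:
            scores.append(0.0)  # nothing received: treat as not concentrated
            continue
        windowed = sum(
            attention[i][j] * math.exp(-((i - j) ** 2) / (2 * sigma ** 2))
            for i in range(n)
        )
        scores.append(windowed / received)
    return scores


def picky_weights(attention, sigma=1.0):
    """Map concentration to per-point loss weights that sum to 1.
    Broad attention (low concentration) gets a higher weight."""
    raw = [1.0 - c for c in received_concentration(attention, sigma)]
    total = sum(raw)
    if total == 0:
        return [1.0 / len(raw)] * len(raw)  # fall back to uniform weights
    return [r / total for r in raw]


def weighted_reconstruction_loss(x, x_hat, weights):
    """Weighted squared reconstruction error."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, x_hat))


# Toy attention matrix: point 2 is attended to only by itself.
third = 1.0 / 3.0
attention = [
    [third, third, 0.0, third],
    [third, third, 0.0, third],
    [0.0, 0.0, 1.0, 0.0],
    [third, third, 0.0, third],
]
weights = picky_weights(attention)
loss = weighted_reconstruction_loss([1.0, 1.0, 5.0, 1.0], [1.0, 1.0, 1.0, 1.0], weights)
print(weights, loss)
```

In the toy example the suspicious point contributes nothing to the loss, so a contaminated sample would not pull the reconstructor toward fitting it. A real model would compute these weights from its own attention maps during training.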

For targeted slowdown detection, we design a Neural OT module that suppresses reconstruction of time slices corresponding to overall slowdown while preserving reconstruction quality for other slices, causing larger reconstruction errors for slowdown periods.

Application

The proposed algorithms have been rolled out in a gray (canary) release in MaxCompute, Alibaba Cloud’s cloud‑native big‑data computing service, for cluster anomaly monitoring. They help operations teams assess cluster health and proactively identify potential risks.

Paper Information

● Paper title: Cluster‑Wide Task Slowdown Detection in Cloud System

● Authors: Feiyi Chen, Yingying Zhang, Lunting Fan, Yuxuan Liang, Guansong Pang, Qingsong Wen, Shuiguang Deng

● PDF link: https://arxiv.org/abs/2408.04236

● Selected references:

Su Y, Zhao Y, Niu C, et al. Robust anomaly detection for multivariate time series through stochastic recurrent neural network. KDD 2019.

Zhang C, Song D, Chen Y, et al. A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data. AAAI 2019.

Xu J, Wu H, Wang J, et al. Anomaly transformer: Time series anomaly detection with association discrepancy. arXiv 2021.

Yang Y, Zhang C, Zhou T, et al. Dcdetector: Dual attention contrastive representation learning for time series anomaly detection. KDD 2023.

Tuli S, Casale G, Jennings N R. TranAD: Deep transformer networks for anomaly detection in multivariate time series data. arXiv 2022.
