
How Neural Attention Detects Cluster-Wide Task Slowdowns in Cloud Systems

A new paper accepted at ACM SIGKDD 2024 presents a neural‑network‑based framework that uses a skim‑attention mechanism and a picky loss function to detect cluster‑wide task slowdown anomalies in large‑scale cloud platforms, improving average F1‑score by 5.3% over state‑of‑the‑art methods.

Alibaba Cloud Big Data AI Platform

The paper “Cluster‑Wide Task Slowdown Detection in Cloud System”, written by Alibaba Cloud's Big Data Engineering team in collaboration with Zhejiang University, has been accepted at ACM SIGKDD 2024, a leading data‑mining conference.

The paper tackles the problem of detecting abnormal slowdowns in a cluster's overall workload. Monitoring individual tasks does not identify these reliably: per‑task measurements are noisy in virtualized environments, and collecting them adds resource overhead.

Key challenges include (1) interference from per‑task variability, (2) high computational cost of per‑task monitoring, and (3) training data contamination with unlabeled anomalies that violate unsupervised detection assumptions.
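To picture why a cluster-level view helps with the first two challenges, here is a minimal sketch that groups task durations into fixed time slots and keeps only a robust summary per slot. The function name `slot_summary`, the 300-second slot width, and the sample data are all assumptions for illustration; the paper's actual aggregation may differ.

```python
from collections import defaultdict
from statistics import median

def slot_summary(tasks, slot_seconds=300):
    """Group (start_time, duration) pairs into fixed time slots and
    return {slot_index: median duration} for the whole cluster."""
    buckets = defaultdict(list)
    for start_time, duration in tasks:
        buckets[int(start_time // slot_seconds)].append(duration)
    return {slot: median(durations) for slot, durations in sorted(buckets.items())}

# Hypothetical data: (start_time_seconds, duration_seconds)
tasks = [
    (10, 4.0), (50, 5.0), (120, 90.0),     # slot 0: one noisy straggler
    (310, 9.0), (400, 10.0), (550, 11.0),  # slot 1: every task is slower
]
summary = slot_summary(tasks)
print(summary)
```

In this toy data the single 90-second straggler barely moves the slot 0 median, while the across-the-board slowdown in slot 1 doubles it. That is the kind of signal a cluster-wide detector looks for, and it needs only one summary per slot rather than a monitor per task.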

To address these, the authors propose a novel “skim‑attention” mechanism and a “picky loss” function that handle composite periodicity in workload distributions and mitigate training‑set pollution. An optimal‑transport‑based neural module then performs targeted detection of slowdown anomalies across the whole cluster.
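The idea of a loss that tolerates polluted training data can be sketched as a weighted reconstruction loss in which time slots with unusually large error count for less. This is an illustrative assumption, not the paper's picky loss formula: the softmax weighting, the `temperature` parameter, and the name `picky_style_loss` are all hypothetical.

```python
import math

def picky_style_loss(errors, temperature=1.0):
    """Weighted sum of per-slot reconstruction errors, where slots with
    larger error receive smaller weight (softmax over negative error)."""
    scaled = [-e / temperature for e in errors]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    weights = [x / total for x in exps]
    loss = sum(w * e for w, e in zip(weights, errors))
    return loss, weights

# Hypothetical per-slot errors; the last slot may be an unlabeled anomaly.
errors = [0.1, 0.2, 0.15, 5.0]
loss, weights = picky_style_loss(errors)
plain_mean = sum(errors) / len(errors)
print(loss, plain_mean, weights)
```

With these numbers the suspicious slot receives the smallest weight, so the weighted loss stays well below the plain mean and the model is not pushed to reconstruct the likely anomaly. In a real training loop the weights would typically be treated as constants when computing gradients.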

Experiments show the method improves average F1‑score by 5.3% compared with state‑of‑the‑art anomaly‑detection algorithms. The approach has already been deployed in a gray release (a staged rollout) on MaxCompute, Alibaba Cloud’s native big‑data service, where it helps operators assess cluster health and anticipate risks.
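For readers unfamiliar with the metric, F1-score is the harmonic mean of precision and recall over the anomaly labels. The sketch below computes a plain pointwise F1 on made-up labels; the paper's evaluation protocol may differ in its details, and the data here is purely hypothetical.

```python
def f1_score(y_true, y_pred):
    """Pointwise F1 for binary labels, where 1 marks an anomalous time slot."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed anomalies
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical ground truth and detector output for eight time slots
y_true = [0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]
score = f1_score(y_true, y_pred)
print(score)
```

Here the detector finds three of four anomalies and raises one false alarm, so precision and recall are both 0.75 and so is F1.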

Paper details:
Title: Cluster‑Wide Task Slowdown Detection in Cloud System
Authors: Feiyi Chen, Yingying Zhang, Lunting Fan, Yuxuan Liang, Guansong Pang, Qingsong Wen, Shuiguang Deng
arXiv: https://arxiv.org/abs/2408.04236

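The paper's references include recent work on multivariate time‑series anomaly detection, such as “Robust Anomaly Detection for Multivariate Time Series through Stochastic Recurrent Neural Network” (KDD 2019) and “Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy” (arXiv 2021).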

