Operations 6 min read

How Alibaba Scales Anomaly Detection Across Millions of Metrics

This article explains how Alibaba tackles anomaly detection for tens of millions of metrics in a 100‑thousand‑machine cluster by comparing vertical time‑series methods with horizontal clustering, choosing DBSCAN for large‑scale monitoring, and detailing the ETL, computation, and visualization pipeline.

Alibaba Cloud Big Data AI Platform

Aug 5, 2025

How Alibaba Scales Anomaly Detection Across Millions of Metrics

Background

When a cluster reaches a scale of 100,000 machines and each machine reports over 100 metrics, the monitoring system must handle tens of millions of data points, making fast anomaly detection a critical challenge.

Typical problems include:

Abnormal machines are hard to spot.

The sheer number of metrics exceeds monitoring capacity.

Gray‑released machines may show hidden anomalies.

Construction Idea

Two main approaches are considered for anomaly detection:

Vertical comparison (time‑series) : Detect anomalies by comparing a metric against its historical values using time‑series algorithms.

Horizontal comparison (clustering) : Compare the same metric across different machines over a time window, using clustering techniques to spot outliers.

Choosing the right method involves evaluating scalability, timeliness, accuracy, and manual effort.

Horizontal clustering scores better on scalability and manual effort, while vertical comparison offers higher timeliness.

Technical Practice

The team selected the DBSCAN clustering algorithm because it can group metrics and identify outlier machine groups.

Implementation flow:

Data ETL: Use Alibaba Cloud MaxCompute to perform hourly and daily ETL on metric data.

Data Computation: Run DBSCAN with Python’s sklearn library.

Data Presentation: Visualize results on the business portal.

Visualization components include a PCA‑reduced 2‑D plot, a radar chart for cluster metric differences, and detailed cluster summaries to pinpoint problematic machines or metrics.

Application Effect

The solution uncovered high‑load machines that traditional monitoring missed, such as a subset of an online HDFS cluster where DataNodes were registered but not receiving data.

Conclusion

Horizontal clustering efficiently identifies machines with abnormal behavior, complementing vertical time‑series analysis. Combining both methods provides a more robust anomaly detection framework for large‑scale operations.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Clustering Anomaly Detection time series DBSCAN operational intelligence large-scale monitoring

Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.