Threshold‑Free Business Metric Monitoring Using Machine Learning
This article describes how a machine‑learning‑driven monitoring system replaces fixed thresholds with personalized, anomaly‑based detection for business‑level metrics such as network traffic and access volume, detailing the architecture, sample labeling, model training, alarm grading, and operational benefits.
In the practice of monitoring business services, macro‑level indicators like data‑center network traffic and business access volume often reflect system health more accurately than low‑level metrics such as CPU load or disk usage. However, these indicators typically show large, irregular daily fluctuations, making fixed‑threshold monitoring ineffective.
To address this, the team introduced a machine‑learning approach that enables threshold‑free monitoring of key business metrics, achieving efficient and accurate anomaly detection.
The system architecture consists of three layers: a data layer for historical data storage and offline model training, a core layer for real‑time data distribution, detection, and alarm generation, and a presentation layer that visualizes curves and detection results. An annotation function allows manual verification of anomaly labels.
In the offline module, unlabeled raw samples are first labeled by combining statistical discrimination with unsupervised learning, producing a labeled sample pool for feature extraction and model training. The online module loads the trained model, receives real-time data, performs anomaly detection, and feeds newly confirmed anomalies back into the labeled pool so the model improves continuously.
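The offline/online split and the feedback loop can be sketched roughly as follows. The class and method names here are illustrative, not taken from the original system; the point is that online detections, once confirmed by a human, re-enter the pool that the next offline retrain draws from.

```python
class MonitoringPipeline:
    """Minimal sketch of the online detection loop with label feedback.
    `model` is any object with a predict() method, e.g. a LightGBM
    classifier trained offline on the labeled sample pool."""

    def __init__(self, model):
        self.model = model       # trained offline on labeled samples
        self.labeled_pool = []   # grows as anomalies are confirmed

    def detect(self, features):
        """Online step: score one incoming data point (1 = anomaly)."""
        return self.model.predict([features])[0] == 1

    def confirm(self, features, is_anomaly):
        """Feedback step: a human-verified label re-enters the pool,
        so the next offline retraining round sees it."""
        self.labeled_pool.append((features, 1 if is_anomaly else 0))
```

In this sketch, retraining itself stays offline: the online side only accumulates confirmed labels, which keeps the real-time path cheap.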
For sample labeling, a large amount of unlabeled time‑series data is processed by combining statistical discrimination with unsupervised learning voting, producing high‑confidence normal and abnormal samples for the training set.
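A minimal sketch of the voting idea: two independent detectors each flag points, and only points on which both agree receive a high-confidence label. The article does not name the specific detectors, so a 3-sigma z-score test and an IQR fence stand in here for the statistical and unsupervised components; both are assumptions for illustration.

```python
import numpy as np

def vote_label(series, z_thresh=3.0, iqr_k=3.0):
    """Label points where two detectors agree; leave the rest uncertain.
    Returns 1 (anomaly), 0 (normal), or -1 (no high-confidence label)."""
    x = np.asarray(series, dtype=float)
    # Detector 1: 3-sigma rule on the z-score.
    z_flag = np.abs(x - x.mean()) > z_thresh * x.std()
    # Detector 2: IQR fence (stand-in for the unsupervised learner).
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    iqr_flag = (x < q1 - iqr_k * iqr) | (x > q3 + iqr_k * iqr)
    labels = np.full(len(x), -1)      # -1 = disagreement, left unlabeled
    labels[z_flag & iqr_flag] = 1     # both vote anomaly
    labels[~z_flag & ~iqr_flag] = 0   # both vote normal
    return labels
```

Points where the detectors disagree are simply excluded from the training set, which is what keeps the resulting labels high-confidence.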
Model training uses LightGBM on a balanced dataset (normal samples are down‑sampled). Features are engineered by comparing current values with statistical features from a reference sample pool covering the same minute of the current day, previous day, and previous week, with special handling for holidays.
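The comparison features might look like the sketch below: for each reference window (same minute of the current day, previous day, previous week), the current value is compared against that window's statistics. The window names, feature names, and the exact feature set are illustrative assumptions; the resulting feature vectors would feed the LightGBM classifier.

```python
import numpy as np

def reference_features(current, reference_windows):
    """Derive comparison features for one data point.
    `reference_windows` maps an illustrative window name (e.g.
    "prev_day", "prev_week") to the sample values drawn from the
    same minute of that window."""
    feats = {}
    for window, values in reference_windows.items():
        v = np.asarray(values, dtype=float)
        mean, std = v.mean(), v.std()
        feats[f"{window}_diff"] = current - mean                     # absolute gap
        feats[f"{window}_ratio"] = current / mean if mean else 0.0   # relative level
        feats[f"{window}_zscore"] = (current - mean) / std if std else 0.0
    return feats
```

The holiday handling mentioned in the article would then amount to choosing different reference windows (e.g. the same holiday last year instead of the previous week) before computing the same features.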
After training, the model labels all original unlabeled data, and the monitoring UI is used to verify labeling accuracy, ensuring a labeling precision above 90% in production.
The alarm grading strategy distinguishes between ordinary anomalies, serious anomalies, and steep-change anomalies. Isolated spikes are filtered out by requiring several consecutive abnormal points; serious anomalies are identified by a high standard score and trigger medium-priority alerts (e.g., email, SMS); steep-change anomalies are detected via a steep-change coefficient and trigger high-priority alerts such as voice calls.
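The grading policy above can be sketched as a small decision function. The thresholds and the precise definitions of the standard score and steep-change coefficient are assumptions here; the article states the policy, not the formulas.

```python
def grade_alarm(recent_flags, std_score, steep_coeff,
                min_consecutive=3, serious_score=6.0, steep_threshold=5.0):
    """Map detector output for the latest point to an alarm level.
    recent_flags: per-point anomaly flags, most recent last.
    All numeric thresholds are illustrative defaults."""
    if steep_coeff >= steep_threshold:
        return "high"    # steep-change anomaly -> voice call
    if recent_flags[-min_consecutive:] != [True] * min_consecutive:
        return "none"    # isolated spikes are filtered out
    if std_score >= serious_score:
        return "medium"  # serious anomaly -> email / SMS
    return "low"         # ordinary anomaly
```

One design choice worth noting: the steep-change check runs before the consecutive-point filter, since a genuinely abrupt shift should page immediately rather than wait for confirmation points.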
In summary, by integrating machine‑learning techniques into key business metric monitoring, the system achieves efficient, accurate, threshold‑free detection that adapts to diverse services, scales with increasing server counts, and continuously evolves as more data and AI methods are incorporated.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.