Intelligent Anomaly Detection for Operations Maintenance: Machine Learning Methods and Workflow
This article explains the importance of operations maintenance, outlines the challenges of traditional rule‑based anomaly detection, and describes how machine‑learning‑driven AIOps—including feature engineering, unsupervised and supervised models—can provide more accurate, scalable, and automated detection of server anomalies.
Operations maintenance is essential for providing stable services, yet servers frequently encounter issues such as memory leaks, disk saturation, and software errors that can cause service interruptions.
Traditional rule‑based anomaly detection relies on manually crafted thresholds for metrics like CPU usage or online player count, which often fail to capture subtle or evolving problems and require high maintenance effort.
By leveraging machine learning, intelligent anomaly detection can automatically learn patterns from large, labeled datasets, reducing the need for extensive domain expertise and improving detection across diverse scenarios.
Defining abnormal data in AIOps is challenging; common heuristics compare current metrics with those from the same time period on previous days or weeks, but precise definitions remain ambiguous.
Feature engineering for the detection model includes differences, skewness, kurtosis, standard deviation, variance, volatility, moving averages, predictive and categorical features, fitting features, high‑pass filters, wavelet features, periodic comparisons, and statistical descriptors.
The modeling pipeline consists of sampling curve data, de‑duplicating by type, preprocessing (handling missing values, normalization, and splitting into today/yesterday/last‑week segments), extracting features, training models, and visualizing results.
Three modeling approaches are discussed:
Unsupervised methods such as the Three‑Sigma rule and Isolation Forest, which quickly flag outliers but may produce false alarms not aligned with business expectations.
Supervised ensemble models (XGBoost, Random Forest, GBDT, SVM, Logistic Regression) that incorporate feedback loops, allowing mis‑detections to be labeled and used for iterative improvement.
The supervised workflow includes sampling, manual labeling, building a labeled sample library, feature calculation, model training, and anomaly visualization, followed by continuous feedback to refine the model.
In summary, machine‑learning‑based anomaly detection offers lower development and maintenance costs, higher precision, better generalization, and greater automation compared to manual rule systems, making it a superior solution for modern operations maintenance.
NetEase Game Operations Platform
The NetEase Game Automated Operations Platform delivers stable services for thousands of NetEase titles, focusing on efficient ops workflows, intelligent monitoring, and virtualization.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
