How NetEase Games Built an AIOps Platform to Transform IT Operations
This article explains how NetEase Games leveraged AI, big data, and machine learning to create an AIOps platform that automates anomaly detection, log analysis, and fault localization, improving quality assurance, cost management, and operational efficiency across complex gaming infrastructures.
According to Gartner, AIOps integrates big data and machine learning to process the growing volume, variety, and velocity of IT data, supporting quality assurance, cost management, and efficiency improvement in IT operations.
NetEase Games AIOps Roadmap
Since 2016, NetEase Games has built an intelligent monitoring team and platform. Its features include anomaly detection, prediction, correlation analysis, drill-down analysis, log analysis, operation robots, fault localization, fault early warning, flame-graph analysis, hardware prediction, and CDN file release.
Anomaly Detection
Anomaly detection is a foundational AIOps capability: AI algorithms automatically and accurately identify outliers in monitoring data. Compared with traditional threshold-based methods, it offers easier configuration, higher accuracy, broader coverage, and automatic updates.
Business Golden Metrics
These metrics (e.g., online player count) have strong periodicity, low variance, and strict precision/recall requirements, making supervised models attractive despite limited labeled data.
Sample Construction: Samples are drawn from historical KPI datasets and online user annotations. Unsupervised detectors such as Isolation Forest generate candidate anomalies, which are then manually labeled and selected by stratified sampling.
Preprocessing: Curves are classified with LSTM+CNN, missing values are handled with linear and forward-fill methods, and values are min-max normalized. About 500 features are generated for modeling.
Algorithm Model: Common models such as Random Forest, XGBoost, and GBDT are combined with Logistic Regression for ensemble detection.
Visualization: Graphical alerts, quick annotation links, and anomaly views streamline confirmation and labeling.
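The preprocessing step can be illustrated with a minimal sketch. It assumes that "linear" filling means linear interpolation between the nearest observed neighbours, and that a KPI curve is a plain list with `None` marking missing points. The function names are hypothetical and are not taken from the NetEase platform.

```python
from typing import List, Optional

Series = List[Optional[float]]


def linear_interpolate(values: Series) -> Series:
    """Fill interior gaps on a straight line between the nearest known points.

    Leading and trailing gaps have only one known neighbour and stay None.
    """
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            frac = (i - left) / span
            out[i] = values[left] + frac * (values[right] - values[left])
    return out


def forward_fill(values: Series) -> Series:
    """Replace each None with the most recent observed value, if any."""
    out = []
    last = None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale values into [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical online-player curve with gaps (None = missing sample).
players = [100.0, None, 120.0, None, None, 150.0, None]
filled = forward_fill(linear_interpolate(players))  # trailing gap is forward-filled
scaled = min_max_normalize(filled)
print(filled)
print(scaled)
```

Interpolating first and forward-filling second keeps interior gaps smooth while still covering a gap at the end of the window, where no right-hand neighbour exists.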
Performance Metrics
Performance metrics (e.g., CPU usage) are unsuitable for supervised models because of their scale and diversity, so unsupervised models are used instead. Anomalies are classified into four types: spike, drift, high-frequency, and linear-trend.
Spike: Detected using differencing and the Spectral Residual (SR) algorithm.
Drift: Detected after STL decomposition with mean-shift and robust regression.
High-Frequency: Detected via multi-step differencing.
Linear Trend: Detected using STL decomposition followed by linear regression and the Mann-Kendall (MK) test; used for memory-leak monitoring.
All models incorporate periodic suppression to avoid false alarms, achieving 80%+ recall for unsupervised detection.
Text Data (Log Analysis)
Massive daily log volumes (up to 10k+ entries) make anomaly detection challenging. Intelligent log analysis uses big data and AI to classify logs, extract templates (using the Drain and Spell algorithms), and detect anomalies when template counts deviate from their usual levels.
Template-based anomaly detection compares current template counts against historical distributions, reduces noise from minor fluctuations, and automatically selects the top-N most impactful log categories.
Log classification also aids log governance by identifying and pruning ineffective log statements.
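The count-deviation idea can be sketched as follows. Note the assumptions: the template extraction here is a crude regex masking of numbers and hex values, used as a stand-in and not an implementation of Drain or Spell; the z-score rule, the floor on the standard deviation, and all function names are illustrative choices rather than details from the source.

```python
import re
from collections import Counter
from statistics import mean, pstdev
from typing import Iterable, List


def to_template(line: str) -> str:
    """Crude stand-in for template extraction: mask hex values and numbers."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<*>", line)
    return re.sub(r"\d+(?:\.\d+)*", "<*>", line)


def template_counts(lines: Iterable[str]) -> Counter:
    """Count how often each template occurs in one time window."""
    return Counter(to_template(line) for line in lines)


def anomalous_templates(history: List[Counter], current: Counter,
                        z_threshold: float = 3.0, top_n: int = 3) -> List[str]:
    """Rank templates whose current count deviates most from past windows.

    `history` must contain at least one past window. The standard deviation
    is floored at 1.0 so that minor fluctuations do not raise alerts.
    """
    templates = set(current)
    for window in history:
        templates |= set(window)
    scored = []
    for template in templates:
        past = [window.get(template, 0) for window in history]
        z = abs(current.get(template, 0) - mean(past)) / max(pstdev(past), 1.0)
        if z > z_threshold:
            scored.append((z, template))
    scored.sort(reverse=True)
    return [template for _, template in scored[:top_n]]


# Hypothetical windows: timeouts surge while logins stay at their usual level.
history = [
    Counter({"conn timeout to <*>": 2, "login ok user <*>": 100}),
    Counter({"conn timeout to <*>": 3, "login ok user <*>": 98}),
    Counter({"conn timeout to <*>": 1, "login ok user <*>": 102}),
]
current = template_counts(["conn timeout to 10.0.0.5"] * 40 + ["login ok user 7"] * 101)
print(anomalous_templates(history, current))
```

Working on template counts rather than raw lines means a surge of one message type stands out even when each individual line carries a different IP address or user ID.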
Fault Localization
Fault localization follows two stages: pre‑mitigation (rapid information for immediate action) and post‑mitigation (deep root‑cause analysis).
Resource Dimension
Machine: Analyzes recent metrics, scores anomalies, and ranks top-N suspect machines based on timing, severity, and fault impact.
Network/Channel: Uses the Adtributor algorithm to drill down by region and carrier, producing top-N abnormal dimensions.
SaaS: Leverages existing SaaS alerts to aggregate anomaly results.
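The drill-down by region and carrier can be sketched with a simplified version of the Adtributor idea: each element gets an explanatory power (its share of the total change) and a surprise score (a Jensen-Shannon term between its forecast and actual shares). The threshold values, the data layout, and the sample numbers below are assumptions for illustration, and the sketch assumes positive totals and the same element keys in forecast and actual.

```python
import math
from typing import Dict, List, Tuple

Values = Dict[str, Dict[str, float]]  # {dimension: {element: metric value}}


def js_surprise(p: float, q: float) -> float:
    """Jensen-Shannon term between forecast share p and actual share q."""
    m = (p + q) / 2
    if m == 0:
        return 0.0

    def term(x: float) -> float:
        return 0.0 if x == 0 else x * math.log(x / m)

    return 0.5 * (term(p) + term(q))


def adtributor(forecast: Values, actual: Values, t_eep: float = 0.1,
               t_ep: float = 0.67, top_n: int = 3) -> List[Tuple[str, List[str]]]:
    """Return up to top_n (dimension, elements) pairs that best explain a change."""
    candidates = []
    for dim, f_vals in forecast.items():
        a_vals = actual[dim]
        f_total, a_total = sum(f_vals.values()), sum(a_vals.values())
        if a_total == f_total:
            continue  # nothing changed along this dimension
        scored = []
        for elem, f in f_vals.items():
            a = a_vals.get(elem, 0.0)
            ep = (a - f) / (a_total - f_total)  # explanatory power
            scored.append((js_surprise(f / f_total, a / a_total), ep, elem))
        scored.sort(reverse=True)  # most surprising elements first
        chosen, ep_sum, surprise_sum = [], 0.0, 0.0
        for surprise, ep, elem in scored:
            if ep > t_eep:
                chosen.append(elem)
                ep_sum += ep
                surprise_sum += surprise
                if ep_sum > t_ep:  # chosen elements explain enough of the change
                    candidates.append((surprise_sum, dim, chosen))
                    break
    candidates.sort(reverse=True)
    return [(dim, elems) for _, dim, elems in candidates[:top_n]]


# Hypothetical online-player counts: the drop is concentrated in carrier "B".
forecast = {"region": {"north": 500, "south": 500},
            "carrier": {"A": 400, "B": 300, "C": 300}}
actual = {"region": {"north": 400, "south": 400},
          "carrier": {"A": 390, "B": 110, "C": 300}}
print(adtributor(forecast, actual))
```

In this sample the drop is spread evenly across regions, so the region dimension has zero surprise, while carrier "B" alone accounts for most of the change and is ranked first.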
Code
Code issues are identified through log classification and anomaly detection, surfacing top‑N abnormal log templates for rapid debugging.
Human Operation
Human-initiated changes are linked to fault events: change events are correlated with the faults they precede, so that similar changes can trigger proactive alerts.
Historical Fault
Historical fault similarity is measured using the Tanimoto coefficient; the top-N past faults with the highest similarity are recommended as probable root causes.
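For sets, the Tanimoto coefficient is the size of the intersection divided by the size of the union. The sketch below assumes each fault is represented as a set of symptom tags; that representation, the fault IDs, and the `min_score` cutoff are hypothetical, since the source does not say how faults are encoded.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[str], b: Set[str]) -> float:
    """Tanimoto coefficient for sets: |A & B| / |A | B| (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similar_faults(current: Set[str], history: Dict[str, Set[str]],
                   top_n: int = 3, min_score: float = 0.5) -> List[Tuple[str, float]]:
    """Return up to top_n past faults ranked by similarity to the current one."""
    scored = [(tanimoto(current, tags), fault_id) for fault_id, tags in history.items()]
    scored = [item for item in scored if item[0] >= min_score]
    scored.sort(reverse=True)
    return [(fault_id, round(score, 3)) for score, fault_id in scored[:top_n]]


# Hypothetical symptom tags for the current incident and three past faults.
current = {"online_drop", "packet_loss", "carrier_B"}
history = {
    "F-101": {"online_drop", "packet_loss", "carrier_B", "cpu_high"},
    "F-102": {"disk_full", "cpu_high"},
    "F-103": {"online_drop", "packet_loss"},
}
print(similar_faults(current, history))
```

Faults that share no symptoms score 0 and are filtered out, so only past incidents with substantial overlap are recommended.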
The overall fault-localization workflow detects an incident, then analyzes resources, code, human actions, and historical faults to pinpoint the root cause. For example, a drop in online players might be traced to a network issue on a specific machine.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
