How NetEase Games Built an AIOps Platform to Transform IT Operations

This article explains how NetEase Games leveraged AI, big data, and machine learning to create an AIOps platform that automates anomaly detection, log analysis, and fault localization, improving quality assurance, cost management, and operational efficiency across complex gaming infrastructures.

Efficient Ops

According to Gartner, AIOps integrates big data and machine learning to process the growing volume, variety, and velocity of IT data, supporting quality assurance, cost management, and efficiency improvement in IT operations.

NetEase Games AIOps Roadmap

Since 2016, NetEase Games has built up an intelligent monitoring team and platform. Its features include anomaly detection, prediction, correlation analysis, drill-down analysis, log analysis, operation robots, fault localization, fault warning, flame-graph analysis, hardware prediction, and CDN file release.

Anomaly Detection

Anomaly detection is a foundational AIOps capability: AI algorithms automatically identify outliers in monitoring data. Compared with traditional threshold-based methods, it is easier to configure, more accurate, covers more metrics, and updates itself automatically.

Business Golden Metrics

These metrics (e.g., online player count) have strong periodicity, low variance, and strict precision/recall requirements, making supervised models attractive despite limited labeled data.

Sample Construction : Samples are drawn from historical KPI datasets and online user annotations, using unsupervised detectors like Isolation Forest to generate candidate anomalies, followed by manual labeling and stratified sampling.

Preprocessing : Includes curve classification via LSTM+CNN, missing-value handling with linear interpolation and forward fill, and min-max normalization, generating ~500 features for modeling.

Algorithm Model : Common models such as Random Forest, XGBoost, and GBDT are combined with Logistic Regression for ensemble detection.

Visualization : Provides graphic alerts, quick annotation links, and anomaly views to streamline confirmation and labeling.
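The missing-value and normalization steps of the preprocessing stage above can be sketched in plain Python. This is a minimal illustration, not the platform's code: the function names are hypothetical, and the choice to interpolate interior gaps while forward-filling (or back-filling) the edges is an assumption about how the two fill methods are combined.

```python
from typing import List, Optional


def fill_missing(values: List[Optional[float]]) -> List[float]:
    """Interpolate interior gaps linearly; copy the nearest observation at the edges."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed points")
    filled = list(values)
    # Leading gap: nothing earlier to fill from, so copy the first observation back.
    for i in range(known[0]):
        filled[i] = values[known[0]]
    # Interior gaps: straight line between the two neighbouring observations.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    # Trailing gap: forward-fill the last observation.
    for i in range(known[-1] + 1, len(values)):
        filled[i] = values[known[-1]]
    return filled


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale values into [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    raw = [None, 10.0, None, None, 40.0, None]
    print(min_max_normalize(fill_missing(raw)))
```

Normalizing after filling keeps the filled points inside the observed range, so they cannot stretch the scale.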

Performance Metrics

Performance metrics (e.g., CPU usage) are unsuitable for supervised models due to scale and diversity; unsupervised models are used instead, classifying anomalies into spike, drift, high‑frequency, and linear‑trend types.

Spike : Detected using differencing and the Spectral Residual (SR) algorithm.

Drift : Detected after STL decomposition with mean‑shift and robust regression.

High‑Frequency : Detected via multi‑step differencing.

Linear Trend : Detected using STL decomposition followed by linear regression and the Mann-Kendall (MK) test; used for memory-leak monitoring.
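As a rough illustration of the trend check in the last item, the Mann-Kendall test can be written in a few lines. The sketch below assumes it is applied directly to a series (the STL decomposition step is omitted), uses the normal approximation without a correction for tied values, and the `has_upward_trend` helper and its 1.96 threshold are hypothetical choices rather than anything stated in the source.

```python
import math
from typing import List, Tuple


def mann_kendall(series: List[float]) -> Tuple[int, float]:
    """Return (S, Z) for the Mann-Kendall trend test, without tie correction."""
    n = len(series)
    s = 0
    # S counts later-minus-earlier pairs that increase, minus those that decrease.
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return s, z


def has_upward_trend(series: List[float], z_crit: float = 1.96) -> bool:
    """Flag a monotonic increase, e.g. steadily growing memory usage."""
    return mann_kendall(series)[1] > z_crit


if __name__ == "__main__":
    memory_mb = [512, 530, 541, 560, 575, 590, 610, 640, 655, 670]
    print(mann_kendall(memory_mb), has_upward_trend(memory_mb))
```

Because the test only looks at the sign of pairwise differences, it is insensitive to the scale of the metric, which suits a fleet with very different memory sizes.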

All models incorporate periodic suppression to avoid false alarms, achieving 80%+ recall for unsupervised detection.
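The differencing half of the spike detector listed above can be sketched as follows. This is a simplified stand-in, assuming a robust z-score on first-order differences; the SR algorithm and the periodic suppression step are not reproduced, and the threshold `k` is a hypothetical default.

```python
from statistics import median
from typing import List


def detect_spikes(series: List[float], k: float = 6.0) -> List[int]:
    """Return indices whose first-order difference is a robust outlier."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    if not diffs:
        return []
    med = median(diffs)
    # Median absolute deviation, scaled to be comparable to a standard deviation.
    mad = median([abs(d - med) for d in diffs])
    scale = 1.4826 * mad
    if scale == 0:
        scale = 1e-9  # avoid division by zero on (nearly) constant series
    return [i + 1 for i, d in enumerate(diffs) if abs(d - med) / scale > k]


if __name__ == "__main__":
    cpu = [10, 11, 10, 12, 11, 50, 11, 10, 12, 11]
    print(detect_spikes(cpu))
```

Note that a single spike yields two flagged indices, the jump up and the return to baseline, so a real pipeline would likely merge adjacent hits into one event.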

Text Data (Log Analysis)

Massive daily logs (up to 10k+ entries) pose challenges for anomaly detection; intelligent log analysis uses big‑data and AI to classify logs, extract templates (using Drain and Spell algorithms), and detect anomalies based on template count deviations.

Template‑based anomaly detection compares historical distributions, reduces noise from minor fluctuations, and automatically selects top‑N impactful log categories.
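A minimal sketch of count-based detection over log templates is shown below. It is an assumption-laden illustration: templates are extracted with a simple regex that masks numbers, which is a stand-in for Drain or Spell, and the thresholds (`k`, `min_delta`, `top_n`) and function names are hypothetical.

```python
import re
from collections import Counter
from statistics import mean, pstdev
from typing import Iterable, List

# Mask integers, decimals, and dotted numbers such as IP addresses.
_NUMBER = re.compile(r"\b\d+(?:\.\d+)*\b")


def to_template(line: str) -> str:
    """Replace variable numeric tokens with a wildcard (not Drain or Spell)."""
    return _NUMBER.sub("<*>", line)


def count_templates(lines: Iterable[str]) -> Counter:
    return Counter(to_template(line) for line in lines)


def abnormal_templates(history: List[Counter], current: Counter,
                       k: float = 3.0, min_delta: float = 5.0,
                       top_n: int = 3) -> List[str]:
    """Rank templates whose count deviates from past windows (history must be non-empty)."""
    templates = set(current)
    for window in history:
        templates.update(window)
    scored = []
    for template in templates:
        past = [window[template] for window in history]  # missing keys count as 0
        delta = abs(current[template] - mean(past))
        if delta < min_delta:
            continue  # ignore minor fluctuations
        if delta > k * pstdev(past):
            scored.append((delta, template))
    scored.sort(reverse=True)
    return [template for _, template in scored[:top_n]]


if __name__ == "__main__":
    past = [count_templates(["conn ok id 7"] * 100 + ["timeout after 30 ms"] * 2)
            for _ in range(3)]
    now = count_templates(["conn ok id 9"] * 101 + ["timeout after 45 ms"] * 40)
    print(abnormal_templates(past, now))
```

The `min_delta` floor is what keeps rare templates with tiny absolute changes from dominating the ranking.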

Log classification also aids log governance by identifying and pruning ineffective log statements.

Fault Localization

Fault localization follows two stages: pre‑mitigation (rapid information for immediate action) and post‑mitigation (deep root‑cause analysis).

Resource Dimension

Machine : Analyzes recent metrics, scores anomalies, and ranks top‑N suspect machines based on timing, severity, and fault impact.

Network/Channel : Uses the Adtributor algorithm to drill down by region and carrier, producing top‑N abnormal dimensions.

SaaS : Leverages existing SaaS alerts to aggregate anomaly results.
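The drill-down by region and carrier mentioned in the Network/Channel item can be illustrated with a simplified version of Adtributor's scoring for an additive metric. Each element gets an explanatory power (its share of the total change) and a surprise (a Jensen-Shannon style divergence between its forecast share and its actual share). The thresholds, type alias, and function names below are hypothetical, and this sketch is not the platform's implementation.

```python
import math
from typing import Dict, List, Tuple

# element -> (forecast value, actual value) for one dimension
Breakdown = Dict[str, Tuple[float, float]]


def js_surprise(p: float, q: float) -> float:
    """Surprise between forecast share p and actual share q."""
    total = 0.0
    for x in (p, q):
        if x > 0:
            total += 0.5 * x * math.log(2 * x / (p + q))
    return total


def explain_dimension(breakdown: Breakdown, t_eep: float = 0.1,
                      t_ep: float = 0.67) -> Tuple[List[str], float]:
    """Pick the most surprising elements that together explain the change."""
    f_total = sum(f for f, _ in breakdown.values())
    a_total = sum(a for _, a in breakdown.values())
    change = a_total - f_total
    if change == 0 or f_total == 0 or a_total == 0:
        return [], 0.0
    scored = []
    for element, (f, a) in breakdown.items():
        ep = (a - f) / change  # explanatory power
        scored.append((js_surprise(f / f_total, a / a_total), ep, element))
    scored.sort(reverse=True)  # most surprising first
    chosen, ep_sum, surprise_sum = [], 0.0, 0.0
    for surprise, ep, element in scored:
        if ep > t_eep:
            chosen.append(element)
            ep_sum += ep
            surprise_sum += surprise
            if ep_sum > t_ep:
                return chosen, surprise_sum
    return [], 0.0  # this dimension does not explain enough of the change


def rank_dimensions(data: Dict[str, Breakdown], top_n: int = 3):
    results = []
    for dim, breakdown in data.items():
        elements, surprise = explain_dimension(breakdown)
        if elements:
            results.append((surprise, dim, elements))
    results.sort(reverse=True)
    return [(dim, elements) for _, dim, elements in results[:top_n]]


if __name__ == "__main__":
    online = {
        "region": {"north": (100, 98), "south": (100, 40), "east": (100, 101)},
        "carrier": {"A": (150, 120), "B": (150, 119)},
    }
    print(rank_dimensions(online))
```

In the sample data the drop is concentrated in one region but spread evenly across carriers, so the region dimension ranks first with a single element.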

Code

Code issues are identified through log classification and anomaly detection, surfacing top‑N abnormal log templates for rapid debugging.

Human Operation

Human‑initiated changes are linked to fault events; change events are correlated with preceding faults to trigger proactive alerts.
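One simple way to link changes to a fault is a time-window join. The sketch below assumes each change record carries a target and a timestamp, and treats any change on an affected target within a fixed window before the fault as a suspect. The `Change` type, field names, and 30-minute window are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Set


class Change(NamedTuple):
    change_id: str
    target: str      # machine or service the change touched
    at: datetime     # when the change was applied


def changes_before_fault(fault_at: datetime, fault_targets: Set[str],
                         changes: List[Change],
                         window: timedelta = timedelta(minutes=30)) -> List[Change]:
    """Return changes on affected targets within `window` before the fault, newest first."""
    hits = [
        c for c in changes
        if c.target in fault_targets and timedelta(0) <= fault_at - c.at <= window
    ]
    return sorted(hits, key=lambda c: c.at, reverse=True)


if __name__ == "__main__":
    log = [
        Change("c1", "host-a", datetime(2024, 1, 1, 12, 0)),
        Change("c2", "host-b", datetime(2024, 1, 1, 12, 20)),
        Change("c3", "host-a", datetime(2024, 1, 1, 12, 25)),
    ]
    for change in changes_before_fault(datetime(2024, 1, 1, 12, 30), {"host-a"}, log):
        print(change.change_id, change.at)
```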

Historical Fault

Similarity to historical faults is measured with the Tanimoto coefficient; the top-N most similar past faults are recommended, and their root causes serve as probable causes of the current fault.
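For set-valued features, the Tanimoto coefficient is the size of the intersection divided by the size of the union. The sketch below assumes each fault is described by a set of string features (for example, names of abnormal metrics or hosts); that representation, the `min_score` cutoff, and the function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[str], b: Set[str]) -> float:
    """|A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similar_faults(current: Set[str], history: Dict[str, Set[str]],
                   top_n: int = 3, min_score: float = 0.3) -> List[Tuple[str, float]]:
    """Return up to top_n past fault ids with similarity >= min_score, best first."""
    scored = [(fault_id, tanimoto(current, features))
              for fault_id, features in history.items()]
    scored = [pair for pair in scored if pair[1] >= min_score]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


if __name__ == "__main__":
    past = {
        "F-1": {"online_drop", "net_loss", "host_b"},
        "F-2": {"cpu_high", "disk_full"},
        "F-3": {"online_drop", "net_loss", "host_a", "cpu_high"},
    }
    print(similar_faults({"online_drop", "net_loss", "host_a"}, past))
```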

The overall fault-localization workflow detects an incident, then analyzes resources, code, human actions, and historical faults to pinpoint the root cause, for example tracing a drop in online players back to a network issue on a specific machine.
