Design and Implementation of an AI‑Driven Intelligent Operations Platform for Game Services
The article presents a comprehensive overview of an AI‑ops platform for game operations, covering its background, roadmap, team structure, business scenarios, anomaly‑detection techniques, platform architecture, detection workflow, model deployment, and intelligent fault‑management strategies.
Background
AIops (intelligent operations) was introduced by Gartner in 2016, originally as Algorithmic IT Operations. It applies machine learning, data‑warehouse and big‑data techniques to operational data (logs, monitoring data and so on) to automate decision‑making, with fully autonomous operations as the eventual goal.
AIops Roadmap
AIops Capability Stages
The enterprise‑level AIops white‑paper defines five maturity stages, which the NetEase Game Intelligent Operations team has validated through practice:
Initial AI experiments with no mature single‑point applications.
Single‑scenario AI‑ops capability delivering internal reusable components.
Process‑oriented AI‑ops by chaining multiple single‑scenario modules for reliable external services.
Fully automated, intervention‑free AI‑ops across major scenarios, offering reliable AI‑ops services.
Core AI hub that balances cost, quality and efficiency to meet lifecycle‑specific KPI targets, achieving multi‑objective optimality.
Since 2018 the team has applied AI to online‑user count, anomaly detection, and log anomaly detection, progressing toward a fully orchestrated, multi‑scenario AI‑ops capability.
Personnel Structure
Compared with traditional DevOps, AIops introduces algorithm engineers, who may also have platform development skills. The team adopts a three‑role model: operations engineers (the platform's users), platform development engineers (who build and maintain the platform), and algorithm engineers (who focus on model research, development and tuning).
Business Domains
Time‑Series Anomaly Detection
Leverages historical trends to predict and detect anomalies without manual thresholds.
Fault Localization and Root‑Cause Analysis
Uses data‑mining to automatically extract fault features and locate root causes in complex micro‑service environments.
Text Processing and Analysis
Provides NLP capabilities such as information extraction, semantic analysis, intelligent search and dialogue systems.
Clustering and Similarity Analysis
Applies supervised or unsupervised clustering to group similar objects and reduce data dimensionality.
Anomaly Detection
Problem
Static thresholds are insufficient for dynamic business scenarios; AIops uses machine‑learning with human‑labeled data to learn adaptive thresholds, improving precision and recall.
Applicable Scenarios
Undefined Anomaly Thresholds
Difficulty distinguishing normal from abnormal data.
Thresholds vary over time.
Sudden spikes may not cross predefined thresholds.
Anomalies manifest as pattern deviations rather than magnitude.
High Manual Configuration Cost
Many curves require individual thresholds.
Business changes demand frequent threshold updates.
Noisy data requires extensive human inspection.
The team adopts an unsupervised statistical‑plus‑rule solution for online models, offering low latency, strong interpretability and minimal dependence on curve granularity.
Spike Anomalies
Transient, non‑periodic spikes that are hard to catch with simple thresholds. First‑order differencing and the Spectral Residual (SR) algorithm help expose them.
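The differencing idea can be sketched as follows. This is a minimal illustration rather than the team's implementation: the function name `detect_spikes`, the cutoff `k`, and the use of a median/MAD robust score on the differences are all assumptions, and the SR algorithm is not shown.

```python
from statistics import median


def detect_spikes(values, k=6.0):
    """Return indices of transient spikes in a KPI series.

    A point is flagged when the jump into it and the jump out of it are both
    large (robust z-score of the first-order difference above k) and point in
    opposite directions, i.e. the value leaves and immediately returns.
    """
    if len(values) < 3:
        return []
    # First-order difference: diffs[i] = values[i + 1] - values[i]
    diffs = [b - a for a, b in zip(values, values[1:])]
    center = median(diffs)
    # Median absolute deviation, scaled to be comparable to a standard deviation
    mad = median(abs(d - center) for d in diffs) * 1.4826
    if mad == 0:
        mad = 1e-9  # flat series: any non-zero jump is significant
    scores = [(d - center) / mad for d in diffs]

    spikes = []
    for i in range(1, len(values) - 1):
        into, out_of = scores[i - 1], scores[i]
        if abs(into) > k and abs(out_of) > k and into * out_of < 0:
            spikes.append(i)
    return spikes
```

Requiring opposite‑signed jumps on both sides keeps a lasting level change from being reported as a spike.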
Surge/Drop Anomalies
Persistent deviations in which KPI values stay elevated or depressed. They are detected by comparing means across windows (mean shift), with STL (seasonal‑trend decomposition using LOESS) and ESD (extreme Studentized deviate) filtering to reduce false positives.
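A window comparison of this kind might look like the sketch below. It is an assumption‑laden simplification: it compares medians rather than means so that a single spike does not move the result, it omits the STL and ESD steps entirely, and the names `level_shift_score`, `is_level_shift`, `window` and `k` are hypothetical.

```python
from statistics import median


def level_shift_score(values, window=5):
    """Robust score for a shift between the last two windows of a series.

    Compares the median of the most recent `window` points with the median of
    the `window` points before them, in units of the earlier window's scaled
    median absolute deviation (MAD).
    """
    if len(values) < 2 * window:
        raise ValueError("need at least 2 * window points")
    before = values[-2 * window:-window]
    after = values[-window:]
    base = median(before)
    mad = median(abs(v - base) for v in before) * 1.4826
    # Floor the spread so a perfectly flat baseline does not divide by zero
    spread = max(mad, 1e-9)
    return (median(after) - base) / spread


def is_level_shift(values, window=5, k=3.0):
    """True when the recent window sits more than k spreads from the previous one."""
    return abs(level_shift_score(values, window)) > k
```

The sign of the score distinguishes a surge (positive) from a drop (negative).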
Frequency Change Anomalies
Continuous high‑frequency spikes identified by counting first‑order differencing anomalies within a sliding window.
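The counting step could be sketched like this. The fixed `jump` size that marks one differencing anomaly, the `window` length and the `min_count` cutoff are illustrative assumptions, not values from the source.

```python
from collections import deque


def detect_frequency_change(values, jump=5.0, window=10, min_count=4):
    """Return indices at which the trailing window holds too many large jumps.

    `jump` is the absolute first-order difference that counts as one
    differencing anomaly; `window` and `min_count` define how many such
    anomalies within the trailing window make a frequency-change anomaly.
    """
    recent = deque(maxlen=window)  # 1 where the jump was anomalous, else 0
    flagged = []
    for i in range(1, len(values)):
        recent.append(1 if abs(values[i] - values[i - 1]) > jump else 0)
        if sum(recent) >= min_count:
            flagged.append(i)
    return flagged
```

An isolated spike contributes only two large jumps (up and back down), so it stays below the count and is left to the spike detector.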
Anomaly Decision Methods
Two decision methods are used. (1) A distribution‑based method (for example 3‑sigma or box‑plot) for instantaneous anomalies. (2) A business count‑threshold method for sustained anomalies, which raises an alert based on how many anomalous points have been counted.
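Both kinds of decision rule can be illustrated in a few lines. This is a sketch under assumptions: the function names and default parameters are hypothetical, and interpreting the count‑based rule as "the last `min_count` points were all flagged" is one plausible reading, not something the source specifies.

```python
from statistics import mean, pstdev, quantiles


def three_sigma_outlier(history, value, k=3.0):
    """True when value lies more than k standard deviations from the history mean."""
    mu, sigma = mean(history), pstdev(history)
    return abs(value - mu) > k * sigma


def boxplot_outlier(history, value, whisker=1.5):
    """True when value falls outside the box-plot whiskers (Tukey fences)."""
    q1, _, q3 = quantiles(history, n=4)
    iqr = q3 - q1
    return value < q1 - whisker * iqr or value > q3 + whisker * iqr


def sustained_alert(flags, min_count=3):
    """True when the last min_count per-point flags are all anomalous."""
    return len(flags) >= min_count and all(flags[-min_count:])
```

The first two functions judge a single point against recent history; the third turns a stream of per‑point flags into an alert only after the anomaly has persisted.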
Model outputs are fused into a 0‑1 score; scores > 0.5 trigger alerts, providing robust cross‑scenario detection.
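The source does not say how the model outputs are fused. Purely as an illustration, the sketch below assumes each model already emits a score in [0, 1] and combines them with a weighted average; the function names and the weighting scheme are hypothetical, and only the 0.5 alert threshold comes from the text.

```python
def fuse_scores(scores, weights=None):
    """Combine per-model scores in [0, 1] into one score in [0, 1].

    A weighted average is used purely for illustration; the source does not
    state the fusion rule.
    """
    if not scores:
        raise ValueError("at least one model score is required")
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    # Clamp each score to [0, 1] before weighting
    weighted = sum(min(max(s, 0.0), 1.0) * weights[name]
                   for name, s in scores.items())
    return weighted / total


def should_alert(scores, weights=None, threshold=0.5):
    """Alert when the fused score exceeds the threshold (0.5 in the text)."""
    return fuse_scores(scores, weights) > threshold
```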
Platform Construction
AIops follows an iterative development model and faces challenges such as blurred boundaries between algorithm and engineering, and tight coupling of algorithm packages with platform releases.
The solution is a five‑layer architecture:
Data Ingestion Layer: Real‑time collection of monitoring, system and business logs via agents.
Data Layer: Stores raw data in HDFS; hot data in TSDB, cold data in Redis; performs ETL and aggregation.
Service Layer: Offline model training, model storage in S3, independent configuration and deployment.
Application Layer: Hosts SaaS‑style functional apps; AIops runs as a real‑time stream processing service.
Presentation Layer: Visualizes detection results as events or data graphs.
Detection Workflow Design
Users create detection tasks via the operations portal; configurations are stored in a Flink rule database. Agents stream metrics to Flink for preprocessing, after which an orchestration module enriches data with model, feature and strategy metadata, dynamically loads the appropriate model, and outputs results.
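The enrichment and dispatch steps of this workflow can be sketched in plain Python. Everything named here (`DetectionTask`, `enrich`, `run_detection`, the dictionary shapes) is hypothetical, no Flink API is used, and a plain dictionary of callables stands in for dynamic model loading.

```python
from dataclasses import dataclass, field


@dataclass
class DetectionTask:
    """One detection task as a user might configure it in the portal."""
    metric: str
    model: str                      # name of the registered model to run
    features: list = field(default_factory=list)
    strategy: dict = field(default_factory=dict)


def enrich(point, tasks):
    """Attach model, feature and strategy metadata to a preprocessed point.

    Returns None when no task is configured for the point's metric.
    """
    task = tasks.get(point["metric"])
    if task is None:
        return None
    return {**point, "model": task.model, "features": task.features,
            "strategy": task.strategy}


def run_detection(point, tasks, models):
    """Enrich a point, pick the model it names, and return the model's result."""
    enriched = enrich(point, tasks)
    if enriched is None:
        return None
    detector = models[enriched["model"]]   # stands in for dynamic model loading
    return detector(enriched)
```

Keeping the task configuration separate from the model code is what lets a user change a strategy without redeploying the algorithm.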
Algorithm models are registered as plug‑in packages stored in S3; the platform loads them on demand using Python class loading, enabling hot‑deployment and version management.
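On‑demand loading of a plug‑in class can be done with the standard `importlib` machinery, roughly as below. This is a sketch under assumptions: the package is taken to be a single Python file already downloaded to a local path (the S3 fetch is omitted), and `load_model_class` and its module‑naming scheme are hypothetical.

```python
import importlib.util
import sys


def load_model_class(path, class_name, version):
    """Load class_name from the Python file at path as an isolated module.

    The version is folded into the module name so two versions of the same
    plug-in can be loaded side by side without overwriting each other.
    """
    module_name = f"aiops_plugin_{class_name}_{version}".replace(".", "_")
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load plug-in from {path}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module      # register before executing, as import does
    spec.loader.exec_module(module)
    return getattr(module, class_name)
```

Because each version gets its own module object, a new model version can be loaded while the old one keeps serving, which is one way to obtain the hot‑deployment behavior described above.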
Historical data retrieval is abstracted: platform engineers implement data adapters, while algorithm modules request data via a standardized protocol, allowing seamless access to both hot and cold storage.
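One way to picture this split of responsibilities is an adapter interface plus a thin client, as sketched below. The class names, the `covers`/`fetch` methods and the "first adapter that covers the range wins" routing rule are assumptions made for illustration; the article does not describe the actual protocol.

```python
from abc import ABC, abstractmethod


class DataAdapter(ABC):
    """Interface a platform engineer implements for one storage backend."""

    @abstractmethod
    def covers(self, start, end):
        """Return True if this backend holds data for [start, end)."""

    @abstractmethod
    def fetch(self, metric, start, end):
        """Return a list of (timestamp, value) pairs for [start, end)."""


class InMemoryAdapter(DataAdapter):
    """Toy backend used here in place of a real hot or cold store."""

    def __init__(self, points, oldest, newest):
        self.points, self.oldest, self.newest = points, oldest, newest

    def covers(self, start, end):
        return self.oldest <= start and end <= self.newest

    def fetch(self, metric, start, end):
        return [(t, v) for t, v in self.points.get(metric, []) if start <= t < end]


class HistoryClient:
    """What an algorithm module calls; it never sees which store answered."""

    def __init__(self, adapters):
        self.adapters = adapters  # ordered from fastest to slowest

    def query(self, metric, start, end):
        for adapter in self.adapters:
            if adapter.covers(start, end):
                return adapter.fetch(metric, start, end)
        raise LookupError(f"no adapter covers [{start}, {end}) for {metric}")
```

Algorithm code depends only on `HistoryClient.query`, so storage backends can be added or swapped without touching any model.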
Intelligent Fault Management
Complex game architectures generate diverse alerts; fault isolation suffers from long trace chains, scattered alerts, and reliance on expert experience.
The proposed approach clusters alerts, correlates them with metrics, logs, tracing and change events, and ranks potential root causes using a scoring system that combines anomaly scores with time decay.
Business‑level SLO metrics are watched by a two‑model surge anomaly detector (mean‑shift and predictive‑mean‑shift). When it detects an anomaly, the system scans machine‑level metrics over a 20‑minute window, computes anomaly scores, applies time decay, and ranks machines and metrics to surface the most likely root cause.
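The scoring and ranking step might be sketched as follows. Only the 20‑minute window comes from the text. The exponential decay, its `half_life`, the choice to weight anomalies closer to the alert more heavily, and the way per‑metric scores are summed per machine are all assumptions, and `rank_root_causes` is a hypothetical name.

```python
import math
from collections import defaultdict
from operator import itemgetter

WINDOW_SECONDS = 20 * 60  # the 20-minute scan window mentioned in the text


def rank_root_causes(observations, alert_time, half_life=300.0):
    """Rank machines and (machine, metric) pairs by time-decayed anomaly score.

    observations: iterable of (machine, metric, timestamp, anomaly_score),
    timestamps in seconds. Only anomalies in the 20 minutes before alert_time
    count. Exponential decay with the given half-life is an assumption.
    """
    metric_scores = defaultdict(float)
    for machine, metric, ts, score in observations:
        age = alert_time - ts
        if not 0 <= age <= WINDOW_SECONDS:
            continue  # outside the scan window
        decay = math.exp(-math.log(2) * age / half_life)
        key = (machine, metric)
        # Keep the strongest decayed score seen for each metric
        metric_scores[key] = max(metric_scores[key], score * decay)

    machine_scores = defaultdict(float)
    for (machine, _), score in metric_scores.items():
        machine_scores[machine] += score

    ranked_machines = sorted(machine_scores.items(), key=itemgetter(1), reverse=True)
    ranked_metrics = sorted(metric_scores.items(), key=itemgetter(1), reverse=True)
    return ranked_machines, ranked_metrics
```

The function returns two lists, machines first and individual metrics second, each sorted from most to least suspicious.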
Future work will detail the integration of log anomaly detection and cross‑SaaS metric correlation.
NetEase Game Operations Platform
The NetEase Game Automated Operations Platform delivers stable services for thousands of NetEase titles, focusing on efficient ops workflows, intelligent monitoring, and virtualization.