How NetEase Scales Game Monitoring to Billions: Architecture, Data, and AI

This article details NetEase's game monitoring system that supports billions of users worldwide, covering global monitoring challenges, a layered observability architecture, massive time‑series processing, visualisation and alerting mechanisms, and intelligent AI‑driven anomaly detection practices.

Efficient Ops

1. Global Game Monitoring Challenges

Traditional game servers were monolithic and ran on a single physical machine. They served the domestic market, and monitoring covered a few simple layers: hardware, network, OS, process, and business metrics. Today, game architectures have diversified into distributed and micro-service designs running on hybrid infrastructure (private clouds, public clouds, and containers), which makes the global monitoring landscape far more complex.

NetEase now operates in dozens of countries, across dozens of regions and multiple cloud providers, which raises significant scaling and reliability challenges. The monitoring stack has evolved from basic alerting to full observability, adding debugging and profiling capabilities.

The current architecture follows a data‑flow pipeline: data is collected at the edge (SDKs, agents, logs, third‑party DBs), routed through regional ingress points, and decoupled via a Kafka queue before being processed centrally.

2. Massive Time‑Series Data Processing

To handle heterogeneous, massive time-series data, NetEase abstracts every monitored object as an Entity that belongs to an EntityType, with attributes expressed as tags. The system manages more than 100 EntityTypes, 5 million entities, and hundreds of millions of time series.
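
The Entity and EntityType abstraction can be pictured with a small data model. This is only a sketch under assumptions: the class fields, the `series_key` helper, and the example tag names are hypothetical and are not taken from NetEase's system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EntityType:
    """A kind of monitored object, e.g. 'host' or 'container' (hypothetical names)."""
    name: str
    tag_keys: tuple  # tag names every entity of this type is expected to carry


@dataclass
class Entity:
    """One monitored object; its attributes live in free-form tags."""
    entity_id: str
    entity_type: EntityType
    tags: dict = field(default_factory=dict)

    def missing_tags(self):
        # Tag keys required by the type but absent on this entity.
        return [k for k in self.entity_type.tag_keys if k not in self.tags]


def series_key(entity, metric):
    """Build a stable identifier for one time series of one entity."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(entity.tags.items()))
    return f"{entity.entity_type.name}/{entity.entity_id}/{metric}{{{tag_part}}}"


host_type = EntityType("host", ("region", "project"))
host = Entity("h-001", host_type, {"region": "eu-west", "project": "demo"})
print(series_key(host, "cpu.usage"))
```

Sorting the tags before building the key means the same entity and metric map to the same series regardless of the order in which tags were attached.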

An Arbiter service subscribes to CMDB changes, generates monitoring configurations, and distributes them to regional nodes. Agents query the Arbiter to discover their region and node list, then maintain long‑lived connections for efficient data transfer and configuration updates.
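
The discovery handshake described above might look roughly like the following in-memory toy. The method names (`on_cmdb_change`, `discover`), the version counter, and the node addresses are all assumptions made for illustration; the real Arbiter's interface is not described in the source.

```python
class Arbiter:
    """Toy in-memory stand-in for the configuration service (hypothetical API)."""

    def __init__(self):
        self.nodes_by_region = {}   # region -> list of regional node addresses
        self.region_by_host = {}    # host id -> region
        self.version = 0            # bumped on every CMDB change

    def on_cmdb_change(self, host_id, region, nodes):
        # Regenerate the (tiny) configuration when the CMDB reports a change.
        self.region_by_host[host_id] = region
        self.nodes_by_region[region] = list(nodes)
        self.version += 1

    def discover(self, host_id):
        # What an agent would ask: "which region am I in, and whom do I connect to?"
        region = self.region_by_host.get(host_id)
        if region is None:
            raise KeyError(f"unknown host: {host_id}")
        return {
            "region": region,
            "nodes": self.nodes_by_region[region],
            "version": self.version,
        }


arbiter = Arbiter()
arbiter.on_cmdb_change("h-001", "eu-west", ["10.0.0.1:9000", "10.0.0.2:9000"])
print(arbiter.discover("h-001"))
```

An agent holding a long-lived connection could compare the returned `version` with the one it last saw to decide whether its configuration needs refreshing.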

Collected data enters a Kafka topic, is pre-processed (cleaning, alignment), and flows into a Flink aggregator that applies user-defined aggregation rules. Processed data is stored in a tiered system: recent data in Redis (a 6-hour cache), medium-term data in MongoDB at 1-minute, 5-minute, 30-minute, and 1-day granularities, and long-term archives in HDFS.
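
Alignment and aggregation to fixed granularities can be illustrated in plain Python. This is a minimal sketch of the idea only: it does not use Kafka or Flink, and the function names and sample points are made up for the example.

```python
from collections import defaultdict


def align(points, step):
    """Snap (timestamp, value) points to the start of their step-second bucket."""
    return [(ts - ts % step, value) for ts, value in points]


def rollup(points, step, agg=max):
    """Aggregate points into step-second buckets using agg (max, min, sum, ...)."""
    buckets = defaultdict(list)
    for ts, value in align(points, step):
        buckets[ts].append(value)
    return [(ts, agg(vals)) for ts, vals in sorted(buckets.items())]


def mean(values):
    return sum(values) / len(values)


# Timestamps are seconds; values are arbitrary sample readings.
raw = [(0, 1.0), (20, 3.0), (61, 5.0), (119, 7.0), (300, 9.0)]
one_min = rollup(raw, 60, mean)    # 1-minute granularity
five_min = rollup(raw, 300, mean)  # 5-minute granularity
print(one_min)
print(five_min)
```

Passing the aggregation function as a parameter mirrors the idea of user-defined rules: the same bucketing logic can produce averages, maxima, or sums.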

3. Data Visualization and Alerting

NetEase provides over 200 custom views (e.g., machine view, container view, K8s‑Pod‑Container hierarchy) that map EntityTypes and Tags into tree‑structured dashboards. Users can define aggregation rules so that higher‑level nodes automatically summarise data from their children.
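
The way a higher-level node summarises its children can be sketched as a recursive roll-up over a tree. The `ViewNode` class, the per-node `agg` rule, and the memory figures below are illustrative assumptions, not the product's actual data model.

```python
class ViewNode:
    """A node in a tree-structured view; leaves carry a metric value."""

    def __init__(self, name, value=None, children=None, agg=sum):
        self.name = name
        self.value = value
        self.children = children or []
        self.agg = agg  # rule used to summarise child values

    def summarise(self):
        # Leaves report their own value; inner nodes aggregate their children.
        if not self.children:
            return self.value
        return self.agg(child.summarise() for child in self.children)


# Hypothetical K8s -> Pod -> Container hierarchy with memory usage in MB.
pod_a = ViewNode("pod-a", children=[ViewNode("c1", 100), ViewNode("c2", 200)])
pod_b = ViewNode("pod-b", children=[ViewNode("c3", 50)])
cluster_total = ViewNode("k8s", children=[pod_a, pod_b], agg=sum)
cluster_peak = ViewNode("k8s", children=[pod_a, pod_b], agg=max)
print(cluster_total.summarise(), cluster_peak.summarise())
```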

Alerting supports threshold, rate‑of‑change, custom anomaly messages, and composite alerts. Templates enable sharing and subscription across projects, reducing configuration effort. A convergence engine merges alerts based on time windows, project, and CMDB‑derived topology to minimise noise.
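
A simplified picture of two of the alert types and of time-window convergence is shown below. It is a sketch under assumptions: alerts are plain dictionaries with `ts` and `project` keys, the merge rule groups only by project and time window, and the CMDB-derived topology that the real engine also uses is left out.

```python
def threshold_alert(value, limit):
    """Fire when the value exceeds a fixed limit."""
    return value > limit


def rate_of_change_alert(previous, current, max_ratio):
    """Fire when the relative change between two samples exceeds max_ratio."""
    if previous == 0:
        return current != 0
    return abs(current - previous) / abs(previous) > max_ratio


def converge(alerts, window):
    """Merge alerts of the same project that fall within `window` seconds
    of the first alert in that project's most recent group."""
    groups = []
    latest_group = {}  # project -> index of its most recent group
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        idx = latest_group.get(alert["project"])
        if idx is not None and alert["ts"] - groups[idx][0]["ts"] <= window:
            groups[idx].append(alert)
        else:
            latest_group[alert["project"]] = len(groups)
            groups.append([alert])
    return groups


alerts = [
    {"ts": 0, "project": "A"},
    {"ts": 30, "project": "A"},
    {"ts": 40, "project": "B"},
    {"ts": 400, "project": "A"},
]
merged = converge(alerts, window=60)
print(len(merged), [len(g) for g in merged])
```

Here four raw alerts collapse into three notifications, because the two alerts for project A that arrive within 60 seconds of each other are merged.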

Notification chains ensure alerts reach on‑call engineers via multiple channels (messaging, email, SMS) with escalation and automatic suppression once an incident is acknowledged.
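
Escalation with suppression after acknowledgement can be modelled as a list of timed steps. The chain layout, the delays, and the recipient names below are hypothetical; the source does not specify how NetEase configures its chains.

```python
def due_steps(chain, raised_at, now, acked_at=None):
    """Return the (channel, recipient) steps that should have fired by `now`.

    chain: list of (delay_seconds, channel, recipient) tuples.
    Steps scheduled at or after the acknowledgement time are suppressed.
    """
    due = []
    for delay, channel, recipient in chain:
        fire_at = raised_at + delay
        if fire_at > now:
            continue  # not yet time for this step
        if acked_at is not None and acked_at <= fire_at:
            continue  # suppressed: acknowledged before this step would fire
        due.append((channel, recipient))
    return due


chain = [
    (0, "messaging", "on-call"),
    (300, "email", "on-call"),
    (900, "sms", "team-lead"),
]
print(due_steps(chain, raised_at=0, now=1000))
print(due_steps(chain, raised_at=0, now=1000, acked_at=200))
```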

4. Intelligent Monitoring Practices

Beyond static thresholds, NetEase employs AI for anomaly detection. Data is extracted, labelled, and pre-processed (resampling, standardisation, de-identification). Over 360 features are used to train unsupervised models (IsolationForest) and supervised models (LSTM, DNN, tree ensembles). An ensemble combining XGBoost feature selection, SVM, RF, GBDT, and logistic-regression stacking achieves about 85% precision on a test set of ten thousand curves.
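
The standardisation step and the precision metric can be written out in a few lines. This sketch covers only those two pieces in plain Python; the three features in `extract_features` are arbitrary examples, not the 360 features mentioned above, and no model training is shown.

```python
import math


def standardise(values):
    """Rescale a curve to zero mean and unit variance (z-scores)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0] * len(values)  # a flat curve has no variation to scale
    return [(v - mean) / std for v in values]


def extract_features(values):
    """A few illustrative features computed on the standardised curve."""
    z = standardise(values)
    return {
        "max_abs_z": max(abs(v) for v in z),
        "last_z": z[-1],
        "last_step": z[-1] - z[-2] if len(z) > 1 else 0.0,
    }


def precision(predicted, actual):
    """Share of predicted anomalies (1) that are true anomalies: TP / (TP + FP)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    return tp / (tp + fp) if (tp + fp) else 0.0


print(extract_features([10, 10, 11, 10, 30]))
print(precision([1, 1, 0, 1], [1, 0, 0, 1]))
```

Precision measures how many raised anomalies were real, which is the quantity that matters most when the goal is to avoid paging engineers for false alarms.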

To avoid cross‑scenario contamination, curves are classified by statistical characteristics (e.g., volatility) and separate models are trained per class. Correlation analysis links alarms to related time‑series: after an alarm triggers, relevant curves are selected via CMDB metadata, correlation scores are computed, and ranked results are presented to operators for root‑cause analysis.
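
Ranking candidate curves against the alarming curve could be done with a correlation score such as Pearson's coefficient. The source does not say which correlation measure is used, so Pearson is an assumption here, as are the function names and the sample curves; the candidate set stands in for curves pre-selected through CMDB metadata.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length curves; 0.0 if either is flat."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def rank_related(alarm_curve, candidates):
    """Sort candidate curves by absolute correlation with the alarming curve."""
    scores = [(name, pearson(alarm_curve, curve)) for name, curve in candidates.items()]
    return sorted(scores, key=lambda s: abs(s[1]), reverse=True)


alarm = [1, 2, 3, 4]
candidates = {
    "cpu": [2, 4, 6, 8],       # moves with the alarm curve
    "free_mem": [4, 3, 2, 1],  # moves against it
    "disk": [1, 1, 1, 1],      # flat, unrelated
    "net": [1, 3, 2, 4],       # loosely related
}
for name, score in rank_related(alarm, candidates):
    print(name, round(score, 3))
```

Ranking by absolute value keeps strongly negative correlations near the top, since a metric that drops when the alarm curve rises is as interesting to an operator as one that rises with it.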

A novel approach treats events as the primary entity, comparing pre‑ and post‑event sub‑sequences of a target curve with random sub‑sequences to infer causal relationships, reducing reliance on direct curve‑to‑curve correlation.
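
One simplified way to picture the event-centred comparison is to measure how much a curve shifts around each event and compare that with shifts at randomly chosen positions. This is a stand-in for illustration, not the method used at NetEase: the mean-shift statistic, the scoring rule, and every name below are assumptions.

```python
import random


def mean(values):
    return sum(values) / len(values)


def shift_at(curve, t, w):
    """Mean of the w points from index t onward minus mean of the w points before it."""
    return mean(curve[t:t + w]) - mean(curve[t - w:t])


def event_impact_score(curve, event_times, w, n_random=200, seed=0):
    """Fraction of random positions whose absolute shift is smaller than the
    average absolute shift around the events. Values near 1.0 suggest the
    events coincide with unusually large changes in the curve.

    Each event index must satisfy w <= t <= len(curve) - w.
    """
    rng = random.Random(seed)
    valid = range(w, len(curve) - w + 1)
    event_shift = mean([abs(shift_at(curve, t, w)) for t in event_times])
    random_shifts = [abs(shift_at(curve, rng.choice(valid), w)) for _ in range(n_random)]
    return sum(s < event_shift for s in random_shifts) / n_random


# A curve that jumps exactly when a (hypothetical) deployment event occurs.
step_curve = [0.0] * 50 + [10.0] * 50
print(event_impact_score(step_curve, event_times=[50], w=5))
```

A high score says only that the events line up with unusual changes in the curve; it is evidence to show an operator, not proof of causation.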

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
