
Dynamic Scaling Practices for NetEase Game Operations on AWS

The article details NetEase's experience designing and implementing dynamic server scaling for overseas games on AWS, comparing Auto Scaling and GameLift, describing a custom scaling platform, the challenges faced, lessons learned, and future directions for cloud‑based game operations.

NetEase Game Operations Platform

In a talk at AWS Game Tech Day Guangzhou, NetEase game operations engineer Du Junchao shares how the team built dynamic scaling solutions for its overseas game servers.

The talk first identifies game types suitable for auto‑scaling, emphasizing titles with clear player‑count peaks and competitive large‑world servers, while noting that traditional MMORPGs are less amenable due to shared logical servers.

Auto Scaling on EC2 is examined next: a launch configuration defines the instance template, and scaling policies can be driven by metrics such as CPU utilization. However, several limitations made it a poor fit for game workloads: thresholds are evaluated against the group‑wide average rather than individual servers, business‑specific metrics are missing, instances are terminated immediately on scale‑in, and CloudWatch caps how metrics can be aggregated.
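
The aggregate-average limitation can be illustrated with a small sketch. The server names, CPU values, and the 70% threshold below are made up for illustration and are not taken from the talk; the point is only that a group average can stay under a threshold while one server is saturated.

```python
from statistics import mean

def average_triggers(cpu_by_server, threshold):
    """Group-average rule: fires only when the mean CPU exceeds the threshold."""
    return mean(cpu_by_server.values()) > threshold

def hot_servers(cpu_by_server, threshold):
    """Per-server view: names of servers whose own CPU exceeds the threshold."""
    return sorted(name for name, cpu in cpu_by_server.items() if cpu > threshold)

# Hypothetical fleet: one busy server, three nearly idle ones.
fleet = {"gs-1": 95.0, "gs-2": 20.0, "gs-3": 25.0, "gs-4": 20.0}

average_fires = average_triggers(fleet, 70.0)  # mean is 40.0, so no scale-out
overloaded = hot_servers(fleet, 70.0)          # gs-1 is overloaded anyway
```

Here the average rule stays silent even though players on `gs-1` are likely to see degraded service, which is the kind of mismatch the talk attributes to average-based thresholds.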

GameLift is then evaluated. It offers game‑specific features like session‑aware scaling, target‑tracking and rule‑based policies, cooldown periods, session protection, and capacity limits. Despite its advantages, the need to embed the GameLift SDK in server code and maintain divergent codebases for domestic and overseas deployments led the team to reject it.
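
The idea behind session protection and capacity limits can be sketched without the GameLift SDK. The code below is not the GameLift API; `Server`, `pick_scale_in`, and the field names are hypothetical, and it only shows the general rule of removing servers that host no sessions while respecting a minimum capacity.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    active_sessions: int  # number of game sessions currently hosted

def pick_scale_in(servers, want_remove, min_capacity):
    """Return names of servers that may be removed.

    Only servers with no active sessions are eligible, and the fleet
    is never reduced below min_capacity.
    """
    removable = max(0, len(servers) - min_capacity)
    idle = [s.name for s in servers if s.active_sessions == 0]
    return idle[:min(want_remove, removable)]

fleet = [Server("gs-a", 0), Server("gs-b", 3), Server("gs-c", 0), Server("gs-d", 0)]
chosen = pick_scale_in(fleet, want_remove=3, min_capacity=2)
```

With a minimum capacity of 2, only two of the three idle servers are selected, and `gs-b` is never a candidate because it still hosts sessions.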

Consequently, NetEase built a custom scaling‑strategy platform that combines ideas from Auto Scaling and GameLift. The architecture includes a monitoring agent on each game server, a Consul cluster tracking server states, a strategy module that evaluates metrics against configurable rules, and an execution module that invokes EC2 APIs via a hybrid‑cloud management layer.
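
A rough sketch of how these four pieces might fit together follows. It is a simplification under stated assumptions: an in-memory `StateStore` stands in for the Consul cluster, `Executor` only records the action it would send to the cloud management layer, and the metric name `players` and all thresholds are hypothetical.

```python
class StateStore:
    """In-memory stand-in for the Consul cluster that tracks server state."""
    def __init__(self):
        self._states = {}

    def report(self, server, metrics):
        # Called by the monitoring agent on each game server.
        self._states[server] = dict(metrics)

    def snapshot(self):
        return dict(self._states)

def strategy(snapshot, max_players_per_server=100, high=0.8, low=0.3):
    """Return +1 to scale out, -1 to scale in, 0 to hold."""
    if not snapshot:
        return 0
    total = sum(m["players"] for m in snapshot.values())
    load = total / (len(snapshot) * max_players_per_server)
    if load > high:
        return 1
    if load < low:
        return -1
    return 0

class Executor:
    """Records actions; a real one would call the cloud management layer."""
    def __init__(self):
        self.calls = []

    def apply(self, delta):
        if delta > 0:
            self.calls.append(("launch", delta))
        elif delta < 0:
            self.calls.append(("terminate", -delta))

store = StateStore()
store.report("gs-1", {"players": 90})
store.report("gs-2", {"players": 85})
executor = Executor()
executor.apply(strategy(store.snapshot()))  # load is 175 / 200, above 0.8
```

Keeping the strategy as a pure function of the state snapshot makes it easy to test rules without touching any cloud API.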

The platform supports flexible rule composition, multi‑metric calculations, custom scaling amounts, and integrates both standard host metrics and game‑specific indicators exposed via APIs.
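
One possible shape for such composable rules is sketched below. The talk does not show the platform's actual rule format, so `Condition`, `Rule`, `decide`, and the metric names `cpu` and `fill_ratio` are all hypothetical; `fill_ratio` stands for a derived, game-specific metric such as players divided by capacity.

```python
import operator
from dataclasses import dataclass

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

@dataclass(frozen=True)
class Condition:
    metric: str
    op: str
    threshold: float

    def holds(self, metrics):
        return OPS[self.op](metrics[self.metric], self.threshold)

@dataclass(frozen=True)
class Rule:
    conditions: tuple  # one or more Condition objects
    mode: str          # "all" (AND) or "any" (OR)
    amount: int        # positive to scale out, negative to scale in

    def matches(self, metrics):
        results = [c.holds(metrics) for c in self.conditions]
        return all(results) if self.mode == "all" else any(results)

def decide(rules, metrics):
    """Return the amount of the first matching rule, or 0 if none match."""
    for rule in rules:
        if rule.matches(metrics):
            return rule.amount
    return 0

rules = [
    # Scale out by 2 only when host CPU and the game-level fill ratio are both high.
    Rule((Condition("cpu", ">", 70), Condition("fill_ratio", ">", 0.8)), "all", 2),
    # Scale in by 1 when servers are mostly empty.
    Rule((Condition("fill_ratio", "<", 0.3),), "all", -1),
]
scale_out = decide(rules, {"cpu": 80, "fill_ratio": 0.9})
```

Because each rule carries its own amount, scale-out and scale-in steps can differ in size, which matches the custom scaling amounts described above.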

Real‑world results show that dynamic scaling aligns server count with player concurrency, reducing costs especially for games with large peak‑valley differences. The team also discusses optimization directions such as predictive scaling, scheduled scaling plans, and trend‑based decision making.
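
To see why the peak-valley difference matters, the following sketch compares server-hours for a fleet sized for peak against a fleet that follows hourly concurrency. The player counts and the capacity of 100 players per server are invented for illustration and are not NetEase's figures.

```python
def servers_needed(players, capacity_per_server):
    """Ceiling division, with at least one server kept running."""
    return max(1, -(-players // capacity_per_server))

def server_hours(hourly_players, capacity_per_server):
    """Return (static, dynamic) server-hours over the given hours.

    static: fleet sized for the peak hour and never changed.
    dynamic: fleet resized every hour to match concurrency.
    """
    peak = servers_needed(max(hourly_players), capacity_per_server)
    static = peak * len(hourly_players)
    dynamic = sum(servers_needed(p, capacity_per_server) for p in hourly_players)
    return static, dynamic

# Hypothetical six-hour profile with a sharp evening peak.
spiky = [200, 100, 100, 400, 1000, 600]
static_hours, dynamic_hours = server_hours(spiky, 100)
savings = 1 - dynamic_hours / static_hours
```

For a flat profile such as `[500] * 6` the two totals are identical, so the benefit depends on how far the valleys fall below the peak.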

Several operational challenges are described: capacity shortages in specific AZs, “ping‑pong” scaling oscillations with high‑core instances, CPU instruction‑set incompatibilities on low‑spec instances, increased failure rates when splitting servers, and instance‑host retirement handling.
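
The ping-pong problem can be made concrete with a small check. The talk does not say how the team resolved it, so the two guards below (a projected-load check and a cooldown) are common damping techniques offered as an assumption, with hypothetical names and numbers.

```python
def safe_to_scale_in(used, n, capacity_per_instance, high):
    """True if removing one instance keeps projected load below the
    scale-out threshold, so the next evaluation will not add it back."""
    if n <= 1:
        return False
    projected = used / ((n - 1) * capacity_per_instance)
    return projected < high

def in_cooldown(now_s, last_action_s, cooldown_s):
    """True if the previous scaling action is too recent to act again."""
    return last_action_s is not None and now_s - last_action_s < cooldown_s

# Two large instances at 44% load: below a 45% scale-in threshold, but
# removing one would push the remaining instance to 88%, above an 80%
# scale-out threshold, so the fleet would oscillate.
oscillates = not safe_to_scale_in(used=88, n=2, capacity_per_instance=100, high=0.8)
```

The effect is stronger with high-core instances because each one is a large fraction of total capacity, so a single step moves the load ratio a long way.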

Key lessons include randomizing AZ placement, monitoring CPU instruction set differences, verifying instance failover migration, and implementing automated fault‑recovery using CloudWatch alarms, SNS notifications, and custom remediation jobs.
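
Randomized AZ placement with fallback on capacity shortage might look like the sketch below. The `launch` callable and `CapacityError` are hypothetical placeholders for whatever the cloud management layer exposes; no real EC2 call is made here.

```python
import random

class CapacityError(Exception):
    """Raised by the (hypothetical) launch callable when an AZ has no capacity."""

def launch_with_az_fallback(zones, launch, rng=None):
    """Try availability zones in random order until one launch succeeds."""
    rng = rng or random.Random()
    order = list(zones)
    rng.shuffle(order)  # avoid always hitting the same AZ first
    for az in order:
        try:
            return launch(az)
        except CapacityError:
            continue  # this AZ is short on capacity; try the next one
    raise CapacityError("no capacity in any candidate AZ")

def fake_launch(az):
    # Test double: pretend one zone is out of capacity.
    if az == "us-east-1a":
        raise CapacityError(az)
    return "i-" + az

instance = launch_with_az_fallback(
    ["us-east-1a", "us-east-1b", "us-east-1c"], fake_launch, random.Random(7)
)
```

Shuffling spreads new instances across zones over time, and the fallback loop keeps a shortage in one zone from blocking a scale-out.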

Future work aims to add predictive scaling, reduce CMDB dependencies for instance lifecycle, and explore container‑based scheduling for scaling.

Written by

NetEase Game Operations Platform

The NetEase Game Automated Operations Platform delivers stable services for thousands of NetEase titles, focusing on efficient ops workflows, intelligent monitoring, and virtualization.
