How Shanda Games Built a Scalable Automated Operations System
This article describes how Shanda Games designed and implemented a comprehensive automated operations platform, covering installation, deployment, security, client and server updates, data analysis, backup, and monitoring, to manage hundreds of games efficiently across diverse hardware and operating systems.
Introduction
Xu Feng, a senior researcher at Shanda Games, introduces his background and the purpose of the talk: to share the design and practice of an automated operations system that addresses the "Why" and "What" of automation.
Why Build an Automated Operations System?
Shanda operates hundreds of games with complex architectures, multiple operating systems (Windows, Linux), and a wide variety of server hardware purchased over many years. Personnel skill levels vary, making a standardized, automated approach essential for efficiency, consistency, and security.
Automation Goals
Completeness – cover all operational needs.
Simplicity – easy to use and understand.
Efficiency – provide timely feedback for batch tasks.
Security – protect the system from takeover.
Subsystem Overview
1. Automated Installation System
Servers are installed via PXE. The installer automatically detects the OS type, installs the required drivers, and applies basic security settings such as firewall rules and disabling Windows file sharing.
2. Automated Operations Platform
The platform serves as the operators' console, handling heterogeneous OS environments and large server fleets. It is browser‑based, uses SSH for both Linux and Windows management, and avoids custom agents to reduce maintenance overhead.
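The agentless approach can be sketched as a thin wrapper around the standard `ssh` client, fanned out over a thread pool. This is a minimal illustration, not Shanda's implementation: the `ops` user name, the timeouts, and the function names are assumptions, and the `runner` parameter exists only so the logic can be exercised without real hosts.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_ssh_argv(host, command, user="ops"):
    """Build the argv for a non-interactive ssh call (user name is hypothetical)."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
            f"{user}@{host}", command]


def run_on_host(host, command, runner=subprocess.run):
    """Run one command on one host; return (host, exit_code, output)."""
    argv = build_ssh_argv(host, command)
    try:
        result = runner(argv, capture_output=True, text=True, timeout=30)
        return host, result.returncode, result.stdout.strip()
    except Exception as exc:  # timeout, missing ssh binary, etc.
        return host, -1, str(exc)


def run_batch(hosts, command, runner=subprocess.run, workers=16):
    """Run the same command on many hosts in parallel, preserving host order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda h: run_on_host(h, command, runner), hosts))
```

Because every target only needs an SSH server, nothing has to be installed or upgraded on the managed machines, which is the maintenance saving the platform relies on.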
3. Automated Security Inspection System
Before files reach players, they undergo virus scanning; server‑side assets are checked via continuous security scans to prevent exposure of vulnerable ports or IPs.
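One way to picture the server-side check is as a comparison between scan results and an allowlist of ports that each host is expected to expose. The sketch below assumes the scan results are already available as a mapping from host to open ports; the addresses, the allowlist, and the function name are hypothetical.

```python
def find_exposures(scan_results, allowed):
    """Return {host: [unexpected open ports]} for hosts that expose more than allowed.

    scan_results: {host: iterable of open ports} from any scanner (assumed input).
    allowed:      {host: set of ports that may be open}; unknown hosts allow nothing.
    """
    exposures = {}
    for host, open_ports in scan_results.items():
        unexpected = set(open_ports) - allowed.get(host, set())
        if unexpected:
            exposures[host] = sorted(unexpected)
    return exposures


# Hypothetical data: 3306 on the first host and everything on the second are flagged.
allowed_ports = {"10.0.0.5": {22, 443}}
scan = {"10.0.0.5": [22, 443, 3306], "10.0.0.9": [8080]}
report = find_exposures(scan, allowed_ports)
```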
4. Automated Client Update System
The system handles peak-time bandwidth spikes (hundreds of gigabits per second) and mitigates issues such as illegal caching by ISPs. It uses a multi-CDN strategy with HTTP 302 redirects to balance traffic, and it delivers small files over HTTPS (code-named "Dorado") so that ISP caches cannot intercept them.
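The multi-CDN redirect can be illustrated with a weighted random choice that produces a 302 response pointing at one of several CDNs. This is only a sketch: the host names and weights are placeholders, and the talk does not say how Shanda actually weighs its CDNs.

```python
import random

# Hypothetical CDN hosts and traffic weights (not from the talk).
CDN_WEIGHTS = {
    "cdn-a.example.com": 5,
    "cdn-b.example.com": 3,
    "cdn-c.example.com": 2,
}


def pick_cdn(weights, rng=random):
    """Pick one CDN host with probability proportional to its weight."""
    hosts = list(weights)
    return rng.choices(hosts, weights=[weights[h] for h in hosts], k=1)[0]


def redirect_response(path, weights=CDN_WEIGHTS, rng=random):
    """Return (status, headers) for a 302 redirect to the chosen CDN."""
    host = pick_cdn(weights, rng)
    return 302, {"Location": f"https://{host}{path}"}
```

Changing a weight shifts traffic between providers without touching the client, and setting a weight to zero takes a CDN out of rotation.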
5. Automated Server‑Side Update System
Adopts a CDN‑like model where target servers download updates from central nodes via cache servers, avoiding P2P due to security and traffic‑control concerns.
6. Automated Data Analysis System
Collects client download logs, aggregates them in a Tomcat cluster, stores results in MongoDB, and visualizes funnel‑style conversion from download to game login, helping identify failures and improve user experience.
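The funnel itself is a simple computation once the logs are aggregated. The sketch below works on in-memory `(client_id, stage)` tuples rather than the Tomcat and MongoDB pipeline described above, and the stage names are assumptions chosen for illustration.

```python
# Hypothetical funnel stages, ordered from first download event to game login.
FUNNEL_STAGES = ["download_start", "download_done", "install_done", "login"]


def funnel(events):
    """Compute per-stage unique client counts and stage-to-stage conversion.

    events: iterable of (client_id, stage) tuples.
    Returns a list of (stage, unique_clients, conversion_from_previous_stage);
    conversion is None for the first stage or when the previous count is zero.
    """
    reached = {stage: set() for stage in FUNNEL_STAGES}
    for client_id, stage in events:
        if stage in reached:
            reached[stage].add(client_id)

    report = []
    previous = None
    for stage in FUNNEL_STAGES:
        count = len(reached[stage])
        rate = None if not previous else count / previous
        report.append((stage, count, rate))
        previous = count
    return report
```

A sharp drop between two adjacent stages points at where clients are failing, for example downloads that start but never complete.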
7. Automated Data Backup System
Moves from scattered FTP‑to‑tape backups to a centralized solution: load‑balanced upload endpoints, MD5 verification, and storage in a Hadoop HDFS cluster (tens of PB) with UDP‑based transfer to tolerate high latency and packet loss.
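The MD5 verification step can be sketched with Python's standard `hashlib`: the uploader sends a checksum alongside the file, and the endpoint recomputes it before accepting the upload. The function names and the chunk size are assumptions; the talk does not describe the actual implementation.

```python
import hashlib


def md5_of_chunks(chunks):
    """Hash an iterable of byte chunks without holding the whole file in memory."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file on disk in 1 MiB chunks (chunk size is an arbitrary choice)."""
    with open(path, "rb") as fh:
        return md5_of_chunks(iter(lambda: fh.read(chunk_size), b""))


def verify_upload(received_chunks, expected_md5):
    """Return True if the received data matches the checksum sent by the uploader."""
    return md5_of_chunks(received_chunks) == expected_md5.strip().lower()
```

MD5 is adequate here for detecting transfer corruption; it is not a defense against deliberate tampering.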
8. Automated Monitoring and Alert System
Monitors IDC link quality, server health, network traffic, system logs, application metrics, and client SDK data. Business‑level indicators such as online player count trigger alerts when thresholds are breached.
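A business-level alert such as the online player count can be expressed as a threshold check over a series of samples. The sketch below adds one assumption that is not in the talk: the alert fires only after several consecutive samples fall below the floor, which avoids paging on a single noisy reading.

```python
def check_online_players(samples, floor, consecutive=3):
    """Return the index of the sample at which an alert fires, or None.

    samples:     online player counts in time order.
    floor:       alert threshold; values strictly below it count as breaches.
    consecutive: number of back-to-back breaches required (an assumption).
    """
    below = 0
    for index, value in enumerate(samples):
        below = below + 1 if value < floor else 0
        if below >= consecutive:
            return index
    return None
```

With `consecutive=1` this reduces to a plain threshold alert that fires on the first breach.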
Summary
The automation effort, spanning from 2000 to the present, emphasizes incremental development, scalability, and leveraging mature protocols rather than reinventing wheels. Small‑to‑medium companies are advised to start with targeted solutions, ensure extensibility, and prioritize practical, proven tools.
Q & A
Q: What software is used for the UDP‑based file transfer?
A: A custom-built tool; commercial options exist but are costly. The tool adapts UDP for file transfer: the sender splits each file into segments, the server collects the fragments as they arrive, and it then requests any missing pieces again, much like HTTP range requests.
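The segment-and-re-request idea from the answer can be simulated without real sockets. In the sketch below, packet loss is modeled by a `drop` callback, and all names, the fragment size, and the retry limit are hypothetical; the actual tool's wire format is not described in the talk.

```python
def split_into_fragments(data, size):
    """Split bytes into numbered fragments: {sequence_number: payload}."""
    return {seq: data[off:off + size]
            for seq, off in enumerate(range(0, len(data), size))}


class Receiver:
    """Collects fragments in any order and reports which ones are missing."""

    def __init__(self, total):
        self.total = total
        self.fragments = {}

    def receive(self, seq, payload):
        self.fragments[seq] = payload  # a duplicate simply overwrites itself

    def missing(self):
        return [s for s in range(self.total) if s not in self.fragments]

    def assemble(self):
        if self.missing():
            raise ValueError("cannot assemble: fragments still missing")
        return b"".join(self.fragments[s] for s in range(self.total))


def transfer(data, size, drop, max_rounds=10):
    """Send all fragments, then resend only what the receiver reports missing.

    drop(seq, round_no) returns True when that fragment is lost in transit.
    Returns (reassembled_bytes, rounds_used).
    """
    fragments = split_into_fragments(data, size)
    receiver = Receiver(len(fragments))
    wanted = list(fragments)
    rounds = 0
    while wanted:
        if rounds >= max_rounds:
            raise TimeoutError("too many retransmission rounds")
        for seq in wanted:
            if not drop(seq, rounds):
                receiver.receive(seq, fragments[seq])
        wanted = receiver.missing()  # the "request missing pieces" step
        rounds += 1
    return receiver.assemble(), rounds
```

Because only the missing fragments are resent, a lossy or high-latency link costs extra rounds for a few pieces rather than stalling the whole transfer.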
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.