
Design and Operational Practices for Game Platform Backend Systems

The article outlines the architecture, distributed design, redundancy, monitoring, automation, and fault‑handling strategies employed in a game company's platform backend to ensure high availability and efficient daily operations.

Architecture Digest

Overview

For many years I have worked in the platform department of a game company, handling the design, development, operation, and maintenance of platform systems such as account, billing, and data services. This article shares the problems we encountered, and the solutions we devised, in system architecture design and daily operations.

System Architecture

Distributed Design

Platform traffic spikes sharply during promotions and in-game events, often exceeding normal concurrency tenfold. To handle this, a distributed approach is applied at three levels: the data layer, the logic layer, and the data centers.

Data Layer

The data layer comprises databases (e.g., MySQL, MongoDB) and caches (e.g., Redis, Memcached). User data is sharded by using the account as a hash key, splitting it across multiple databases and tables, with table counts adjusted to server load. Read-heavy workloads use read-write separation, cache layers, and read-only replicas, while asynchronous queues absorb heavy write bursts.
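As a concrete illustration, account-keyed sharding might look like the following minimal Python sketch. The shard counts and the `shard_for` helper are illustrative assumptions, not the platform's actual code:

```python
import hashlib

DB_COUNT = 4      # number of database instances (illustrative)
TABLE_COUNT = 16  # tables per database (illustrative)

def shard_for(account: str) -> tuple:
    """Map an account to a (database, table) pair via a stable hash.

    A fixed digest (rather than Python's built-in hash(), which is
    salted per process) keeps the mapping stable across restarts.
    """
    digest = hashlib.md5(account.encode("utf-8")).hexdigest()
    bucket = int(digest, 16)
    return bucket % DB_COUNT, (bucket // DB_COUNT) % TABLE_COUNT

db, table = shard_for("player_42")
# All reads and writes for this account go to user_db_{db}.user_table_{table}.
```

Because the mapping is deterministic, any service can locate an account's data without a central lookup; changing `DB_COUNT` or `TABLE_COUNT`, however, requires a data migration, which is why table counts are planned against expected server load.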

Logic Layer

All accesses to basic user data (queries, updates) are routed through a unified middleware service that offers both HTTP and TCP long-connection interfaces and incorporates access-control and permission modules. Core services are exposed to users via load-balancing devices (Citrix NetScaler) with policies that direct traffic appropriately.
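The permission module of such a middleware can be sketched as a per-caller ACL checked before any request reaches the data layer. The service names, actions, and `handle_request` function below are hypothetical:

```python
# Hypothetical ACL: which internal callers may perform which actions
# on basic user data.
ACL = {
    "billing-service": {"user.read", "user.update"},
    "forum-service":   {"user.read"},
}

def handle_request(caller: str, action: str, account: str) -> dict:
    """Check the caller's permissions before touching the data layer."""
    if action not in ACL.get(caller, set()):
        raise PermissionError(f"{caller} is not allowed to perform {action}")
    # ... dispatch to the sharded data layer here ...
    return {"account": account, "action": action, "status": "ok"}
```

Centralizing this check in the middleware means individual game backends never talk to user databases directly, so permissions can be tightened or audited in one place.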

Data Centers

Deploying services across multiple data centers provides disaster recovery and increases load capacity, especially for games served over non-BGP bandwidth, where users connect to the data center closest to their ISP. Basic user data, which is updated infrequently, is synchronized via master-slave database replication; order data is stored in the center where it was created, and cross-center queries are routed by order ID.
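One common way to make order IDs routable across centers is to embed the originating center's ID in the order ID itself, so any center can tell where an order's data lives without a global lookup. The center names and ID format below are illustrative assumptions:

```python
import itertools
import time

# Hypothetical data-center registry.
DC_IDS = {"center-a": 1, "center-b": 2}
_seq = itertools.count()  # per-process sequence to keep IDs unique

def make_order_id(dc: str) -> str:
    """Build an order ID whose first two digits encode the home center."""
    return f"{DC_IDS[dc]:02d}{int(time.time())}{next(_seq):06d}"

def home_center(order_id: str) -> str:
    """Recover the center that stores this order's data."""
    dc_id = int(order_id[:2])
    return {v: k for k, v in DC_IDS.items()}[dc_id]
```

With this scheme, a cross-center order query is just: parse the center from the ID, then forward the request to that center's order service.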

Redundancy

The platform must maintain extremely high availability, because a platform outage takes down every game. Redundancy is considered at both the system level and the data level. System-level redundancy means multiple data centers, each capable of serving all business functions. When a data center fails, DNS is switched over and the faulty IPs are removed from the service list. If the center hosting the primary database goes offline, the platform degrades to read-only for short outages; for longer outages, the primary database is migrated to another center.

System Availability

Availability is ensured by classifying core businesses by importance and defining degradation strategies for each. For example, during critical periods, security checks (e.g., multi-factor authentication, abnormal-login detection) can be temporarily disabled in favor of simple credential login, and offline verification keys are used when platform servers cannot be reached. Payment notifications are reconciled periodically to catch delayed or missing orders, and third-party interfaces are treated as untrusted, with fallback mechanisms and dedicated monitoring.
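The payment-reconciliation step can be sketched as a periodic comparison between the platform's order records and the payment provider's records. The function below is a minimal illustration (order data is modeled as `order_id → amount` maps), not the platform's actual reconciliation job:

```python
def reconcile(platform_orders: dict, provider_orders: dict) -> dict:
    """Compare local orders against the payment provider's records.

    Returns orders missing locally (a notification was delayed or lost),
    orders missing on the provider side, and amount mismatches to flag
    for manual review.
    """
    missing_local = {k: v for k, v in provider_orders.items()
                     if k not in platform_orders}
    missing_remote = {k: v for k, v in platform_orders.items()
                      if k not in provider_orders}
    mismatched = {k: (platform_orders[k], provider_orders[k])
                  for k in platform_orders.keys() & provider_orders.keys()
                  if platform_orders[k] != provider_orders[k]}
    return {"missing_local": missing_local,
            "missing_remote": missing_remote,
            "mismatched": mismatched}
```

Orders in `missing_local` are the delayed or lost notifications mentioned above: the job would replay them, restoring consistency even when the third-party callback never arrived.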

Daily Maintenance: Monitoring

Monitoring covers two aspects: infrastructure (servers, network, ports, resources) and application-level health (service correctness). User-behavior monitoring simulates typical user actions, records response correctness and latency, and aggregates key metrics (order count, login success rate, etc.) to detect anomalies. Interface monitoring checks API responses directly, while log monitoring aggregates server logs and alerts on error spikes or sudden silence. Monitoring servers are themselves deployed redundantly in separate data centers, so observability survives a center failure.
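A user-behavior probe of the kind described above boils down to: run a simulated action, check the response, time it, and alert on either failure or excessive latency. The `probe` helper and its latency budget below are illustrative:

```python
import time

def probe(name, action, expected, latency_budget_s=1.0):
    """Run one simulated user action; report correctness and latency.

    `action` is a zero-argument callable (e.g. a scripted login);
    any exception or wrong response counts as a failed check.
    """
    start = time.monotonic()
    try:
        ok = action() == expected
    except Exception:
        ok = False
    latency = time.monotonic() - start
    return {"check": name,
            "ok": ok,
            "latency_s": round(latency, 3),
            "alert": (not ok) or latency > latency_budget_s}

# e.g. probe("login", lambda: scripted_login("canary_user"), expected="ok")
```

Running such probes from monitoring servers in a different data center than the target service also exercises the real network path a user would take.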

Automation

System updates across distributed servers and data centers are automated with Puppet, which pulls release versions from SVN and deploys them to target machines. Continuous integration includes gray releases: a small subset of users is routed via the load balancers to the new version for validation before full rollout. Automation also extends to routine maintenance scripts, with critical operations requiring manual confirmation to mitigate risk.
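The user-subset selection behind a gray release is typically a deterministic hash of the user ID against a rollout percentage, so the same user always lands on the same version during validation. The percentage and `backend_for` helper below are illustrative, not the platform's actual load-balancer policy:

```python
import zlib

GRAY_PERCENT = 5  # route ~5% of users to the new version (illustrative)

def backend_for(user_id: str) -> str:
    """Deterministically route a small, stable user subset to the new
    release; everyone else stays on the current version."""
    bucket = zlib.crc32(user_id.encode("utf-8")) % 100
    return "new-version" if bucket < GRAY_PERCENT else "stable"
```

Raising `GRAY_PERCENT` step by step (5% → 20% → 100%) turns validation into a gradual rollout, and setting it to 0 is an instant rollback without redeploying.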

Fault Handling: SOP

Standard Operating Procedures (SOPs) document step-by-step actions for routine incidents and disaster recovery, guiding on-site staff and reducing reliance on senior engineers. An example SOP for a primary database switch is illustrated in the accompanying image. SOPs must be reviewed and updated regularly to stay accurate.

Daily Drills

Drills are conducted periodically (roughly every three months) to validate SOP effectiveness and to ensure that all teams interpret and execute the procedures consistently.

Outlook

As the company has evolved from PC to web to mobile games, platform requirements have changed continuously. Embracing new technologies and learning opportunities remains crucial for meeting future demands.

Tags: Backend, Distributed Systems, Monitoring, Automation, Game Platform, Redundancy
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
