Designing a Multi‑Dimensional High‑Availability Architecture for a Game Access System

The article presents a business‑oriented, three‑layer high‑availability architecture for a large‑scale game access platform, detailing measurable goals, client‑side retry with HTTP‑DNS, functional separation and degradation, multi‑region active‑active deployment, and automated, visual monitoring to achieve rapid fault detection, isolation, and recovery.

Architecture Digest

Jul 19, 2016

Discussions of high‑availability (HA) architecture usually focus on system‑level structures such as master‑slave, cluster, or multi‑data‑center designs. Technology alone, however, cannot guarantee 100% uptime: real‑world incidents show that even robust systems fail.

The author distinguishes between "system‑oriented" HA and "business‑oriented" HA, advocating the latter, which aligns HA goals with actual business impact rather than merely technical redundancy.

For Alibaba's Jiuyou game access system, which handles login, registration, payment, and CP verification, the team set a concrete HA target: locate issues within 3 minutes, restore services within 5 minutes, and limit incidents to at most one every two months. The team treats this as equivalent to a four‑nines (99.99%) availability metric.
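The link between the incident targets and the 99.99% figure can be checked with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes a 365‑day year and that each incident costs at most the stated recovery time (5 minutes, or 8 minutes if the 3 minutes spent locating the issue are counted separately).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_budget_minutes(availability: float) -> float:
    """Yearly downtime permitted by an availability target such as 0.9999."""
    return MINUTES_PER_YEAR * (1.0 - availability)


def worst_case_downtime_minutes(incidents_per_year: int,
                                minutes_per_incident: float) -> float:
    """Yearly downtime if every incident lasts the full allowed duration."""
    return incidents_per_year * minutes_per_incident


# One incident every two months means at most 6 incidents per year.
budget = downtime_budget_minutes(0.9999)            # roughly 52.56 minutes
recover_only = worst_case_downtime_minutes(6, 5)    # 30 minutes
locate_plus_recover = worst_case_downtime_minutes(6, 3 + 5)  # 48 minutes

fits_budget = locate_plus_recover <= budget
```

Under either reading of the target, the worst‑case yearly downtime stays inside the four‑nines budget of about 52.6 minutes.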

To meet this target, three sub‑goals were defined: (1) prevent problems as much as possible, (2) quickly locate problems, and (3) quickly recover business, emphasizing business recovery over root‑cause elimination.

The solution includes client‑side retry combined with HTTP‑DNS: clients retry failed requests, while HTTP‑DNS replaces unreliable traditional DNS during incidents, providing fast address updates without the latency of full DNS propagation.
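A minimal sketch of this client‑side behavior follows. It is not the Jiuyou client's actual code: `call_with_retry`, `send`, and `fetch_httpdns_hosts` are hypothetical names, the transport is injected as a callable so that the example stays self‑contained, and it assumes the client consults HTTP‑DNS only after the normally resolved host has failed.

```python
from typing import Callable, List, Optional


class AllEndpointsFailed(Exception):
    """Raised when no candidate address answered within the attempt limit."""


def call_with_retry(
    send: Callable[[str], str],
    primary_host: str,
    fetch_httpdns_hosts: Callable[[], List[str]],
    max_attempts: int = 3,
) -> str:
    """Try the normal host first, then retry on addresses from HTTP-DNS.

    `send` performs one request against a host and raises OSError
    (for example ConnectionError) on failure. `fetch_httpdns_hosts`
    asks an HTTP-based resolver for the current list of addresses.
    """
    candidates: List[str] = [primary_host]
    httpdns_used = False
    last_error: Optional[Exception] = None
    attempts = 0

    while attempts < max_attempts:
        if not candidates:
            if httpdns_used:
                break  # nothing left to try
            # The normal address failed: ask HTTP-DNS for fresh addresses.
            httpdns_used = True
            candidates = [h for h in fetch_httpdns_hosts() if h != primary_host]
            continue
        host = candidates.pop(0)
        attempts += 1
        try:
            return send(host)
        except OSError as exc:
            last_error = exc  # remember the failure and move to the next host

    raise AllEndpointsFailed(
        f"no endpoint answered after {attempts} attempts"
    ) from last_error
```

Capping the number of attempts matters in practice: unbounded client retries can amplify load on a backend that is already struggling.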

Function separation and degradation were introduced: core functions (e.g., login, registration) are isolated from non‑core functions (e.g., messaging, logging) both logically and physically, allowing non‑core services to be disabled instantly via a backend control panel when issues arise.
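The degradation switch can be pictured as a small in‑process registry that a backend control panel would flip. The sketch below is an assumption about the shape of such a mechanism, not the team's implementation; `FeatureSwitches` and `run_if_enabled` are hypothetical names, and the feature names are taken from the examples above.

```python
import threading
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


class FeatureSwitches:
    """Runtime on/off switches for non-core features; core ones stay on."""

    def __init__(self, core: Iterable[str], non_core: Iterable[str]) -> None:
        self._core = frozenset(core)
        self._enabled = {name: True for name in non_core}
        self._lock = threading.Lock()

    def set_enabled(self, name: str, enabled: bool) -> None:
        """Called by the (hypothetical) control panel to degrade or restore."""
        if name in self._core:
            raise ValueError(f"core feature {name!r} cannot be switched off")
        with self._lock:
            if name not in self._enabled:
                raise KeyError(name)
            self._enabled[name] = enabled

    def is_enabled(self, name: str) -> bool:
        if name in self._core:
            return True
        with self._lock:
            return self._enabled.get(name, False)


def run_if_enabled(switches: FeatureSwitches, feature: str,
                   handler: Callable[[], T],
                   fallback: Optional[T] = None) -> Optional[T]:
    """Run the handler, or return the fallback when the feature is degraded."""
    if not switches.is_enabled(feature):
        return fallback
    return handler()


switches = FeatureSwitches(core=["login", "registration"],
                           non_core=["messaging", "logging"])
switches.set_enabled("messaging", False)  # degrade a non-core feature
```

Refusing to switch off core features in code is one design choice among several; it guards against an operator disabling login by mistake during an incident.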

For multi‑active deployment, the team moved from traditional cross‑city or same‑city multi‑data‑center models to a business‑driven approach that uses asynchronous data distribution, secondary reads, and deterministic data generation to achieve eventual consistency while preserving real‑time consistency for critical data.
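The summary does not spell out how these three techniques work, so the following toy model is one possible reading rather than the team's design. It assumes that "asynchronous data distribution" means writes are queued and copied to peer regions later, that a "secondary read" means falling back to a peer region when the local copy has not arrived yet, and that "deterministic data generation" means deriving values with a shared algorithm so any region can reproduce them without replication. All class and function names are hypothetical.

```python
import hashlib
import hmac
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple


class Region:
    """In-memory stand-in for one data center's store."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, str] = {}
        self.outbox: Deque[Tuple[str, str]] = deque()
        self.peers: List["Region"] = []

    def write(self, key: str, value: str) -> None:
        """Write locally and queue the record for later distribution."""
        self.store[key] = value
        self.outbox.append((key, value))

    def distribute(self) -> None:
        """Drain the queue to peers; a real system would do this in background."""
        while self.outbox:
            key, value = self.outbox.popleft()
            for peer in self.peers:
                peer.store[key] = value

    def read(self, key: str) -> Optional[str]:
        """Read locally; on a miss, do a secondary read against the peers."""
        if key in self.store:
            return self.store[key]
        for peer in self.peers:
            value = peer.store.get(key)
            if value is not None:
                self.store[key] = value  # keep a local copy for next time
                return value
        return None


def derive_token(secret: bytes, user_id: str, issued_at: int) -> str:
    """Deterministic value: any region holding the secret computes the same."""
    message = f"{user_id}:{issued_at}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


region_a, region_b = Region("A"), Region("B")
region_a.peers.append(region_b)
region_b.peers.append(region_a)

region_a.write("session:42", "alice")            # not yet distributed to B
value_via_secondary_read = region_b.read("session:42")
```

Conflict resolution for concurrent writes to the same key is deliberately left out of this sketch.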

A multi‑layer ("three‑dimensional"), automated, visual monitoring system was built. It collects metrics from the business layer, application service layer, interface call layer, infrastructure component layer, and hardware layer, so that fault information is displayed immediately and the time to locate an issue drops to the required three minutes.
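One way layered metrics can shorten fault location is to roll threshold breaches up by layer and point the operator at the lowest layer that is unhealthy. The sketch below illustrates that idea only; the heuristic, the `Sample` type, and the metric names are assumptions, not details from the original system.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Layers from top (closest to the user) to bottom, as listed in the article.
LAYERS = [
    "business",
    "application service",
    "interface call",
    "infrastructure component",
    "hardware",
]


@dataclass
class Sample:
    layer: str
    metric: str
    value: float
    threshold: float

    @property
    def breached(self) -> bool:
        return self.value > self.threshold


def deepest_breached_layer(samples: Iterable[Sample]) -> Optional[str]:
    """Return the lowest layer with a breach, as a first place to look.

    Heuristic (an assumption): a fault low in the stack usually surfaces
    in every layer above it, so the lowest breach is the likeliest origin.
    """
    breached = {s.layer for s in samples if s.breached}
    for layer in reversed(LAYERS):
        if layer in breached:
            return layer
    return None


samples = [
    Sample("business", "login_failure_rate", 0.12, 0.01),
    Sample("interface call", "p99_latency_ms", 900.0, 300.0),
    Sample("infrastructure component", "cache_error_rate", 0.30, 0.05),
    Sample("hardware", "cpu_utilization", 0.40, 0.90),
]
suspect = deepest_breached_layer(samples)
```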

Overall, the architecture combines business‑focused HA goals, client‑side resilience, functional isolation, intelligent multi‑region data handling, and comprehensive monitoring to meet the stringent availability requirements of a large‑scale game platform.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
