Designing a Business‑Oriented High Availability Architecture for a Game Access System

The article presents a business‑centric high‑availability solution for a large‑scale game access platform, detailing measurable goals, a three‑dimensional architecture that includes client‑side retry, HTTP‑DNS, functional separation, multi‑region active‑active deployment, and automated, visual monitoring to achieve rapid problem detection, recovery, and minimal outage frequency.

Art of Distributed System Architecture Design

Jul 14, 2016

When discussing high‑availability architecture, most people focus on system‑level designs: turning a single machine into a master‑slave pair, a cluster, or a deployment spread across geographically distributed data centers.

However, purely technical solutions cannot guarantee 100% uptime; real‑world incidents like operator errors or large‑scale attacks still cause outages. Therefore, true high availability must be considered from a business perspective, not just system architecture.

Alibaba’s 9Game access system, responsible for user login, registration, payment, and developer verification, requires extremely high availability because any outage prevents users from playing games, leading to massive complaints.

To achieve this, we abandoned the traditional system‑centric approach and adopted a business‑centric, “three‑dimensional” high‑availability architecture, which is described in the following sections.

1. Business‑Oriented High‑Availability Goals

The industry measures availability in nines (e.g., five nines ≈ 5 minutes of downtime per year), but such figures are hard to grasp and hard to design against. After many discussions, we settled on a quantifiable goal: locate problems within 3 minutes, recover services within 5 minutes, and experience at most one incident every two months, which aligns with a four‑nines availability target.

This goal proved useful for three reasons:

Focuses on business rather than technology, keeping the effort aligned with user impact.

Enables straightforward decomposition of tasks.

Provides a concrete yardstick for evaluating solution feasibility.
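The link between this goal and four nines can be checked with quick arithmetic. The sketch below assumes, pessimistically, that the 3 minutes of locating and the 5 minutes of recovering add up for every incident; the source does not say whether the two windows overlap, and the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability: float) -> float:
    """Downtime allowed per year for a given availability ratio."""
    return MINUTES_PER_YEAR * (1 - availability)


def worst_case_downtime_minutes(incidents_per_year: int,
                                locate_minutes: float,
                                recover_minutes: float) -> float:
    """Yearly downtime if every incident uses the full locate + recover time."""
    return incidents_per_year * (locate_minutes + recover_minutes)


# Four nines allows about 52.56 minutes of downtime per year.
budget = downtime_budget_minutes(0.9999)

# One incident every two months = 6 per year, each up to 3 + 5 minutes.
worst = worst_case_downtime_minutes(6, 3, 5)

print(f"budget={budget:.2f} min, worst case={worst} min, fits={worst <= budget}")
```

Under that assumption the worst case is 48 minutes a year, which stays inside the roughly 52.56 minutes that four nines allows.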

2. Three‑Dimensional High‑Availability Architecture Design

The overall business goal can be split into three sub‑goals:

1. Prevent problems as much as possible

2. Quickly locate problems

3. Quickly restore business

No single system can satisfy all three; a comprehensive, business‑level design is required. The resulting architecture (see Figure 1) combines multiple techniques to achieve the objectives.

The diagram shows that the solution is not a traditional software architecture but a business‑level high‑availability design, where only “active‑active across regions” belongs to classic architecture; the rest are business functions that together ensure overall availability.

3. Client Retry + HTTP‑DNS

When a problem occurs, the fastest mitigation is client‑side retry. However, DNS unreliability can make retries ineffective. Common DNS issues include hijacking, cache poisoning, and long TTLs.

To overcome DNS limitations, we introduced HTTP‑DNS, where the client obtains host addresses via an HTTP API instead of the traditional DNS resolver. Advantages:

1. Full control and fine‑grained routing based on business needs.

2. Rapid updates: clients receive the latest address immediately after a change, so faults can be handled within seconds.

HTTP‑DNS is used only in abnormal scenarios to avoid performance penalties; normal traffic still uses traditional DNS.
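The retry-then-fallback flow described in this section might look roughly like the sketch below. It is an illustration only, not the 9Game client: the resolvers and the sender are injected as plain callables, and the names `request_with_fallback`, `system_dns`, `http_dns`, `send`, and `retries_per_ip` are all hypothetical. The HTTP‑DNS resolver is consulted only after every address from the normal resolver has failed.

```python
from typing import Callable, List, Optional

Resolver = Callable[[str], List[str]]  # host name -> list of IP addresses
Sender = Callable[[str], bool]         # IP address -> True if the request succeeded


def request_with_fallback(host: str,
                          system_dns: Resolver,
                          http_dns: Resolver,
                          send: Sender,
                          retries_per_ip: int = 2) -> Optional[str]:
    """Return the IP that served the request, or None if every attempt failed.

    Addresses from the traditional resolver are tried first. The HTTP-DNS
    resolver is called only when all of them have failed, so the normal
    path pays no extra cost.
    """
    tried = set()
    for resolve in (system_dns, http_dns):
        try:
            addresses = resolve(host)
        except OSError:
            addresses = []  # a failed lookup is treated like an empty answer
        for ip in addresses:
            if ip in tried:
                continue  # do not repeat an address that already failed
            tried.add(ip)
            for _ in range(retries_per_ip):
                try:
                    if send(ip):
                        return ip
                except OSError:
                    continue  # network error: retry, then move to the next IP
    return None


# Stub resolvers and sender, standing in for real network calls.
served_by = request_with_fallback(
    "access.example.invalid",
    system_dns=lambda host: ["10.0.0.1"],
    http_dns=lambda host: ["10.0.0.2"],
    send=lambda ip: ip == "10.0.0.2",  # pretend the DNS-provided address is down
)
```

Injecting the resolvers keeps the retry policy testable without a network; a real client would replace the stubs with its own lookup and request code.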

4. Function Separation + Degradation

To protect core services, we separate core and non‑core functions both logically and physically (different databases, servers, caches). Core functions (login, registration, verification) must always be available, while non‑core functions can be degraded during incidents.

We built an admin tool that disables a non‑core function with a single button click, reducing the “5‑minute recovery” time to a few seconds.
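A one-click degradation switch of this kind can be pictured as a small in-process flag registry. The sketch below is an assumption about its shape, not the actual admin tool: the class `DegradationSwitch`, the helper `call_or_degrade`, and the non-core function name `activity_feed` are hypothetical, while the core function names come from the text above.

```python
import threading
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")

# Core functions named in the article; they can never be switched off.
CORE_FUNCTIONS = frozenset({"login", "registration", "verification"})


class DegradationSwitch:
    """Thread-safe registry of non-core functions that are currently disabled."""

    def __init__(self, core_functions: Iterable[str] = CORE_FUNCTIONS) -> None:
        self._core = frozenset(core_functions)
        self._disabled = set()
        self._lock = threading.Lock()

    def disable(self, name: str) -> None:
        if name in self._core:
            raise ValueError(f"core function {name!r} cannot be degraded")
        with self._lock:
            self._disabled.add(name)

    def enable(self, name: str) -> None:
        with self._lock:
            self._disabled.discard(name)

    def is_enabled(self, name: str) -> bool:
        with self._lock:
            return name not in self._disabled


def call_or_degrade(switch: DegradationSwitch, name: str,
                    handler: Callable[[], T], fallback: T) -> T:
    """Run the handler unless the function is degraded; then return the fallback."""
    if not switch.is_enabled(name):
        return fallback
    return handler()


switch = DegradationSwitch()
switch.disable("activity_feed")  # hypothetical non-core function
feed = call_or_degrade(switch, "activity_feed", lambda: ["item"], fallback=[])
```

In a multi-server deployment the flag would have to live in shared configuration rather than process memory; that part is outside this sketch.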

5. Multi‑Region Active‑Active

Traditional active‑active solutions either use cross‑city replication (eventual consistency) or same‑city high‑cost private networks (strong consistency). Our business requires strong consistency for login, so we adopted a business‑driven approach:

1. Asynchronous distribution – each region independently generates data and asynchronously pushes it to others, achieving eventual consistency.

2. Secondary reads – if a region cannot find data, it reads from another region via service APIs, ensuring real‑time consistency.

3. Duplicate data generation – globally unique identifiers are generated algorithmically, allowing any region to recreate the same data without coordination.

This combination enables any region to take over the entire workload when another region fails.
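The three mechanisms can be illustrated with a toy in-memory model. Everything here is an assumption made for illustration: the `Region` class, the outbox queue standing in for asynchronous distribution, the `local_get` method standing in for a cross-region service API, and the reading of "duplicate data generation" as deriving an identifier deterministically from its inputs.

```python
import hashlib
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple


def deterministic_id(kind: str, user_id: str) -> str:
    """Derive an ID from its inputs so any region computes the same value."""
    digest = hashlib.sha256(f"{kind}:{user_id}".encode("utf-8")).hexdigest()
    return digest[:16]


class Region:
    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, str] = {}
        self.peers: List["Region"] = []
        self.outbox: Deque[Tuple[str, str]] = deque()

    def write(self, key: str, value: str) -> None:
        """Write locally and queue the record for asynchronous distribution."""
        self.store[key] = value
        self.outbox.append((key, value))

    def flush_outbox(self) -> None:
        """Push queued records to peers (run by a background job in practice)."""
        while self.outbox:
            key, value = self.outbox.popleft()
            for peer in self.peers:
                peer.store.setdefault(key, value)

    def local_get(self, key: str) -> Optional[str]:
        """Stand-in for the service API that other regions call."""
        return self.store.get(key)

    def read(self, key: str) -> Optional[str]:
        """Read locally; on a miss, do a secondary read from the peers."""
        value = self.store.get(key)
        if value is not None:
            return value
        for peer in self.peers:
            value = peer.local_get(key)
            if value is not None:
                self.store[key] = value  # keep a local copy for next time
                return value
        return None


region_a, region_b = Region("A"), Region("B")
region_a.peers, region_b.peers = [region_b], [region_a]

# A user registers in region A, then immediately hits region B,
# before the asynchronous push has run.
region_a.write("user:42", "alice")
found = region_b.read("user:42")
```

The secondary read covers the window before replication catches up, which is how a setup with asynchronous replication can still answer a login request with current data.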

6. Three‑Dimensional, Automated, Visual Monitoring

Typical incident response involves manual log inspection, database checks, and ad‑hoc commands, which cannot meet the 3‑minute problem‑location goal. We built a monitoring system that automatically collects, analyzes, and visualizes data across five layers:

Business layer – traffic, success rate, etc.

Application service layer – per‑URI metrics.

Interface call layer – external service latency and error codes.

Infrastructure component layer – containers, databases, caches, queues.

Infrastructure layer – OS, network, CPU, memory.

Automation is achieved with Logstash (log collection), Redis (buffering collected logs), and Elasticsearch (storage and analysis). Visualization presents the collected metrics as dashboards for rapid diagnosis.
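To make the application service layer concrete, the sketch below shows the kind of per‑URI aggregation such a pipeline might compute. It is plain Python standing in for the analysis step, not Logstash or Elasticsearch code; the record fields, the 5xx-counts-as-failure rule, and the 99% alert threshold are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class AccessRecord:
    uri: str
    status: int        # HTTP status code
    latency_ms: float


def summarize(records: Iterable[AccessRecord]) -> Dict[str, Dict[str, float]]:
    """Aggregate request count, success rate, and mean latency per URI."""
    raw = defaultdict(lambda: {"count": 0, "errors": 0, "latency_sum": 0.0})
    for record in records:
        entry = raw[record.uri]
        entry["count"] += 1
        entry["latency_sum"] += record.latency_ms
        if record.status >= 500:  # assumption: only 5xx counts as a failure
            entry["errors"] += 1
    return {
        uri: {
            "count": entry["count"],
            "success_rate": 1 - entry["errors"] / entry["count"],
            "avg_latency_ms": entry["latency_sum"] / entry["count"],
        }
        for uri, entry in raw.items()
    }


def failing_uris(summary: Dict[str, Dict[str, float]],
                 min_success_rate: float = 0.99) -> List[str]:
    """URIs whose success rate dropped below the alert threshold."""
    return sorted(uri for uri, stats in summary.items()
                  if stats["success_rate"] < min_success_rate)


sample = [
    AccessRecord("/login", 200, 20.0),
    AccessRecord("/login", 200, 30.0),
    AccessRecord("/login", 500, 90.0),
    AccessRecord("/login", 200, 20.0),
    AccessRecord("/pay", 200, 50.0),
]
summary = summarize(sample)
alerts = failing_uris(summary)
```

A dashboard built on numbers like these points an on-call engineer to the failing URI directly, instead of leaving them to search raw logs.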

7. Summary

By adopting the three‑dimensional high‑availability architecture, we achieved the original goals:

3‑minute problem location – automated monitoring and scripts identify the root cause quickly.

5‑minute business recovery – failing machines are taken offline, non‑core functions are degraded, or traffic is switched to another region.

At most one incident every two months – separation of core and non‑core functions, rigorous deployment reviews, automated testing, gray releases, and high‑availability components (MySQL, Memcached) further reduce outage frequency.

The project was supported by senior leadership and involved multiple engineers, highlighting the importance of cross‑functional collaboration for high‑availability success.
