Designing a Business‑Oriented High Availability Architecture for a Game Access System
This article presents a business-centric high-availability solution for a large-scale game access platform. It covers measurable goals and a three-dimensional architecture that combines client-side retry, HTTP-DNS, function separation, multi-region active-active deployment, and automated, visual monitoring, so that problems are detected quickly, services recover fast, and outages stay rare.
When discussing high-availability architecture, most people focus on system-level designs such as master-slave, clustering, or multi-data-center setups: turning a single machine into a pair, a pair into a cluster, or one data center into several geographically distributed ones.
However, purely technical solutions cannot guarantee 100% uptime; real‑world incidents like operator errors or large‑scale attacks still cause outages. Therefore, true high availability must be considered from a business perspective, not just system architecture.
Alibaba’s 9Game access system, responsible for user login, registration, payment, and developer verification, requires extremely high availability because any outage prevents users from playing games, leading to massive complaints.
To achieve this, we abandoned the traditional system‑centric approach and adopted a business‑centric, “three‑dimensional” high‑availability architecture, which is described in the following sections.
1. Business‑Oriented High‑Availability Goals
Industry metrics use nines (e.g., five nines ≈ 5 minutes of downtime per year), but these are hard to grasp and hard to use as design guidance. After many discussions, we settled on a quantifiable goal: locate problems within 3 minutes, recover services within 5 minutes, and experience at most one incident every two months, which aligns with a four-nines availability target (about 53 minutes of downtime per year).
This goal proved useful for three reasons:
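The arithmetic behind that claim can be checked with a short sketch. It assumes a 365-day year and reads the goal as six incidents per year; the function names are illustrative only. Two readings are computed: one where the 5-minute recovery window covers the whole outage, and a more pessimistic one where the 3 minutes of locating and the 5 minutes of recovering add up.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines."""
    return MINUTES_PER_YEAR * 10 ** (-nines)


def yearly_downtime_minutes(incidents_per_year: int, minutes_per_incident: float) -> float:
    """Total downtime if every incident lasts the full allowed time."""
    return incidents_per_year * minutes_per_incident


budget_four_nines = allowed_downtime_minutes(4)  # about 52.6 minutes
budget_five_nines = allowed_downtime_minutes(5)  # about 5.3 minutes

# One incident every two months means 6 incidents per year.
optimistic = yearly_downtime_minutes(6, 5)       # recovery window only
pessimistic = yearly_downtime_minutes(6, 3 + 5)  # locate, then recover

print(f"four-nines budget: {budget_four_nines:.1f} min/year")
print(f"goal, optimistic reading: {optimistic} min/year")
print(f"goal, pessimistic reading: {pessimistic} min/year")
```

Under both readings the goal stays inside the four-nines budget, but it would far exceed a five-nines budget.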
Focuses on business rather than technology, keeping the effort aligned with user impact.
Enables straightforward decomposition of tasks.
Provides a concrete yardstick for evaluating solution feasibility.
2. Three‑Dimensional High‑Availability Architecture Design
The overall business goal can be split into three sub‑goals:
1. Prevent problems as much as possible
2. Quickly locate problems
3. Quickly restore business
No single system can satisfy all three; a comprehensive, business‑level design is required. The resulting architecture (see Figure 1) combines multiple techniques to achieve the objectives.
The diagram shows that the solution is not a traditional software architecture but a business‑level high‑availability design, where only “active‑active across regions” belongs to classic architecture; the rest are business functions that together ensure overall availability.
3. Client Retry + HTTP‑DNS
When a problem occurs, the fastest mitigation is client‑side retry. However, DNS unreliability can make retries ineffective. Common DNS issues include hijacking, cache poisoning, and long TTLs.
To overcome DNS limitations, we introduced HTTP‑DNS, where the client obtains host addresses via an HTTP API instead of the traditional DNS resolver. Advantages:
1. Full control and fine‑grained routing based on business needs.
2. Rapid updates: clients receive the latest address immediately after a change, so faults can be handled within seconds.
HTTP‑DNS is used only in abnormal scenarios to avoid performance penalties; normal traffic still uses traditional DNS.
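The retry flow described above can be sketched as follows. This is a minimal illustration, not the 9Game client: the resolver and sender are passed in as plain functions so the logic can run without a network, and the names `request_with_fallback`, `dns_resolve`, `httpdns_resolve`, the host name, and the addresses are all hypothetical.

```python
from typing import Callable, List, Optional

Resolver = Callable[[str], List[str]]  # host name -> candidate addresses
Sender = Callable[[str], str]          # address -> response body


def request_with_fallback(
    host: str,
    send: Sender,
    dns_resolve: Resolver,
    httpdns_resolve: Resolver,
    retries_per_address: int = 2,
) -> str:
    """Try traditional DNS first; consult HTTP-DNS only if that path fails."""
    last_error: Optional[Exception] = None
    # Normal traffic uses DNS. HTTP-DNS is reached only in the abnormal case.
    for resolve in (dns_resolve, httpdns_resolve):
        try:
            addresses = resolve(host)
        except OSError as exc:  # resolution itself failed
            last_error = exc
            continue
        for address in addresses:
            for _ in range(retries_per_address):
                try:
                    return send(address)
                except OSError as exc:  # ConnectionError is a subclass of OSError
                    last_error = exc
    raise ConnectionError(f"all addresses failed for {host}") from last_error


# Demo with fakes: the DNS-provided address is down, the HTTP-DNS one works.
attempts: List[str] = []


def fake_send(address: str) -> str:
    attempts.append(address)
    if address == "10.0.0.1":
        raise ConnectionError("unreachable")
    return f"ok from {address}"


result = request_with_fallback(
    "login.example.com",
    fake_send,
    dns_resolve=lambda host: ["10.0.0.1"],
    httpdns_resolve=lambda host: ["10.0.0.2"],
)
print(result, attempts)
```

In a real client, `httpdns_resolve` would call the HTTP API mentioned above. Its URL and response format are not given in this article, so they are left out here.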
4. Function Separation + Degradation
To protect core services, we separate core and non‑core functions both logically and physically (different databases, servers, caches). Core functions (login, registration, verification) must always be available, while non‑core functions can be degraded during incidents.
We built an admin tool that disables a non‑core function with a single button click, reducing the “5‑minute recovery” time to a few seconds.
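A degradation switch of this kind might look like the sketch below. It is an assumption about the general shape, not the actual admin tool: the class `FeatureSwitches`, the `handle` helper, and the non-core feature names are hypothetical, while the core function names come from the text above.

```python
from typing import Callable, Dict, Iterable


class FeatureSwitches:
    """In-memory on/off switches; core functions can never be turned off."""

    def __init__(self, core: Iterable[str], non_core: Iterable[str]) -> None:
        self._core = frozenset(core)
        self._enabled: Dict[str, bool] = {name: True for name in non_core}

    def degrade(self, name: str) -> None:
        """Disable one non-core function (the 'single button click')."""
        if name in self._core:
            raise ValueError(f"{name} is a core function and cannot be degraded")
        if name not in self._enabled:
            raise KeyError(name)
        self._enabled[name] = False

    def restore(self, name: str) -> None:
        """Re-enable a previously degraded non-core function."""
        if name not in self._enabled:
            raise KeyError(name)
        self._enabled[name] = True

    def is_enabled(self, name: str) -> bool:
        return name in self._core or self._enabled.get(name, False)


def handle(switches: FeatureSwitches, feature: str, handler: Callable[[], str]) -> dict:
    """Run the handler unless the feature has been degraded."""
    if not switches.is_enabled(feature):
        return {"status": 503, "body": "temporarily unavailable"}
    return {"status": 200, "body": handler()}


switches = FeatureSwitches(
    core=["login", "registration", "verification"],
    non_core=["message_center", "gift_pack"],  # hypothetical non-core features
)
switches.degrade("message_center")

degraded = handle(switches, "message_center", lambda: "inbox")
core_ok = handle(switches, "login", lambda: "session created")
print(degraded["status"], core_ok["status"])
```

Because the check is a single lookup per request, flipping the switch takes effect on the next request, which is what brings recovery down to seconds.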
5. Multi‑Region Active‑Active
Traditional active-active solutions either replicate across cities and accept eventual consistency, or stay within one city and pay for costly dedicated networks to obtain strong consistency. Our business requires strong consistency for login, so we adopted a business-driven approach that combines three techniques:
1. Asynchronous distribution – each region independently generates data and asynchronously pushes it to others, achieving eventual consistency.
2. Secondary reads – if a region cannot find data, it reads from another region via service APIs, ensuring real‑time consistency.
3. Duplicate data generation – globally unique identifiers are generated algorithmically, allowing any region to recreate the same data without coordination.
This combination enables any region to take over the entire workload when another region fails.
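The three techniques can be sketched together with in-memory regions. This is an illustration under stated assumptions, not the production design: `Region` and `record_id` are hypothetical, a SHA-256 hash stands in for whatever ID algorithm is actually used, and reading a peer's dictionary directly stands in for the cross-region service API.

```python
import hashlib
from collections import deque
from typing import Any, Deque, Dict, List, Optional, Tuple


def record_id(user_id: str, purpose: str) -> str:
    """Deterministic ID: any region derives the same value from the same inputs."""
    return hashlib.sha256(f"{user_id}:{purpose}".encode()).hexdigest()


class Region:
    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, Any] = {}
        self.peers: List["Region"] = []
        self.outbox: Deque[Tuple[str, Any]] = deque()

    def write(self, key: str, value: Any) -> None:
        """Write locally and queue the record for asynchronous distribution."""
        self.store[key] = value
        self.outbox.append((key, value))

    def flush(self) -> None:
        """Push queued records to peers (would run in the background)."""
        while self.outbox:
            key, value = self.outbox.popleft()
            for peer in self.peers:
                peer.store.setdefault(key, value)

    def read(self, key: str) -> Optional[Any]:
        """Local read, then a secondary read from peers on a miss."""
        if key in self.store:
            return self.store[key]
        for peer in self.peers:
            if key in peer.store:
                self.store[key] = peer.store[key]  # keep a local copy
                return self.store[key]
        return None


east, west = Region("east"), Region("west")
east.peers, west.peers = [west], [east]

sid = record_id("user-42", "login")
east.write(sid, {"user": "user-42"})

# Replication has not run yet, so west falls back to a secondary read.
before_sync = west.read(sid)

# A second record reaches west through asynchronous distribution.
sid2 = record_id("user-7", "login")
east.write(sid2, {"user": "user-7"})
east.flush()
print(before_sync, sid2 in west.store)
```

The secondary read covers the window before replication catches up, and deterministic IDs let a surviving region recreate a record that was never replicated.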
6. Three‑Dimensional, Automated, Visual Monitoring
Typical incident response involves manual log inspection, database checks, and ad‑hoc commands, which cannot meet the 3‑minute detection goal. We built a monitoring system that automatically collects, analyzes, and visualizes data across five layers:
Business layer – traffic, success rate, etc.
Application service layer – per‑URI metrics.
Interface call layer – external service latency and error codes.
Infrastructure component layer – containers, databases, caches, queues.
Infrastructure layer – OS, network, CPU, memory.
Automation is achieved with Logstash (log collection), Redis (caching), and Elasticsearch (storage and analysis). Visualization presents the collected metrics as dashboards for rapid diagnosis.
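The analysis step can be illustrated independently of the Logstash, Redis, and Elasticsearch pipeline. The sketch below assumes each collected event has already been reduced to a layer, a name, and a success flag; those field names, the 99% threshold, and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # (layer, metric name such as a URI or external service)


def success_rates(events: Iterable[dict]) -> Dict[Key, float]:
    """Aggregate events into a success rate per (layer, name)."""
    totals: Dict[Key, int] = defaultdict(int)
    successes: Dict[Key, int] = defaultdict(int)
    for event in events:
        key = (event["layer"], event["name"])
        totals[key] += 1
        if event["ok"]:
            successes[key] += 1
    return {key: successes[key] / totals[key] for key in totals}


def alerts(rates: Dict[Key, float], threshold: float = 0.99) -> List[Key]:
    """Return the metrics whose success rate fell below the threshold."""
    return sorted(key for key, rate in rates.items() if rate < threshold)


events = (
    [{"layer": "application", "name": "/login", "ok": True}] * 3
    + [{"layer": "application", "name": "/login", "ok": False}]
    + [{"layer": "application", "name": "/register", "ok": True}] * 2
    + [{"layer": "interface", "name": "payment_gateway", "ok": True}]
    + [{"layer": "interface", "name": "payment_gateway", "ok": False}]
)

rates = success_rates(events)
firing = alerts(rates)
print(firing)
```

Grouping by layer is what lets a dashboard show whether a drop in the business-level success rate originates in an application URI or in an external interface call.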
7. Summary
By adopting the three‑dimensional high‑availability architecture, we achieved the original goals:
3‑minute problem location – automated monitoring and scripts identify the root cause quickly.
5‑minute business recovery – failing machines are taken offline, non‑core functions are degraded, or traffic is switched to another region.
At most one incident every two months – separating core from non-core functions, rigorous deployment reviews, automated testing, gray releases, and high-availability components (MySQL, Memcached) further reduce outage frequency.
The project was supported by senior leadership and involved multiple engineers, highlighting the importance of cross‑functional collaboration for high‑availability success.