Designing a Business‑Oriented High‑Availability Architecture for Game Access Systems
The article presents a comprehensive, business‑centric high‑availability architecture for a game access platform, detailing measurable goals, a three‑layered design, client‑side retry with HTTP‑DNS, functional separation and degradation, multi‑region active‑active deployment, and automated, visual monitoring to achieve rapid issue detection, recovery, and minimal downtime.
Typically, high‑availability architecture discussions focus on system‑level designs such as master‑slave, cluster, or multi‑data‑center structures, but purely technical solutions cannot guarantee 100% reliability.
Therefore, true high availability must be considered from a business perspective, leading to the concepts of "system‑oriented" and "business‑oriented" high‑availability architectures.
Alibaba's game access system, which handles login, registration, payment, and CP (content provider) verification, requires extremely high availability; any outage severely impacts user experience.
To achieve this, the team abandoned traditional system‑oriented thinking and adopted a "three‑dimensional" business‑oriented high‑availability design, which is described in the following sections.
1. Business‑Oriented High‑Availability Goals
Common industry metrics use "nines" (e.g., five nines ≈ 5 minutes of downtime per year), but these numbers are hard to translate directly into design decisions. After many discussions, the team settled on a quantifiable goal: "Locate issues within 3 minutes, recover business within 5 minutes, and experience at most one incident every two months", which aligns with a four-nine availability target.
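As a rough sanity check, assume every incident consumes the full 5-minute recovery window: at most 6 incidents per year × 5 minutes ≈ 30 minutes of annual downtime, while a 99.99% target allows (1 − 0.9999) × 365 × 24 × 60 ≈ 52.6 minutes per year, so the goal sits comfortably inside the four-nine budget.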
This goal proved useful in practice for three reasons:
Focuses on business rather than technology, preventing misalignment.
Enables straightforward decomposition of tasks.
Provides a concrete criterion for evaluating solution feasibility.
2. Three‑Dimensional High‑Availability Architecture Design
The high‑availability goal is split into three sub‑goals:
1. Minimize incident occurrence
Preventing problems is the primary objective; frequent issues render rapid recovery meaningless.
2. Rapid issue location
Issues must be detected and pinpointed quickly, avoiding long manual investigation times.
3. Rapid business recovery
The emphasis is on restoring service, not necessarily fixing the root cause; for example, taking an unhealthy machine offline can instantly restore traffic.
None of these sub‑goals can be satisfied by a single system architecture; a holistic, business‑driven approach is required.
Taken together, the solution is a business-level high-availability architecture: only the "multi-region active-active" piece resembles a traditional system-level design, while the remaining pieces are complementary measures that together ensure overall availability.
3. Client‑Side Retry + HTTP‑DNS
Client‑side retry can quickly mitigate errors (HTTP 404/500 or business error codes) if the client and server agree on retry semantics.
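As an illustration only (not the team's actual SDK code), a minimal retry wrapper in Python might look like the following; the error classification, backoff values, and names are assumptions:

```python
import time
import requests  # third-party HTTP client, assumed available

RETRYABLE_STATUS = {404, 500, 502, 503}   # 404/500 as in the text; a retry usually lands on another node
RETRYABLE_BIZ_CODES = {"SERVER_BUSY"}     # hypothetical business error codes agreed with the server

def call_with_retry(url, payload, attempts=3, backoff=0.5):
    """POST to the access service, retrying only errors both sides agreed are safe to retry."""
    last_error = None
    for i in range(attempts):
        try:
            resp = requests.post(url, json=payload, timeout=2)
            if resp.status_code == 200:
                body = resp.json()
                if body.get("code") not in RETRYABLE_BIZ_CODES:
                    return body                                   # success, or an error retrying cannot fix
                last_error = body.get("code")
            elif resp.status_code not in RETRYABLE_STATUS:
                return {"code": "HTTP_%d" % resp.status_code}     # not retryable, surface to the caller
            else:
                last_error = resp.status_code
        except requests.RequestException as exc:                  # timeouts, connection resets, etc.
            last_error = exc
        time.sleep(backoff * (2 ** i))                            # exponential backoff before the next attempt
    raise RuntimeError("gave up after %d attempts: %r" % (attempts, last_error))
```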
However, DNS unreliability—such as hijacking, cache poisoning, or long TTLs—limits effectiveness. Examples include hosts file tampering, DNS cache pollution during a root‑server outage, and prolonged DNS caching.
To overcome DNS issues, the team implemented HTTP‑DNS, where the client obtains host addresses via an HTTP API, gaining strong control, fast updates, and the ability to perform fine‑grained traffic steering.
HTTP‑DNS is not a universal replacement for DNS; it is used only in failure scenarios to preserve performance for normal traffic.
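A simplified sketch of how a client might combine the two (the resolver endpoint, field names, and domains are made up; it reuses call_with_retry from the sketch above):

```python
import random
import requests

HTTPDNS_API = "https://httpdns.example.com/resolve"      # hypothetical HTTP-DNS endpoint

def resolve_host(domain):
    """Ask the HTTP-DNS service for server IPs, bypassing the local DNS resolver."""
    resp = requests.get(HTTPDNS_API, params={"domain": domain}, timeout=1)
    resp.raise_for_status()
    return resp.json()["ips"]                             # e.g. ["1.2.3.4", "5.6.7.8"]

def login(payload):
    """Use normal DNS first; fall back to HTTP-DNS only when requests keep failing."""
    try:
        return call_with_retry("https://access.example.com/login", payload)
    except RuntimeError:
        ip = random.choice(resolve_host("access.example.com"))
        # Dial the IP directly and carry the original domain in the Host header.
        # Real clients must also handle TLS certificate/SNI checks here; omitted in this sketch.
        return requests.post("https://%s/login" % ip, json=payload, timeout=2,
                             headers={"Host": "access.example.com"}).json()
```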
4. Functional Separation + Degradation
To avoid reliance on human processes, the team chose a technology‑driven strategy.
They identified core (login, registration, verification) and non‑core (push, logging) functions, physically isolating them at the database, server, and cache layers to prevent interference.
When a fault occurs, non‑core functions can be degraded or disabled, protecting core services. A custom admin tool enables one‑click disabling of a function within seconds, meeting the 5‑minute recovery goal.
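The admin tool itself is not described in the source; as a minimal sketch of the idea, with an in-memory switch registry standing in for whatever the real tool uses and hypothetical feature names:

```python
class DegradationSwitches:
    """On/off switches for non-core functions.

    In production this would be backed by a shared config service so that flipping
    a switch in the admin tool takes effect on every server within seconds.
    """
    def __init__(self):
        self._disabled = set()

    def disable(self, feature):
        self._disabled.add(feature)

    def enable(self, feature):
        self._disabled.discard(feature)

    def is_enabled(self, feature):
        return feature not in self._disabled


switches = DegradationSwitches()

def authenticate(request):
    return {"status": "ok", "user": request.get("user")}     # stand-in for the real core logic

def send_login_push(request):
    print("push notification sent to", request.get("user"))  # stand-in for the real non-core logic

def handle_login(request):
    result = authenticate(request)            # core path: never degraded
    if switches.is_enabled("push"):           # non-core path: skipped while degraded
        send_login_push(request)
    return result

switches.disable("push")                      # operator flips the switch during an incident
print(handle_login({"user": "alice"}))        # login still succeeds; the push is simply skipped
```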
5. Multi‑Region Active‑Active
Traditional active‑active solutions face data consistency challenges across distant data centers.
Two common approaches are cross‑city multi‑data‑center (eventual consistency) and same‑city multi‑data‑center (high cost). Neither fits the game platform's need for real‑time strong consistency.
The team therefore let the business layer control consistency, using three techniques:
1. Asynchronous distribution – data generated in one region is asynchronously propagated to others, ensuring eventual consistency and allowing other regions to take over when one fails.
2. Secondary reads – if a region cannot find a user’s data, it reads from another region via service APIs, achieving real‑time consistency.
3. Duplicate data generation – globally unique data (e.g., IDs) are generated algorithmically so that any region can produce them independently without conflicts, and session-type data can simply be regenerated locally (sketched below).
These measures enable a fully functional failover even when an entire data center is down.
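For the third technique, one common way to let every region mint globally unique IDs without coordination is a Snowflake-style layout that embeds a region identifier; the bit widths below are illustrative, not the team's actual scheme:

```python
import itertools
import time

REGION_ID = 1                     # each data center is assigned its own small number
_sequence = itertools.count()

def new_global_id():
    """Pack milliseconds, region, and a per-process sequence into one integer.

    Because the region bits differ, two regions can generate IDs at the same
    instant without talking to each other and still never collide.
    """
    millis = int(time.time() * 1000)
    seq = next(_sequence) & 0xFFF             # 12-bit sequence; wraps at 4096, fine for a sketch
    return (millis << 17) | (REGION_ID << 12) | seq
```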
6. Three‑Dimensional, Automated, Visual Monitoring
Manual log inspection and ad‑hoc queries cannot meet the 3‑minute issue‑location target.
The solution collects and visualizes data across five layers:
Business layer – traffic, success rates, etc.
Application service layer – per‑URI metrics, response codes, latency.
Interface call layer – external service latency, error codes.
Infrastructure component layer – databases, caches, message queues.
Infrastructure layer – OS, network, CPU, memory.
Automation is achieved with a data collection and analysis system (Logstash → Redis → Elasticsearch) that gathers logs and metrics without human intervention.
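In that pipeline Logstash ships log lines into Redis, which acts as a buffer, and a consumer drains the buffer into Elasticsearch for querying. A bare-bones consumer might look like this (queue and index names are assumptions, and the real system uses Logstash end to end rather than hand-rolled code):

```python
import json
import redis                                    # redis-py client
from elasticsearch import Elasticsearch         # official client, 8.x-style API assumed

r = redis.Redis(host="localhost", port=6379)
es = Elasticsearch("http://localhost:9200")

def drain_once(queue="access-logs", index="access-metrics"):
    """Move whatever is buffered in Redis into Elasticsearch."""
    while True:
        raw = r.lpop(queue)                     # producers are assumed to RPUSH JSON log lines
        if raw is None:
            break                               # buffer is empty
        doc = json.loads(raw)                   # e.g. {"uri": "/login", "status": 200, "rt_ms": 35}
        es.index(index=index, document=doc)     # dashboards and alerts query this index
```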
Visualization presents the collected data as dashboards and charts, enabling quick identification of anomalies and root causes.
7. Summary
The architecture fulfills the original goals:
3‑minute issue location – automated monitoring and layered metrics allow rapid diagnosis.
5‑minute business recovery – taking faulty machines offline, degrading non-core functions, or failing over to another region restores service quickly.
At most one incident every two months – core/non‑core separation, rigorous deployment reviews, automated testing, gray releases, and high‑availability components (MySQL, Memcached) further reduce failure frequency.
Key participants included Wang Jinyin, Wang Lizhi, Nie Yong, Zhan Qingpeng, Li Yunhua, and Li Jun, with strong support from R&D lead Zheng Congwei.
