Building a 3‑Minute Fault Detection, 5‑Minute Recovery HA System for Games

This article explains how Alibaba’s NineGame platform achieved very high availability by shifting from system‑centric to business‑centric design: it defined measurable goals (detect issues within 3 minutes, recover within 5 minutes, at most one incident every two months) and implemented layered, automated, visual monitoring together with client‑side retry, HTTP‑DNS, functional isolation, and a multi‑site active‑active architecture.

21CTO

Jul 30, 2016

High‑availability architecture design is usually discussed at the system level: primary‑backup, cluster, or multi‑data‑center structures, in which a single machine becomes a pair, a pair becomes a cluster, and a single data center becomes several geographically distributed ones.

Alibaba NineGame’s game access system (responsible for login, registration, payment, and CP user verification) requires extremely high availability; any outage prevents users from playing and generates massive complaints.

To achieve high availability, we moved from a system‑centric view to a business‑centric view, creating a “three‑dimensional” HA architecture.

1. Business‑Oriented HA Goals

Industry HA metrics are usually expressed in nines (for example, 5 nines allows roughly 5 minutes of downtime per year), but these figures are hard to relate to day‑to‑day work. We instead set a quantifiable goal: detect issues within 3 minutes, recover business within 5 minutes, and experience at most one incident every two months, which aligns with the 4‑nine target.

Stating the goal this way has three advantages:

It focuses on the business rather than only on technology, which keeps the end result on target.

It can be decomposed top‑down, which makes the required actions clear.

It serves as a clear benchmark during design discussions.
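The link between the business goal and the 4‑nine target can be checked with a little arithmetic. The sketch below is one possible reading, not a calculation taken from the original: it assumes recovery only starts after detection (so a worst‑case incident lasts 3 + 5 = 8 minutes) and that "one incident every two months" means six incidents per year. The function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(nines: int) -> float:
    """Downtime allowed per year for an availability of N nines."""
    unavailability = 10 ** (-nines)  # e.g. 4 nines -> 0.0001
    return MINUTES_PER_YEAR * unavailability


def worst_case_downtime_minutes(detect_min: int, recover_min: int,
                                incidents_per_year: int) -> int:
    """Upper bound on yearly downtime under the business goal.

    Assumption: detection and recovery happen one after the other,
    so their time limits add up for each incident.
    """
    return (detect_min + recover_min) * incidents_per_year


budget = downtime_budget_minutes(4)
worst = worst_case_downtime_minutes(detect_min=3, recover_min=5,
                                    incidents_per_year=6)

print(f"4-nine budget:        {budget:.2f} min/year")
print(f"worst case under goal: {worst} min/year")
print("goal fits the budget:", worst <= budget)
```

Under these assumptions the worst case stays within the 4‑nine budget, which is consistent with the claim that the goal aligns with that target.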

2. Three‑Dimensional HA Architecture Design

The HA goal breaks into three sub‑goals:

1. Prevent problems as much as possible

Without prevention, rapid fixes are meaningless.

2. Quickly locate problems

Fast detection avoids long‑lasting impact.

3. Quickly restore business

Restoration focuses on getting the service back, not necessarily fixing the root cause.

None of these can be satisfied by a single system; a holistic, business‑oriented design is required.

In the resulting business‑level HA architecture, only “geo‑distributed active‑active” belongs to traditional system architecture; the other components work together to ensure overall business availability.

3. Client‑Side Retry + HTTP‑DNS

When a problem occurs, the client is the first to perceive it. Simple client‑side retry can mitigate many errors, but unreliable DNS limits its effectiveness.

Common DNS issues:

Client‑side DNS hijacking or hosts file tampering.

Cache‑polluted DNS servers returning wrong IPs.

Long DNS cache TTLs (minutes to hours).

To overcome these limits, we introduced HTTP‑DNS: the client obtains host addresses through an HTTP API, which gives us full control over resolution and lets address changes take effect quickly. Traditional DNS still serves normal traffic, while HTTP‑DNS is used for fault‑tolerant retries.
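The combination of normal DNS for the first attempt and HTTP‑DNS for retries might look like the following minimal sketch. It is an illustration under assumptions, not NineGame's client code: the resolvers and the sender are passed in as plain functions, and the host name, IP addresses, and the `request_with_fallback` name are all hypothetical.

```python
from typing import Callable, Optional

Resolver = Callable[[str], Optional[str]]  # host -> IP, or None if unresolved
Sender = Callable[[str, str], str]         # (ip, host) -> response body


def request_with_fallback(host: str, send: Sender,
                          resolve_dns: Resolver,
                          resolve_httpdns: Resolver,
                          retries_per_resolver: int = 2) -> str:
    """Try the address from normal DNS first, then retry via HTTP-DNS."""
    last_error: Optional[Exception] = None
    for resolve in (resolve_dns, resolve_httpdns):
        ip = resolve(host)
        if ip is None:
            continue  # this resolver gave no answer; try the next one
        for _ in range(retries_per_resolver):
            try:
                return send(ip, host)
            except ConnectionError as exc:
                last_error = exc  # remember the failure and keep retrying
    raise ConnectionError(f"all attempts failed for {host}") from last_error


# Demo with stubs: normal DNS returns a bad (e.g. hijacked) address,
# while the HTTP-DNS API returns a working one.
attempts = []


def fake_send(ip: str, host: str) -> str:
    attempts.append(ip)
    if ip == "192.0.2.66":
        raise ConnectionError("unreachable")
    return f"ok from {ip}"


result = request_with_fallback(
    "login.example.com", fake_send,
    resolve_dns=lambda host: "192.0.2.66",
    resolve_httpdns=lambda host: "203.0.113.10",
)
print(result)
print(attempts)
```

In a real client, `resolve_httpdns` would call the HTTP‑DNS API by IP address rather than by host name, so that the lookup itself does not depend on the DNS path it is meant to bypass.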

4. Functional Isolation + Degradation

We prioritized technical solutions over process or personnel measures. By separating core (login, registration, verification) from non‑core functions (push, logging) and physically isolating their resources, we protect core services.

When a fault occurs, non‑core functions can be degraded or disabled via a backend management tool, achieving the “5‑minute business recovery” goal.
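A degradation switch of this kind can be sketched as below. This is a simplified, in‑process illustration: the `FeatureSwitch` class and `handle_login` function are hypothetical names, and a real deployment would store the switch state somewhere the backend management tool can change it at runtime.

```python
class FeatureSwitch:
    """Runtime on/off switches for non-core functions."""

    # Core functions named in the article; they can never be switched off.
    CORE = frozenset({"login", "registration", "verification"})

    def __init__(self, non_core):
        self._enabled = {name: True for name in non_core}

    def degrade(self, name: str) -> None:
        """Disable a non-core function (what an operator would trigger)."""
        if name in self.CORE:
            raise ValueError(f"{name} is a core function and cannot be degraded")
        if name not in self._enabled:
            raise KeyError(name)
        self._enabled[name] = False

    def restore(self, name: str) -> None:
        """Re-enable a previously degraded non-core function."""
        if name not in self._enabled:
            raise KeyError(name)
        self._enabled[name] = True

    def is_enabled(self, name: str) -> bool:
        return name in self.CORE or self._enabled.get(name, False)


def handle_login(user: str, switches: FeatureSwitch, push, log) -> str:
    """Core path always runs; non-core side effects are skipped when degraded."""
    token = f"session-for-{user}"
    if switches.is_enabled("push"):
        push(user)
    if switches.is_enabled("logging"):
        log(user)
    return token


switches = FeatureSwitch(non_core=["push", "logging"])
pushed, logged = [], []

switches.degrade("push")  # the push service is misbehaving: turn it off
token = handle_login("alice", switches, pushed.append, logged.append)
print(token, pushed, logged)
```

The point of the design is that the core path never waits on or fails because of a non‑core dependency once that dependency has been switched off.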

5. Geo‑Distributed Active‑Active

Traditional active‑active solutions face data consistency challenges. We adopted a business‑layer approach:

1. Asynchronous distribution – data generated in one site is asynchronously pushed to others, ensuring eventual consistency.

2. Secondary reads – if a site cannot find data locally, it queries another site through an API, so reads stay consistent in real time even when asynchronous distribution lags.

3. Duplicate data generation – globally unique data is generated algorithmically, so any site can recreate the same identifier.
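The three techniques above can be sketched together in a small in‑memory model. Everything here is an assumption made for illustration: the `Site` class, `flush_outbox`, and `deterministic_id` are hypothetical names, dictionaries stand in for each site's database, and direct access to a peer's store stands in for the cross‑site API call. Conflict handling is out of scope.

```python
import hashlib
from collections import deque


class Site:
    """One data center with a local store and an outbox for replication."""

    def __init__(self, name: str):
        self.name = name
        self.store = {}
        self.peers = []
        self.outbox = deque()  # writes not yet pushed to the other sites

    def write(self, key: str, value: str) -> None:
        self.store[key] = value
        self.outbox.append((key, value))  # 1. queue for asynchronous distribution

    def flush_outbox(self) -> None:
        """Push queued writes to peers (a background worker in practice)."""
        while self.outbox:
            key, value = self.outbox.popleft()
            for peer in self.peers:
                peer.store[key] = value

    def read(self, key: str):
        if key in self.store:
            return self.store[key]
        for peer in self.peers:  # 2. secondary read from another site
            if key in peer.store:
                value = peer.store[key]
                self.store[key] = value  # keep a local copy for next time
                return value
        return None


def deterministic_id(*parts: object) -> str:
    """3. Same inputs give the same ID at every site, with no coordination."""
    raw = ":".join(str(p) for p in parts)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


site_a, site_b = Site("site-a"), Site("site-b")
site_a.peers, site_b.peers = [site_b], [site_a]

site_a.write("user:1", "alice")
print(site_b.read("user:1"))      # served by a secondary read before replication

site_a.write("user:2", "bob")
site_a.flush_outbox()
print("user:2" in site_b.store)   # replicated asynchronously

print(deterministic_id("user:1", 1469836800)
      == deterministic_id("user:1", 1469836800))
```

Asynchronous distribution alone leaves a window in which the other site has not yet received a write; the secondary read covers that window, and deterministic generation avoids the need to replicate certain identifiers at all.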

6. Three‑Dimensional, Automated, Visual Monitoring

We built a monitoring system that collects data across five layers: business, application service, interface call, infrastructure component, and host infrastructure. Automation eliminates manual log analysis, and visualization presents key metrics and alerts in real time.

Logstash collects logs, Redis buffers them, and Elasticsearch stores and indexes them for analysis, enabling fast fault localization.
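To illustrate how automated detection inside a 3‑minute window could work on top of such collected events, here is a minimal sliding‑window sketch. It is not the monitoring system described above: the `ErrorRateMonitor` class, the 5% threshold, and the minimum event count are assumptions chosen for the example, and events are fed in directly rather than read from Elasticsearch.

```python
from collections import deque

# The five monitoring layers named in the article.
LAYERS = ("business", "application service", "interface call",
          "infrastructure component", "host infrastructure")


class ErrorRateMonitor:
    """Flags a layer when its error rate over a sliding window is too high."""

    def __init__(self, window_seconds: int = 180, threshold: float = 0.05,
                 min_events: int = 20):
        self.window = window_seconds
        self.threshold = threshold    # assumed alert threshold
        self.min_events = min_events  # avoid alerting on a tiny sample
        self.events = {layer: deque() for layer in LAYERS}

    def _evict(self, queue: deque, now: float) -> None:
        """Drop events that have fallen out of the window."""
        while queue and queue[0][0] <= now - self.window:
            queue.popleft()

    def record(self, layer: str, timestamp: float, ok: bool) -> None:
        queue = self.events[layer]
        queue.append((timestamp, ok))
        self._evict(queue, timestamp)

    def alerts(self, now: float):
        """Return (layer, error_rate) for every layer over the threshold."""
        firing = []
        for layer in LAYERS:
            queue = self.events[layer]
            self._evict(queue, now)
            if len(queue) < self.min_events:
                continue
            error_rate = sum(1 for _, ok in queue if not ok) / len(queue)
            if error_rate > self.threshold:
                firing.append((layer, round(error_rate, 3)))
        return firing


monitor = ErrorRateMonitor()
for second in range(100):
    # Every tenth business request fails in this synthetic stream.
    monitor.record("business", timestamp=second, ok=(second % 10 != 0))

firing_now = monitor.alerts(now=100)
firing_later = monitor.alerts(now=400)  # all events have aged out by then
print(firing_now)
print(firing_later)
```

Keeping a separate window per layer means an alert already says where to start looking, which is what makes fast fault localization possible without manual log analysis.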

7. Summary

By defining clear, measurable HA goals and implementing a three‑dimensional, business‑centric architecture with client‑side retry, HTTP‑DNS, functional isolation, geo‑distributed active‑active, and automated visual monitoring, we consistently achieve:

Issue detection within 3 minutes.

Business recovery within 5 minutes.

At most one incident every two months.

Additional safeguards include deployment reviews, automated testing, gray releases, TCP‑copy pre‑release environments, MySQL and Memcached high‑availability, and more.
