How Alibaba Achieved 3‑Minute Issue Detection and 5‑Minute Recovery for Game Services

This article describes how Alibaba's NineGame access system moved from traditional system‑centric high‑availability designs to a business‑oriented, three‑dimensional architecture that enables 3‑minute problem detection, 5‑minute service restoration, and an average of one outage every two months.

ITFLY8 Architecture Home

Apr 1, 2018

Business‑Oriented High‑Availability Goals

The team replaced the usual "five nines" metric with a more actionable target: locate issues within 3 minutes, restore service within 5 minutes, and experience at most one outage every two months. Taken together, these targets correspond to roughly four‑nines (99.99%) availability.
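
A quick arithmetic check shows why these targets fit inside a four‑nines budget. The sketch below assumes that each outage costs the full 3 minutes of detection plus 5 minutes of recovery, back to back. That worst case is an illustrative assumption, not a figure from the source, and the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability: float) -> float:
    """Downtime allowed per year for a given availability ratio (e.g. 0.9999)."""
    return MINUTES_PER_YEAR * (1.0 - availability)


def worst_case_downtime_minutes(outages_per_year: int,
                                detect_min: float,
                                recover_min: float) -> float:
    """Yearly downtime if every outage uses the whole detection and recovery window."""
    return outages_per_year * (detect_min + recover_min)


# One outage every two months means at most six outages per year.
budget = downtime_budget_minutes(0.9999)
worst_case = worst_case_downtime_minutes(6, detect_min=3, recover_min=5)

print(f"four-nines budget: {budget:.2f} min/year")
print(f"worst case under the targets: {worst_case:.0f} min/year")
```

Under these assumptions the targets allow 48 minutes of downtime per year, which is below the roughly 52.56 minutes that 99.99% availability permits.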

Three‑Dimensional High‑Availability Architecture

The architecture is divided into three sub‑goals: prevent problems, locate them quickly, and recover services rapidly. No single system design can satisfy all three, so a holistic, business‑centric approach is required.

Client‑Side Retry + HTTP‑DNS

Client‑side retries reduce perceived downtime, but unreliable DNS resolution can undermine them, because a retry against the same bad address fails again. The solution is HTTP‑DNS, which lets clients obtain host addresses over HTTP. This gives the team stronger control and faster updates, while traditional DNS remains available for normal traffic.
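
The interplay between retries and name resolution can be sketched as follows. This is a minimal illustration, not the NineGame client: the resolvers and the `send` function are hypothetical stand-ins injected by the caller, and the order in which resolvers are tried is an assumption, since the source does not spell it out.

```python
from typing import Callable, Iterable, List, Optional

Resolver = Callable[[str], List[str]]  # hostname -> candidate IP addresses
Sender = Callable[[str], bool]         # IP address -> True if the request succeeded


def request_with_retry(host: str,
                       resolvers: Iterable[Resolver],
                       send: Sender) -> Optional[str]:
    """Try each resolver in order and return the first IP that served the request."""
    tried = set()
    for resolve in resolvers:
        try:
            addresses = resolve(host)
        except OSError:
            continue  # this resolver is unavailable; move on to the next one
        for ip in addresses:
            if ip in tried:
                continue  # do not retry an address that already failed
            tried.add(ip)
            if send(ip):
                return ip
    return None


# Hypothetical stand-ins: traditional DNS returns a stale address,
# HTTP-DNS returns a fresh one first.
def local_dns(host: str) -> List[str]:
    return ["10.0.0.1"]


def http_dns(host: str) -> List[str]:
    return ["10.0.0.2", "10.0.0.1"]


def send(ip: str) -> bool:
    return ip == "10.0.0.2"  # only the fresh address is healthy


result = request_with_retry("api.example.com", [local_dns, http_dns], send)
print(result)
```

In this demo the first attempt fails against the stale address and the retry succeeds through the HTTP‑DNS answer. Swapping the order of the list reverses which resolver is consulted first.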

Function Separation + Degradation

Core functions (login, registration, verification) are isolated from non‑core functions (push, logging) both logically and physically, preventing resource contention. When failures occur, non‑core services can be degraded or disabled via a simple admin UI, enabling sub‑5‑minute recovery.
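
A degradation switch of this kind can be sketched in a few lines. The class below is a hypothetical in‑memory version: a real deployment would back it with shared configuration behind the admin UI. The rule that core functions cannot be switched off is an assumption added here for safety, not something the source states.

```python
from typing import Callable, Set, TypeVar

T = TypeVar("T")


class DegradationSwitches:
    """In-memory switch board for turning non-core features off and on."""

    def __init__(self, core: Set[str]) -> None:
        self._core = set(core)
        self._disabled: Set[str] = set()

    def disable(self, feature: str) -> None:
        # Assumption: core functions are protected from accidental degradation.
        if feature in self._core:
            raise ValueError(f"{feature} is a core function and cannot be degraded")
        self._disabled.add(feature)

    def enable(self, feature: str) -> None:
        self._disabled.discard(feature)

    def is_enabled(self, feature: str) -> bool:
        return feature not in self._disabled

    def call(self, feature: str, action: Callable[[], T], fallback: T) -> T:
        """Run the action if the feature is enabled, otherwise return the fallback."""
        return action() if self.is_enabled(feature) else fallback


switches = DegradationSwitches(core={"login", "registration", "verification"})
switches.disable("push")  # an operator degrades a non-core function

push_result = switches.call("push", lambda: "sent", fallback="skipped")
login_result = switches.call("login", lambda: "ok", fallback="skipped")
print(push_result, login_result)
```

Because the check is a cheap in‑process lookup, flipping a switch takes effect on the next call, which is what makes a recovery window measured in minutes plausible.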

Multi‑Active Data Centers

Traditional active‑active solutions either sacrifice consistency or require heavy investment. Alibaba’s approach lets the business layer decide between eventual and strong consistency, using asynchronous distribution, secondary reads, and deterministic data generation to achieve cross‑region failover without strict data‑layer constraints.
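
The source names these three techniques without detailing them, so the sketch below is one possible reading rather than Alibaba's implementation. It assumes that asynchronous distribution means queueing local writes for later replication, that a secondary read means asking a peer data center when the local copy is missing, and that deterministic generation means deriving a value from its inputs so any data center can recompute it. All names are hypothetical.

```python
import hashlib
import hmac
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple


class DataCenter:
    """Toy data center with a local store and an asynchronous replication queue."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, str] = {}
        self.outbox: Deque[Tuple[str, str]] = deque()
        self.peers: List["DataCenter"] = []

    def write(self, key: str, value: str) -> None:
        # Write locally and return at once; replication happens later.
        self.store[key] = value
        self.outbox.append((key, value))

    def replicate(self) -> None:
        # Drain the queue to every peer (conflict handling is out of scope here).
        while self.outbox:
            key, value = self.outbox.popleft()
            for peer in self.peers:
                peer.store[key] = value

    def read(self, key: str) -> Optional[str]:
        # Secondary read: on a local miss, ask the peers before giving up.
        value = self.store.get(key)
        if value is not None:
            return value
        for peer in self.peers:
            value = peer.store.get(key)
            if value is not None:
                return value
        return None


def deterministic_token(account_id: str, secret: bytes) -> str:
    """Derive a token from its inputs so every data center computes the same value."""
    return hmac.new(secret, account_id.encode("utf-8"), hashlib.sha256).hexdigest()


dc_a, dc_b = DataCenter("dc-a"), DataCenter("dc-b")
dc_a.peers, dc_b.peers = [dc_b], [dc_a]

dc_a.write("session:42", "alice")
local_before = dc_b.store.get("session:42")  # not replicated yet
via_secondary = dc_b.read("session:42")      # served by the secondary read
dc_a.replicate()
local_after = dc_b.store.get("session:42")   # now present locally

token_a = deterministic_token("42", b"shared-secret")
token_b = deterministic_token("42", b"shared-secret")
print(local_before, via_secondary, local_after, token_a == token_b)
```

The point of the sketch is that the data layer never has to guarantee strong consistency: the business layer covers replication lag with a second read, or avoids replication entirely for data it can regenerate.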

Three‑Dimensional, Automated, Visual Monitoring

Monitoring spans five layers (business, application service, interface call, infrastructure component, and hardware) and collects metrics such as request rates, error codes, and resource usage. Data is gathered automatically via Logstash, cached in Redis, and indexed in Elasticsearch, then visualized so that operators can see at a glance which layer a fault originates in.
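
To make the interface‑call layer concrete, the sketch below aggregates log events into per‑interface request counts and error rates, then flags interfaces above a threshold. It is a standalone illustration that does not talk to Logstash, Redis, or Elasticsearch; the event fields (`interface`, `status`) and the threshold value are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def summarize(events: Iterable[dict]) -> Dict[str, dict]:
    """Count requests and server errors (status >= 500) per interface."""
    totals = defaultdict(lambda: {"requests": 0, "errors": 0})
    for event in events:
        bucket = totals[event["interface"]]
        bucket["requests"] += 1
        if event["status"] >= 500:
            bucket["errors"] += 1
    return {
        name: {**counts, "error_rate": counts["errors"] / counts["requests"]}
        for name, counts in totals.items()
    }


def interfaces_over_threshold(summary: Dict[str, dict], threshold: float) -> List[str]:
    """Return the interfaces whose error rate exceeds the threshold, sorted by name."""
    return sorted(name for name, s in summary.items() if s["error_rate"] > threshold)


events = [
    {"interface": "/login", "status": 200},
    {"interface": "/login", "status": 500},
    {"interface": "/login", "status": 200},
    {"interface": "/login", "status": 200},
    {"interface": "/register", "status": 200},
]

summary = summarize(events)
alerts = interfaces_over_threshold(summary, threshold=0.1)
print(summary)
print(alerts)
```

In a pipeline like the one described, an aggregation of this shape would typically run as a query over the indexed data rather than in application code, with the result feeding the dashboards.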

Conclusion

By aligning goals with business impact, separating critical from non‑critical functions, employing HTTP‑DNS, and building a three‑dimensional, automated monitoring platform, the system consistently meets its objectives: 3‑minute detection, 5‑minute recovery, and at most one outage every two months.

