How Alibaba Achieved 3‑Minute Issue Detection and 5‑Minute Recovery for Game Services
This article describes how Alibaba's NineGame access system moved from traditional system‑centric high‑availability designs to a business‑oriented, three‑dimensional architecture that targets 3‑minute problem detection, 5‑minute service restoration, and no more than one outage every two months on average.
Business‑Oriented High‑Availability Goals
The team replaced the usual "five nines" metric with a more actionable target: locate issues within 3 minutes, restore services within 5 minutes, and experience no more than one outage every two months on average. Together these targets are roughly equivalent to four‑nines (99.99%) availability.
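The arithmetic behind that equivalence can be checked quickly. The sketch below assumes every outage uses the full 3 minutes of detection plus 5 minutes of recovery, and that one outage every two months means six per year; the function and variable names are illustrative and do not come from the original system.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability: float) -> float:
    """Yearly downtime allowed by an availability target such as 0.9999."""
    return MINUTES_PER_YEAR * (1 - availability)


def worst_case_downtime_minutes(detect_min: float, recover_min: float,
                                outages_per_year: int) -> float:
    """Yearly downtime if every outage uses its full detection and recovery time."""
    return (detect_min + recover_min) * outages_per_year


budget = downtime_budget_minutes(0.9999)            # roughly 52.56 minutes
worst_case = worst_case_downtime_minutes(3, 5, 6)   # (3 + 5) * 6 = 48 minutes
print(f"four-nines budget: {budget:.2f} min, worst case: {worst_case} min")
```

Under these assumptions the business targets add up to 48 minutes of downtime per year, which fits inside the four‑nines budget with a few minutes to spare.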
Three‑Dimensional High‑Availability Architecture
The architecture is divided into three sub‑goals: prevent problems, locate them quickly, and recover services rapidly. No single system design can satisfy all three, so a holistic, business‑centric approach is required.
Client‑Side Retry + HTTP‑DNS
Client‑side retries reduce perceived downtime, but unreliable DNS can undermine them, since a retry only helps if the client can resolve a working address. The solution is HTTP‑DNS, in which clients obtain host addresses over HTTP from a service the team controls. This offers stronger control and faster updates, while traditional DNS remains available for normal traffic.
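A minimal sketch of how retries and HTTP‑DNS might combine on the client follows. It assumes the client tries the system resolver first and consults HTTP‑DNS only when that fails; the source does not spell out the order. The resolver and sender functions, the host name, and the addresses (taken from the documentation range 203.0.113.0/24) are all hypothetical stand‑ins so the example runs without a network.

```python
from typing import Callable, Iterable, List, Optional

Resolver = Callable[[str], List[str]]  # hostname -> candidate IP addresses
Sender = Callable[[str], str]          # IP address -> response body


def request_with_retry(host: str, resolvers: Iterable[Resolver], send: Sender,
                       attempts_per_ip: int = 2) -> str:
    """Try each resolver in order and retry every address it returns."""
    last_error: Optional[Exception] = None
    for resolve in resolvers:
        try:
            addresses = resolve(host)
        except OSError as exc:       # resolution failed, move to the next resolver
            last_error = exc
            continue
        for ip in addresses:
            for _ in range(attempts_per_ip):
                try:
                    return send(ip)
                except OSError as exc:   # request failed, retry or try next address
                    last_error = exc
    raise ConnectionError(f"all attempts failed for {host}") from last_error


# Simulated environment (assumptions for illustration only).
def system_dns(host: str) -> List[str]:
    raise OSError("DNS lookup failed")


def http_dns(host: str) -> List[str]:
    return ["203.0.113.10", "203.0.113.11"]


def send(ip: str) -> str:
    if ip == "203.0.113.10":
        raise OSError("host unreachable")
    return f"ok from {ip}"


result = request_with_retry("api.example.com", [system_dns, http_dns], send)
print(result)
```

Passing the resolvers as an ordered list keeps the retry loop independent of where addresses come from, so the order can be swapped without touching the request logic.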
Function Separation + Degradation
Core functions (login, registration, verification) are isolated from non‑core functions (push, logging) both logically and physically, preventing resource contention. When failures occur, non‑core services can be degraded or disabled via a simple admin UI, enabling sub‑5‑minute recovery.
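The degradation switch can be pictured as a small registry of disabled features that request handlers consult before doing non‑core work. This is a sketch under assumptions: the class name, its methods, and the rule that core functions can never be switched off are illustrative choices, not details confirmed by the source.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Set

CORE_FUNCTIONS = {"login", "registration", "verification"}


@dataclass
class DegradationSwitch:
    """Tracks which non-core features an operator has switched off."""
    disabled: Set[str] = field(default_factory=set)

    def disable(self, feature: str) -> None:
        # Assumption: core functions are protected from being degraded.
        if feature in CORE_FUNCTIONS:
            raise ValueError(f"{feature} is a core function and cannot be degraded")
        self.disabled.add(feature)

    def enable(self, feature: str) -> None:
        self.disabled.discard(feature)

    def call(self, feature: str, handler: Callable[[], Any], fallback: Any = None) -> Any:
        """Run the handler unless the feature is degraded; then return the fallback."""
        if feature in self.disabled:
            return fallback
        return handler()


switch = DegradationSwitch()
switch.disable("push")  # what an operator would trigger from the admin UI

push_result = switch.call("push", lambda: "push sent", fallback="push skipped")
login_result = switch.call("login", lambda: "login ok")
print(push_result, "/", login_result)
```

Because flipping a flag needs no deployment, turning off a non‑core feature is fast enough to fit inside a 5‑minute recovery window.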
Multi‑Active Data Centers
Traditional active‑active solutions either sacrifice consistency or require heavy investment. Alibaba’s approach lets the business layer decide between eventual and strong consistency, using asynchronous distribution, secondary reads, and deterministic data generation to achieve cross‑region failover without strict data‑layer constraints.
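One plausible reading of "secondary reads" is that a data center which cannot find a record locally, because asynchronous distribution has not delivered it yet, reads it from its peer before giving up. The sketch below models that reading with in‑memory dictionaries; the class, its methods, and the key format are hypothetical and are not taken from the original system.

```python
from typing import Dict, Optional


class DataCenter:
    """In-memory stand-in for one region's data store."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, str] = {}
        self.peer: Optional["DataCenter"] = None

    def write(self, key: str, value: str) -> None:
        # The write lands locally; distribution to the peer happens later.
        self.store[key] = value

    def replicate_to_peer(self) -> None:
        # Stands in for the asynchronous distribution job.
        if self.peer is not None:
            self.peer.store.update(self.store)

    def read(self, key: str) -> Optional[str]:
        value = self.store.get(key)
        if value is None and self.peer is not None:
            # Secondary read: fall back to the peer's copy on a local miss.
            value = self.peer.store.get(key)
        return value


dc_a, dc_b = DataCenter("A"), DataCenter("B")
dc_a.peer, dc_b.peer = dc_b, dc_a

dc_a.write("session:42", "alice")
before_replication = dc_b.read("session:42")  # served by the secondary read
dc_a.replicate_to_peer()
after_replication = dc_b.store.get("session:42")  # now present locally
print(before_replication, after_replication)
```

The secondary read costs a cross‑region round trip only on a local miss, which is the trade‑off that lets the data layer stay eventually consistent.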
Three‑Dimensional, Automated, Visual Monitoring
Monitoring spans five layers (business, application service, interface call, infrastructure component, and hardware) and collects metrics such as request rates, error codes, and resource usage. Data is gathered automatically by Logstash, cached in Redis, and indexed in Elasticsearch, then visualized for instant fault diagnosis.
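To make the layering concrete, the sketch below shows one way an application could emit a JSON log line tagged with its monitoring layer, and how error rates per layer could be derived from such lines. The field names, the layer identifiers, and the rule that status codes of 500 and above count as errors are assumptions for illustration. The code only produces and parses plain JSON; it does not call Logstash, Redis, or Elasticsearch.

```python
import json
from collections import Counter
from typing import Dict, Iterable

LAYERS = ("business", "application_service", "interface_call",
          "infrastructure_component", "hardware")


def make_event(layer: str, name: str, status_code: int, latency_ms: float) -> str:
    """Serialize one monitoring event as a JSON line a log shipper could collect."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return json.dumps({"layer": layer, "name": name,
                       "status_code": status_code, "latency_ms": latency_ms},
                      sort_keys=True)


def error_rate_by_layer(lines: Iterable[str]) -> Dict[str, float]:
    """Fraction of events per layer whose status code is 500 or higher."""
    total: Counter = Counter()
    errors: Counter = Counter()
    for line in lines:
        event = json.loads(line)
        total[event["layer"]] += 1
        if event["status_code"] >= 500:   # assumed definition of an error
            errors[event["layer"]] += 1
    return {layer: errors[layer] / total[layer] for layer in total}


events = [
    make_event("interface_call", "/login", 200, 12.5),
    make_event("interface_call", "/login", 500, 30.0),
    make_event("business", "login", 200, 40.0),
]
rates = error_rate_by_layer(events)
print(rates)
```

Tagging every event with its layer is what lets a dashboard show at a glance whether a spike in business errors originates in an interface call, a component, or the hardware underneath.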
Conclusion
By aligning goals with business impact, separating critical from non‑critical functions, employing HTTP‑DNS, and building a three‑dimensional, automated monitoring platform, the system consistently meets its objectives of 3‑minute detection, 5‑minute recovery, and no more than one outage every two months.