
How Alibaba Games Built a 4‑9 High‑Availability System: Architecture, HTTP‑DNS & Ops Practices

This article details Alibaba Games' journey to achieve four‑nine reliability through a business‑focused high‑availability architecture, including system analysis, a four‑layer design, HTTP‑DNS client retry, service decoupling, multi‑active deployment, comprehensive monitoring, and measurable operational goals.


Project Background

The team faced frequent outages, four major incidents in a single month, caused by cabinet power loss, switch failures, server crashes, and software bugs. Each incident brought half an hour to two hours of downtime, severely degrading the game login experience.

Analysis

Initial blame fell on operations, but deeper analysis identified the root cause as weak system design. The solution was to shift responsibility to development and design robust, high‑availability systems.

Overall Architecture

The architecture is divided into four layers: user, network, service, and operations. Each layer implements measures to meet business‑oriented availability goals.

High‑Availability Goals – Traditional Approach

Industry‑standard "nines" (e.g., 4‑9 or 5‑9) are common but hard for non‑technical stakeholders to interpret.

High‑Availability Goals – Business‑Oriented

The team set a target of locating issues within 3 minutes, restoring service within 5 minutes, and limiting major incidents to once every two months. This translates to roughly a 4‑9 availability level.
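As a sanity check on the translation from business goals to "nines" (assuming the conventional definition of availability as the ratio of uptime to total time), the arithmetic works out:

```python
# Rough consistency check between the business goals and a 4-9 target,
# assuming availability = uptime / total time.

MINUTES_PER_YEAR = 365 * 24 * 60            # 525,600

# Four nines (99.99%) leaves about 52.6 minutes of downtime per year.
annual_budget = MINUTES_PER_YEAR * (1 - 0.9999)

# Business goal: at most one major incident every two months, each
# restored within 5 minutes -> 6 incidents/year * 5 minutes = 30 minutes.
worst_case = 6 * 5

assert worst_case < annual_budget           # the goal fits inside the 4-9 budget
```

So even in the worst case allowed by the business goal, the system stays within the downtime budget that four nines implies.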

High‑Availability Overall Design

The solution comprises four layers, each with specific countermeasures to achieve the business‑oriented goal.

Client Retry + HTTP‑DNS

Client Retry

When a backend failure occurs, the SDK retries the request on a different server, ensuring the retry does not hit the same faulty server.
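A minimal sketch of this retry policy, assuming a flat list of candidate servers; the function and parameter names are illustrative, since the article does not describe the SDK's real API:

```python
import random

def request_with_retry(servers, send, max_attempts=3):
    """Retry a failed request against a different server each time.

    `servers` is a list of addresses and `send` performs the actual
    request, raising on failure. Both names are illustrative.
    """
    candidates = list(servers)
    random.shuffle(candidates)       # spread load; never repeat a server
    last_error = None
    for server in candidates[:max_attempts]:
        try:
            return send(server)
        except Exception as exc:     # a real SDK would catch narrower errors
            last_error = exc         # remember the failure, move to the next server
    if last_error is None:
        raise RuntimeError("no servers available")
    raise last_error
```

Because each attempt consumes a distinct entry from the shuffled list, a retry can never land on the server that just failed.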

Traditional DNS Issues

Traditional DNS suffers from hijacking, pollution, and caching, which can cause retries to hit the same problematic server.

HTTP‑DNS

The team built a private HTTP‑DNS service that maps domain names to servers via HTTP, allowing operators to instantly remove faulty servers and enabling servers to report status directly.
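The core idea can be sketched as an HTTP lookup that replaces the system resolver. The endpoint path and JSON shape below are assumptions for illustration, not Alibaba's actual API:

```python
import json
from urllib.request import urlopen

def parse_httpdns_answer(raw):
    """Extract the IP list from the HTTP-DNS JSON body.

    Assumed shape: {"host": "...", "ips": ["1.2.3.4", ...], "ttl": 60}
    """
    answer = json.loads(raw)
    return answer["ips"]

def httpdns_resolve(domain, httpdns_host="httpdns.example.com"):
    """Resolve `domain` by querying the HTTP-DNS service over plain HTTP.

    Because the answer comes from a service the operator controls,
    a faulty server can be dropped from the IP list instantly, with
    no resolver caches in the way.
    """
    url = f"http://{httpdns_host}/d?host={domain}"
    with urlopen(url, timeout=2) as resp:
        return parse_httpdns_answer(resp.read())
```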

Combined Client Retry + HTTP‑DNS

Normal traffic uses traditional DNS for performance; upon failure, the system falls back to HTTP‑DNS, which bypasses caching and provides immediate updates.
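The fallback logic is simple to express: try the OS resolver first, and only on failure ask the HTTP-DNS service. `httpdns_lookup` here is an injected callable standing in for the HTTP-DNS client (an illustrative name):

```python
import socket

def resolve_ips(domain, httpdns_lookup):
    """Resolve via traditional DNS first; fall back to HTTP-DNS on failure.

    Traditional DNS is cheap and cached, so it serves the happy path;
    the HTTP-DNS fallback bypasses those caches and reflects operator
    updates immediately.
    """
    try:
        infos = socket.getaddrinfo(domain, None)
        return sorted({info[4][0] for info in infos})
    except OSError:
        # DNS failed (or returned nothing usable): ask HTTP-DNS instead.
        return httpdns_lookup(domain)
```

A production client would also fall back when traditional DNS returns answers that the retry logic has marked as bad, not only on resolution errors.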

Architecture Decoupling

Business Separation

Core game functions (login, registration, parameter delivery) are split from non‑core services (messaging, logging, updates) into separate systems accessed via interfaces, preventing non‑core failures from affecting core gameplay.

Service Center

The Service Center acts like an internal DNS, providing name‑to‑address resolution and allowing faulty instances to be removed dynamically, similar to HTTP‑DNS but for internal service calls.
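A minimal in-memory sketch of such a registry; class and method names are illustrative, and a real Service Center would persist state and health-check instances:

```python
class ServiceCenter:
    """Internal name-to-address registry, analogous to HTTP-DNS
    but for service-to-service calls."""

    def __init__(self):
        self._services = {}  # service name -> set of addresses

    def register(self, name, address):
        self._services.setdefault(name, set()).add(address)

    def deregister(self, name, address):
        # Called by ops or health checks to pull a faulty instance.
        self._services.get(name, set()).discard(address)

    def resolve(self, name):
        addresses = self._services.get(name)
        if not addresses:
            raise LookupError(f"no healthy instance for {name}")
        return next(iter(addresses))  # a real system load-balances here
```

Deregistering a faulty instance takes effect on the very next `resolve` call, which is what makes dynamic removal possible without client redeploys.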

Business Degradation

In critical situations, non‑core services can be selectively degraded (e.g., returning 500/503) at the interface level, preserving core functionality while sacrificing optional features.
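An interface-level degradation switch can be as simple as a flag checked before dispatch; the names below are illustrative:

```python
# Non-core services currently switched off by ops during an incident.
DEGRADED = set()

def handle(service, request_fn):
    """Answer 503 immediately for degraded services; serve the rest."""
    if service in DEGRADED:
        return (503, "service temporarily degraded")
    return (200, request_fn())

# During an incident, DEGRADED.add("messaging") sheds that service's
# load while core endpoints such as "login" keep working untouched.
```

The key property is that the check happens before any real work, so a degraded non-core service consumes almost no resources that the core services might need.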

Multi‑Active Deployment

Previous architecture used a single primary database, creating a global single point of failure and cross‑region replication delays. The new design introduces dual primary databases with application‑level data synchronization and secondary reads to mitigate latency and ensure continuity.
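The write-locally, sync-to-peer idea can be sketched as follows. This is a toy last-writer-wins model under the assumption of application-level replication; real dual-primary systems must also handle conflicts, ordering, and retries:

```python
class Region:
    """One region in a dual-primary setup: writes hit the local
    primary and are queued for application-level sync to the peer;
    reads are served locally, so peer outages or replication lag
    never block this region. (Illustrative sketch only.)"""

    def __init__(self, name):
        self.name = name
        self.store = {}   # local primary data
        self.outbox = []  # pending sync records for the peer region

    def write(self, key, value):
        self.store[key] = value
        self.outbox.append((key, value))  # shipped asynchronously in reality

    def read(self, key):
        return self.store.get(key)

def sync(src, dst):
    """Apply src's pending writes to dst (last-writer-wins)."""
    while src.outbox:
        key, value = src.outbox.pop(0)
        dst.store[key] = value
```

Because each region answers reads from its own store, cross-region latency only affects how fresh the peer's copy is, not whether users can log in.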

360° Monitoring

Integrated Layers

Monitoring spans five layers: business, application service, interface call, component, and infrastructure, providing comprehensive visibility for rapid fault localization.

Automation

An ELK‑based real‑time log collection and analysis pipeline automates fault detection, eliminating manual log retrieval and script debugging.
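The automation boils down to parsing structured records out of collected logs and alerting on a rule, rather than grepping by hand. The log format and threshold below are illustrative, not the team's actual pipeline:

```python
def error_rate(log_lines):
    """Fraction of requests whose trailing status code is 5xx.

    Assumes access-log lines ending in an HTTP status, e.g.
    "GET /login 200" (an illustrative format).
    """
    statuses = [line.rsplit(" ", 1)[-1] for line in log_lines]
    errors = sum(1 for s in statuses if s.startswith("5"))
    return errors / len(statuses) if statuses else 0.0

def should_alert(log_lines, threshold=0.01):
    """Fire when the error rate in the window exceeds the threshold."""
    return error_rate(log_lines) > threshold
```

In an ELK-style pipeline the parsing would happen at ingestion (Logstash) and the rule would run against the index (Elasticsearch), but the detection logic is the same shape.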

Visualization

Key metrics (traffic, success rate, latency, error rate) are visualized, enabling any team member to quickly assess system health.

Design Philosophy

Business-Oriented: Focus on the entire business flow rather than isolated modules.

Technology-Driven: Solutions rely on technical improvements rather than process or hardware changes.

Core Focus: Non-core services can be disabled during emergencies.

Quantifiable: All goals and metrics are measurable.

Results

Before the redesign, the system experienced roughly one major outage per month (about 3-9 availability). Since the redesign there have been no major incidents, reaching roughly 4-9 availability; the system has even ridden out hardware failures through seamless failover.

Vision

Eliminate the notion of “operations bearing the blame” by having development design high‑availability systems collaboratively with operations, testing, and product teams.

Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
