How Alibaba Games Built a 4‑9 High‑Availability System: Architecture, HTTP‑DNS & Ops Practices
This article details Alibaba Games' journey to achieve four‑nine reliability through a business‑focused high‑availability architecture, covering system analysis, a four‑layer design, client retry backed by HTTP‑DNS, service decoupling, multi‑active deployment, comprehensive monitoring, and measurable operational goals.
Project Background
The team faced frequent outages—four major incidents in a month—including cabinet power loss, switch failures, server crashes, and bugs, each causing downtime from half an hour to two hours, severely impacting game login experience.
Analysis
Initial blame fell on operations, but deeper analysis identified the root cause as weak system design. The solution was to shift responsibility to development and design robust, high‑availability systems.
Overall Architecture
The architecture is divided into four layers: user, network, service, and operations. Each layer implements measures to meet business‑oriented availability goals.
High‑Availability Goals – Traditional Approach
Industry‑standard "nines" (e.g., 4‑9 or 5‑9) are common but hard for non‑technical stakeholders to interpret.
High‑Availability Goals – Business‑Oriented
The team set business‑oriented targets: locate any issue within 3 minutes, restore service within 5 minutes, and keep major incidents to at most one every two months. Taken together, this translates to roughly a 4‑9 (99.99%) availability level.
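The mapping from these business goals back to the traditional "nines" can be sanity‑checked with a little arithmetic:

```python
# Downtime budget behind "N nines" of availability, assuming a
# 365-day year (a common convention).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(nines: int) -> float:
    """Maximum minutes of downtime per year at the given number of nines."""
    return MINUTES_PER_YEAR * 10 ** (-nines)

# The stated goals are consistent with four nines: at most one major
# incident every two months, each restored within 5 minutes, is at most
# 6 * 5 = 30 minutes of downtime a year -- inside the ~52.6-minute budget.
print(f"3 nines: {downtime_budget_minutes(3):.1f} min/year")
print(f"4 nines: {downtime_budget_minutes(4):.1f} min/year")
```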
High‑Availability Overall Design
The solution comprises four layers, each with specific countermeasures to achieve the business‑oriented goal.
Client Retry + HTTP‑DNS
Client Retry
When a backend failure occurs, the SDK retries the request on a different server, ensuring the retry does not hit the same faulty server.
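A minimal sketch of that SDK‑side behavior (function names and the error type are illustrative, not the actual SDK API): each server is tried at most once, so a retry can never land on the server that just failed.

```python
import random

def request_with_retry(servers, send, max_attempts=3):
    """Try send(server); on failure, retry against a *different* server.

    Each server appears at most once in the candidate list, so the
    retry is guaranteed not to hit the same faulty server again.
    """
    candidates = list(servers)
    random.shuffle(candidates)        # spread load across the pool
    last_error = None
    for server in candidates[:max_attempts]:
        try:
            return send(server)
        except ConnectionError as e:  # this server is faulty; move on
            last_error = e
    raise last_error                  # every attempted server failed
```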
Traditional DNS Issues
Traditional DNS suffers from hijacking, pollution, and caching, which can cause retries to hit the same problematic server.
HTTP‑DNS
The team built a private HTTP‑DNS service that maps domain names to servers via HTTP, allowing operators to instantly remove faulty servers and enabling servers to report status directly.
Combined Client Retry + HTTP‑DNS
Normal traffic uses traditional DNS for performance; upon failure, the system falls back to HTTP‑DNS, which bypasses caching and provides immediate updates.
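The fallback path can be sketched as follows. The HTTP‑DNS endpoint URL and its JSON response shape (`{"ips": [...]}`) are assumptions for illustration, not the team's actual API; here the fallback is triggered by a resolution error, whereas the real system also falls back after a request against a cached answer fails.

```python
import json
import socket
import urllib.request

HTTPDNS_ENDPOINT = "http://httpdns.example.com/resolve?host={host}"  # hypothetical

def dns_resolve(host: str) -> list[str]:
    """Fast path: the operating system's traditional DNS resolver."""
    return [socket.gethostbyname(host)]

def httpdns_resolve(host: str) -> list[str]:
    """Fallback: fetch fresh, uncached IPs straight from the HTTP-DNS service."""
    with urllib.request.urlopen(HTTPDNS_ENDPOINT.format(host=host), timeout=2) as r:
        return json.load(r)["ips"]

def resolve(host: str, primary=dns_resolve, fallback=httpdns_resolve) -> list[str]:
    """Use traditional DNS normally; fall back to HTTP-DNS on failure."""
    try:
        return primary(host)
    except OSError:            # DNS failure (socket.gaierror is a subclass)
        return fallback(host)  # bypasses resolver caches entirely
```

Because the HTTP‑DNS answer comes over a plain HTTP call, there is no intermediate resolver cache to poison or go stale, which is exactly what makes it a reliable fallback.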
Architecture Decoupling
Business Separation
Core game functions (login, registration, parameter delivery) are split from non‑core services (messaging, logging, updates) into separate systems accessed via interfaces, preventing non‑core failures from affecting core gameplay.
Service Center
The Service Center acts like an internal DNS, providing name‑to‑address resolution and allowing faulty instances to be removed dynamically, similar to HTTP‑DNS but for internal service calls.
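In the spirit of that description, a minimal in‑memory registry might look like this (the real Service Center would be a shared, replicated service; all names here are illustrative):

```python
class ServiceCenter:
    """Internal name-to-address resolution, like DNS for service calls."""

    def __init__(self):
        self._registry: dict[str, set[str]] = {}

    def register(self, name: str, addr: str) -> None:
        self._registry.setdefault(name, set()).add(addr)

    def deregister(self, name: str, addr: str) -> None:
        """Remove a faulty instance so callers stop resolving to it."""
        self._registry.get(name, set()).discard(addr)

    def resolve(self, name: str) -> list[str]:
        return sorted(self._registry.get(name, set()))
```

The key property is the dynamic `deregister`: pulling a bad instance out of the registry immediately redirects all internal callers, just as removing a server from HTTP‑DNS redirects external clients.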
Business Degradation
In critical situations, non‑core services can be selectively degraded (e.g., returning 500/503) at the interface level, preserving core functionality while sacrificing optional features.
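An interface‑level degradation switch can be as simple as the sketch below (service names and return shape are illustrative): flipping the switch makes a non‑core endpoint return 503 immediately instead of doing real work.

```python
DEGRADED: set[str] = set()   # services currently switched off

def degrade(service: str) -> None:
    """Flip the switch for a non-core service during an incident."""
    DEGRADED.add(service)

def handle(service: str, handler):
    """Interface-level gate: short-circuit degraded services with a 503."""
    if service in DEGRADED:
        return 503, "service temporarily degraded"
    return 200, handler()

degrade("messaging")  # e.g. shed messaging load to protect core login
```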
Multi‑Active Deployment
Previous architecture used a single primary database, creating a global single point of failure and cross‑region replication delays. The new design introduces dual primary databases with application‑level data synchronization and secondary reads to mitigate latency and ensure continuity.
360° Monitoring
Integrated Layers
Monitoring spans five layers: business, application service, interface call, component, and infrastructure, providing comprehensive visibility for rapid fault localization.
Automation
An ELK‑based real‑time log collection and analysis pipeline automates fault detection, eliminating manual log retrieval and script debugging.
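The analysis step can be pictured with a toy example: parse access‑log lines, compute the error rate over a window, and flag when it crosses a threshold. The log format and threshold below are assumptions, not the team's actual pipeline configuration.

```python
import re

# Matches a status code in a common access-log format,
# e.g. '1.2.3.4 - - "GET /login HTTP/1.1" 500 0'
LOG_LINE = re.compile(r'"\w+ \S+ HTTP/[\d.]+" (\d{3})')

def error_rate(lines) -> float:
    """Fraction of parsed lines with a 5xx status."""
    statuses = [int(m.group(1)) for m in map(LOG_LINE.search, lines) if m]
    if not statuses:
        return 0.0
    return sum(s >= 500 for s in statuses) / len(statuses)

def should_alert(lines, threshold=0.05) -> bool:
    """Fire an alert when the windowed error rate exceeds the threshold."""
    return error_rate(lines) > threshold
```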
Visualization
Key metrics (traffic, success rate, latency, error rate) are visualized, enabling any team member to quickly assess system health.
Design Philosophy
Business‑Oriented: Focus on the entire business flow rather than isolated modules.
Technology‑Driven: Solutions rely on technical improvements rather than process or hardware changes.
Core Focus: Non‑core services can be disabled during emergencies.
Quantifiable: All goals and metrics are measurable.
Results
Before the redesign, the system suffered roughly one major outage per month (≈3‑9 availability). Since the implementation there have been no major incidents, achieving ≈4‑9 availability; the system has even survived hardware failures through seamless failover.
Vision
Eliminate the notion of “operations bearing the blame” by having development design high‑availability systems collaboratively with operations, testing, and product teams.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and aim to accompany you, and grow with you, throughout your operations career.