Designing a Three‑Dimensional High‑Availability Architecture for Alibaba's Game Integration System
The article describes how Alibaba's game integration platform achieved business‑oriented high availability by abandoning traditional system‑centric designs in favor of a three‑dimensional architecture. That architecture combines clear HA goals, multi‑active deployment, client‑side retries, functional isolation, automated monitoring, and rapid fault recovery, and it meets a target of locating issues within 3 minutes and recovering business within 5 minutes.
Abstract: To achieve high availability for Alibaba's Jiuyou game integration system, engineers moved from traditional system‑centric HA thinking to a business‑centric, three‑dimensional HA architecture, and the article presents concrete practices of this approach.
1. Business‑Oriented HA Goals
Industry HA metrics are usually expressed in "nines" (for example, four nines, or 99.99%, allows roughly 53 minutes of downtime per year). The team wanted a more intuitive target and settled on three quantifiable goals: locate issues within 3 minutes, recover business within 5 minutes, and limit incidents to at most one every two months. If each of the six incidents per year costs at most 3 + 5 = 8 minutes, the total is about 48 minutes per year, which fits within the four‑nine benchmark.
The goals focus on business impact, guide design decisions, and provide a clear evaluation criterion.
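The arithmetic behind these goals can be checked with a short sketch. It assumes that the locate and recover times add up for each incident, which the source does not state explicitly; the function names are illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability: float) -> float:
    """Annual downtime allowed by an availability ratio such as 0.9999."""
    return MINUTES_PER_YEAR * (1.0 - availability)


def worst_case_downtime_minutes(incidents_per_year: int,
                                locate_min: float,
                                recover_min: float) -> float:
    """Worst-case annual impact, assuming locate and recover times add up."""
    return incidents_per_year * (locate_min + recover_min)


four_nines_budget = downtime_budget_minutes(0.9999)     # about 52.56 minutes
team_worst_case = worst_case_downtime_minutes(6, 3, 5)  # 6 * (3 + 5) = 48 minutes

print(f"four-nine budget: {four_nines_budget:.2f} min/year")
print(f"team worst case:  {team_worst_case} min/year")
```

Under this assumption the business‑level goals leave a margin of a few minutes per year relative to the four‑nine budget.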
2. Three‑Dimensional HA Architecture Design
The overall HA objective is broken into three sub‑goals:
Avoid problems as much as possible.
Quickly locate problems.
Rapidly recover business (not merely fixing the root cause).
Because no single system can satisfy all three, a holistic, business‑level design is required. The resulting architecture (see Figure 1) combines traditional multi‑active data centers with additional services that together ensure overall business HA.
3. Client‑Side Retry + HTTP‑DNS
Client‑side retries can mask transient errors, but unreliable DNS (e.g., hijacking, cache poisoning, long TTLs) limits their effectiveness: a retry that resolves to the same bad address fails in the same way. The team therefore introduced HTTP‑DNS, a custom HTTP‑based name resolution service that provides fast, controllable updates and can be used selectively during abnormal conditions.
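A minimal sketch of how a client might combine retries with a fallback resolver is shown below. It is not the team's implementation: the `Resolver` and `Sender` types, `call_with_retry`, and the retry count are all hypothetical, and the resolvers are injected as plain functions so the example runs without a network.

```python
from typing import Callable, Optional, Sequence

Resolver = Callable[[str], str]          # hostname -> IP address (hypothetical type)
Sender = Callable[[str, bytes], bytes]   # (ip, payload) -> response (hypothetical type)


def call_with_retry(host: str,
                    payload: bytes,
                    resolvers: Sequence[Resolver],
                    send: Sender,
                    attempts_per_resolver: int = 2) -> bytes:
    """Try each resolver in order, retrying a few times before moving on.

    Listing the ordinary DNS resolver first and an HTTP-DNS resolver second
    means HTTP-DNS is consulted only when normal resolution keeps failing.
    """
    last_error: Optional[Exception] = None
    for resolve in resolvers:
        for _ in range(attempts_per_resolver):
            try:
                ip = resolve(host)
                return send(ip, payload)
            except OSError as exc:  # covers connection and timeout errors
                last_error = exc
    raise ConnectionError(f"all attempts failed for {host}") from last_error
```

In a real client the first resolver could wrap `socket.gethostbyname` and the second could query the HTTP‑DNS service; both are left abstract here because the source does not describe the service's interface.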
4. Functional Isolation + Degradation
Core functions (login, registration, verification) are physically separated from non‑core functions (messaging, logging). When a fault occurs, non‑core services can be disabled instantly via a backend control panel, preserving core service availability.
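The degradation switch can be pictured with the following sketch. The class and function names are hypothetical, the in‑memory set stands in for whatever the backend control panel actually stores, and refusing to disable core functions is an assumption made for the example rather than something the source specifies.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Set

# Core functions named in the article; they must stay available.
CORE_FEATURES = frozenset({"login", "registration", "verification"})


@dataclass
class DegradationSwitch:
    """In-memory stand-in for the state held by a backend control panel."""
    disabled: Set[str] = field(default_factory=set)

    def disable(self, feature: str) -> None:
        # Assumption for this sketch: core functions can never be switched off.
        if feature in CORE_FEATURES:
            raise ValueError(f"{feature} is a core function and cannot be degraded")
        self.disabled.add(feature)

    def enable(self, feature: str) -> None:
        self.disabled.discard(feature)

    def is_enabled(self, feature: str) -> bool:
        return feature not in self.disabled


def run_feature(switch: DegradationSwitch,
                feature: str,
                handler: Callable[[], str]) -> Optional[str]:
    """Run the handler unless the feature is degraded; return None if skipped."""
    if not switch.is_enabled(feature):
        return None
    return handler()
```

For example, after `switch.disable("messaging")`, calls to `run_feature(switch, "messaging", ...)` return `None` while login requests continue to run normally.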
5. Multi‑Active Across Regions
Traditional multi‑active solutions (cross‑city or same‑city data centers) face consistency or cost challenges. The team adopted a business‑driven approach: asynchronous data distribution, secondary reads across regions, and deterministic data generation to achieve eventual and, where needed, real‑time consistency without tight coupling.
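Two of these ideas can be illustrated with a short sketch. The source gives no implementation details, so everything here is an assumption: `RegionStore` is a toy in‑memory store, `read_with_secondary` shows one reading of "secondary reads" (fall back to another region when asynchronous distribution has not yet delivered a record), and the HMAC‑based token shows one reading of "deterministic data generation" (any region holding the shared secret can recompute and verify a value without waiting for replication).

```python
import hashlib
import hmac
from typing import Dict, Optional, Sequence


class RegionStore:
    """Toy in-memory store standing in for one region's database."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self.data.get(key)

    def put(self, key: str, value: str) -> None:
        self.data[key] = value


def read_with_secondary(key: str,
                        local: RegionStore,
                        remotes: Sequence[RegionStore]) -> Optional[str]:
    """Read locally first; on a miss, read from other regions."""
    value = local.get(key)
    if value is not None:
        return value
    for remote in remotes:
        value = remote.get(key)
        if value is not None:
            return value
    return None


def make_token(secret: bytes, user_id: str) -> str:
    """Derive a token deterministically from the user ID and a shared secret."""
    return hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_token(secret: bytes, user_id: str, token: str) -> bool:
    """Any region with the secret can verify the token without replicated state."""
    return hmac.compare_digest(make_token(secret, user_id), token)
```

The secondary read trades an occasional cross‑region round trip for correctness while replication lags; the deterministic token avoids replication entirely for data that can be recomputed.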
6. Three‑Dimensional, Automated, Visual Monitoring
Fault handling is organized into five layers: business, application service, interface call, infrastructure component, and underlying infrastructure. Automated data collection (Logstash, Redis, Elasticsearch) feeds visual dashboards, enabling rapid issue detection and location within the 3‑minute target.
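As a rough illustration of how layered monitoring can speed up fault location, the sketch below picks the lowest unhealthy layer as the first place to look. This heuristic is an assumption for the example (failures in lower layers tend to surface as symptoms in the layers above) and is not described in the source; the function name is hypothetical.

```python
from typing import Optional, Set

# The five layers named in the article, ordered from top to bottom.
LAYERS = [
    "business",
    "application service",
    "interface call",
    "infrastructure component",
    "underlying infrastructure",
]


def first_suspect(unhealthy: Set[str]) -> Optional[str]:
    """Return the lowest unhealthy layer, or None if every layer is healthy."""
    unknown = unhealthy - set(LAYERS)
    if unknown:
        raise ValueError(f"unknown layers: {sorted(unknown)}")
    for layer in reversed(LAYERS):
        if layer in unhealthy:
            return layer
    return None
```

For instance, if both the business layer and the infrastructure component layer raise alerts, the function returns `"infrastructure component"`, pointing the on‑call engineer at the component dashboards first.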
7. Summary
By aligning architecture with business‑oriented HA goals, employing client‑side retries, HTTP‑DNS, functional isolation, business‑level multi‑active data handling, and a fully automated visual monitoring platform, the team consistently meets the targets of locating issues within 3 minutes, restoring business within 5 minutes, and limiting incidents to roughly once every two months.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
