
How to Build a 99.99% High‑Availability Service: Practices and Architecture Evolution

This article explains the essential requirements for achieving 99.99% service availability—consistency, eliminating single points, placement groups, traffic isolation, same‑city active‑active, N+1 redundancy, and multi‑region active‑active—illustrated with a step‑by‑step Yum repository service case study and evolving architecture diagrams.

ITPUB

High‑Availability Requirements

Consistency

All service modules must run on identical hardware specifications, operating‑system versions, base software stacks, system parameters, configuration files and dependency versions. Configuration‑management tools such as Puppet can enforce periodic checks and automatic remediation to keep the environment consistent.
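
The periodic check described above can be sketched in a few lines. This is only an illustration of drift detection, not how Puppet works internally; the host names, attribute keys and version strings are hypothetical.

```python
import hashlib
import json


def fingerprint(attrs: dict) -> str:
    """Stable hash of a host's attributes; key order does not matter."""
    canonical = json.dumps(attrs, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(baseline: dict, hosts: dict) -> dict:
    """Return {host: {key: (expected, actual)}} for hosts that differ from baseline."""
    expected_hash = fingerprint(baseline)
    drift = {}
    for name, attrs in hosts.items():
        if fingerprint(attrs) == expected_hash:
            continue  # host matches the baseline exactly
        keys = sorted(set(baseline) | set(attrs))
        drift[name] = {
            k: (baseline.get(k), attrs.get(k))
            for k in keys
            if baseline.get(k) != attrs.get(k)
        }
    return drift


# Hypothetical inventory: values are made up for the example.
BASELINE = {"os": "CentOS 7.9", "nginx": "1.20.1", "somaxconn": 1024}
HOSTS = {
    "yum-01": {"os": "CentOS 7.9", "nginx": "1.20.1", "somaxconn": 1024},
    "yum-02": {"os": "CentOS 7.9", "nginx": "1.18.0", "somaxconn": 1024},
}

print(find_drift(BASELINE, HOSTS))
```

Here only yum-02 is reported, with the expected and actual Nginx versions side by side. A real setup would feed the report into remediation rather than printing it.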

Eliminate Single Points of Failure

A single point of failure exists when a module runs as only one instance, or when the failure of any one instance makes the whole service unavailable. Find such points by counting instances per module and by running destructive tests, that is, deliberately stopping an instance and observing the effect. Mitigate them by adding redundant instances and preparing failover procedures.
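
Counting instances per module is easy to automate. The sketch below assumes an inventory given as (module, host) pairs; the module and host names are hypothetical, and it covers only the counting step, not destructive testing.

```python
from collections import Counter


def single_points(instances: list[tuple[str, str]], min_instances: int = 2) -> list[str]:
    """Return the modules that run fewer than min_instances instances."""
    counts = Counter(module for module, _host in instances)
    return sorted(m for m, n in counts.items() if n < min_instances)


# Hypothetical inventory of (module, host) pairs.
INVENTORY = [
    ("nginx", "vm-1"),
    ("nginx", "vm-2"),
    ("storage", "vm-3"),
]

print(single_points(INVENTORY))
```

With this inventory the storage module is flagged, because it has a single instance.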

Placement Groups (Fault Domains)

A fault domain is the largest set of resources that a single hardware or power failure can take down, typically a rack or the hosts behind one set of switches. Distribute each module's instances across different fault domains to avoid correlated failures. In public clouds, a placement group configured with a spread policy asks the scheduler to place instances in distinct fault domains.
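
One way to verify the distribution is to compute, per module, the fraction of instances that would be lost if its most crowded fault domain failed. This is a minimal sketch; the module names and rack labels are hypothetical.

```python
from collections import Counter, defaultdict


def worst_case_loss(placements: list[tuple[str, str]]) -> dict[str, float]:
    """placements holds (module, fault_domain) pairs, one per instance.

    Returns, for each module, the share of its instances sitting in its
    most crowded fault domain (1.0 means one failure takes out the module).
    """
    per_module: dict[str, Counter] = defaultdict(Counter)
    for module, domain in placements:
        per_module[module][domain] += 1
    return {
        module: max(domains.values()) / sum(domains.values())
        for module, domains in per_module.items()
    }


# Hypothetical placement: nginx is spread, storage is not.
PLACEMENTS = [
    ("nginx", "rack-a"),
    ("nginx", "rack-b"),
    ("storage", "rack-a"),
    ("storage", "rack-a"),
]

print(worst_case_loss(PLACEMENTS))
```

A value of 1.0 for a module means all of its instances share one fault domain, so adding instances alone has not removed the correlated failure.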

Traffic Isolation

Split traffic along independent dimensions so that a failure or overload in one segment does not affect the others:

Geographic segmentation (e.g., North, Central, South regions)

Client‑platform segmentation (PC, mobile app, etc.)

Priority segmentation (VIP vs. regular users)

Resource‑intensive vs. lightweight request segmentation
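
The four dimensions above can be combined into a routing decision. The sketch below is one possible arrangement, not a prescribed one: the pool naming scheme and the precedence (heavy requests first, then VIP, then client platform) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    region: str   # e.g. "north", "central", "south"
    client: str   # e.g. "pc", "app"
    vip: bool     # priority user
    heavy: bool   # resource-intensive request


def choose_pool(req: Request) -> str:
    """Map a request to an isolated backend pool (pool names are hypothetical)."""
    if req.heavy:
        # Keep expensive requests away from everything else in the region.
        return f"{req.region}-heavy"
    if req.vip:
        return f"{req.region}-vip"
    return f"{req.region}-{req.client}"


print(choose_pool(Request(region="north", client="pc", vip=False, heavy=False)))
print(choose_pool(Request(region="south", client="app", vip=True, heavy=False)))
```

Because every pool name starts with the region, a problem in one region's pools stays inside that region.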

Same‑City Active‑Active (Multi‑AZ)

Deploy identical service stacks in two or more availability zones (AZs) within the same city. Physical isolation of power and network reduces the impact of a single‑site failure. Cloud DNS or load balancers split traffic between AZs, keeping most requests within the local AZ and providing automatic failover.
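
The routing behavior described above (prefer the local AZ, fail over when it is unhealthy) can be sketched as follows. The AZ names and the health map are hypothetical; in practice health would come from the DNS or load balancer health checks.

```python
def pick_az(client_az: str, healthy: dict[str, bool]) -> str:
    """Prefer the client's local AZ; otherwise fall back to another healthy AZ."""
    if healthy.get(client_az, False):
        return client_az
    for az in sorted(healthy):  # deterministic fallback order
        if healthy[az]:
            return az
    raise RuntimeError("no healthy AZ available")


# Hypothetical health state: az-a is down, az-b is up.
HEALTH = {"az-a": False, "az-b": True}

print(pick_az("az-b", HEALTH))  # local AZ is healthy, traffic stays local
print(pick_az("az-a", HEALTH))  # local AZ is down, traffic fails over
```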

N+1 Redundancy

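Define N as the capacity required to handle peak load and add one extra capacity unit (+1) for redundancy. All AZs should have comparable capacity so that the loss of any one AZ does not overload the remaining zones. The share of deployed capacity held in reserve shrinks as the number of AZs grows: with two AZs, each must carry the full peak load alone, so 50% of the total is redundant; with five AZs, any four must carry the peak, so the redundant share drops to 20%.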
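
The arithmetic can be checked with a short calculation. The sketch assumes equally sized AZs and the loss of exactly one AZ; the peak load figure is hypothetical.

```python
def per_az_capacity(peak_load: float, az_count: int) -> float:
    """Capacity each AZ needs so the other az_count - 1 AZs still carry peak_load."""
    if az_count < 2:
        raise ValueError("N+1 across AZs needs at least two AZs")
    return peak_load / (az_count - 1)


def redundant_share(az_count: int) -> float:
    """Fraction of total deployed capacity that is spare when all AZs are healthy."""
    if az_count < 2:
        raise ValueError("N+1 across AZs needs at least two AZs")
    return 1 / az_count


PEAK = 1000.0  # hypothetical peak load, e.g. requests per second

for azs in (2, 5):
    each = per_az_capacity(PEAK, azs)
    print(azs, each, each * azs, redundant_share(azs))
```

For a peak of 1000, two AZs need 1000 each (2000 deployed in total), while five AZs need 250 each (1250 in total).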

Multi‑Region Active‑Active

Extend the active‑active topology across distant regions to survive whole‑region outages. This introduces higher inter‑region latency (e.g., ~50 ms between North and South China) and increased bandwidth cost. Accept short‑term data inconsistency when strict consistency is not required.

Case Study – High‑Availability Deployment of a Yum Repository Service

Initial Simple Architecture

The service originally consisted of a single cloud VM running Nginx with local storage. Problems:

Single‑node failure caused a complete outage.

Limited I/O and request capacity created a performance bottleneck.

Version 2.0 – Removing Single Points

Introduce multiple stateless Nginx instances and replace local storage with a distributed file system so that every instance serves the same package data. Use Puppet for configuration management and place each instance in a different fault domain via a placement group.

Figure: High‑availability architecture 2.0

Version 3.0 – Same‑City Active‑Active

Deploy two identical stacks in separate AZs in North China. Cloud DNS distributes traffic between the AZs, providing automatic failover and load balancing.

Figure: High‑availability architecture 3.0

Version 4.0 – Multi‑Region Active‑Active

Extend the deployment to a second region (South China) with the same configuration. Traffic is isolated by region, allowing temporary data inconsistency while preserving availability.

Figure: High‑availability architecture 4.0

Key Takeaways

Each additional resilience feature (fault‑domain placement, N+1 capacity, active‑active across AZs or regions) increases operational complexity and cost. A balanced design must consider the required availability target, performance demands, and budget constraints while ensuring robust monitoring, change management and failover procedures.
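
To make the availability target concrete, it helps to convert it into a downtime budget. The calculation below is a simple sketch that assumes a 365-day year and counts all downtime equally.

```python
def downtime_budget_minutes(availability: float, days: float = 365.0) -> float:
    """Allowed downtime in minutes over the period for a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * days * 24 * 60


for target in (0.999, 0.9999):
    print(f"{target:.2%}: {downtime_budget_minutes(target):.2f} minutes per year")
```

At 99.99%, the budget is roughly 52.6 minutes per year, which is why every additional failure mode has to be covered by automated failover rather than manual recovery.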
