How to Build Truly High‑Availability Systems: Redundancy, Failover, and Layered Architecture
High availability (HA) is a core requirement for distributed systems. It is achieved by building redundancy and automatic failover into every architectural layer, from the client through the reverse proxy, gateway, business logic, cache, and storage, in order to minimize downtime, reach the desired “nines” of uptime, and prevent cascading failures such as service snowballing.
1. What Is High Availability
High availability (HA) is a core consideration in distributed system architecture design: it means using design to reduce the time during which a system cannot provide service.
For example, a system that serves without interruption has 100% availability; one that is unavailable for 1 or 2 units out of every 100 units of time has 99% or 98% availability. Note that even a system guaranteeing 99% availability can be down for up to 3.65 days per year.
Most enterprises aim for “four nines” (99.99%); the more nines, the higher the availability:
2 nines: basic availability, less than 88 hours of downtime per year.
3 nines: higher availability, less than 9 hours of downtime per year.
4 nines: high availability with automatic recovery, less than 53 minutes of downtime per year.
5 nines: ultra‑high availability, roughly 5 minutes of downtime per year.
Availability is calculated as:
Downtime = fault recovery time – fault detection time (the interval from when a fault is detected to when it is repaired).
Annual availability = (1 – downtime / total annual time) × 100%.
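As a quick sanity check on the downtime figures above, the annual downtime budget for N “nines” can be computed directly. This is a minimal sketch; the class and method names are our own:

```java
// Illustrative calculation of the annual downtime budget for N "nines".
// Assumes a 365-day year; class and method names are illustrative.
public class NinesCalculator {
    static final double MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600 minutes

    // Allowed downtime per year for a given number of nines,
    // e.g. 4 nines -> availability 99.99% -> unavailability 0.01%.
    static double downtimeMinutesPerYear(int nines) {
        double unavailability = Math.pow(10, -nines); // 1 - 0.99...9
        return MINUTES_PER_YEAR * unavailability;
    }

    public static void main(String[] args) {
        for (int n = 2; n <= 5; n++) {
            System.out.printf("%d nines: %.2f minutes of downtime per year%n",
                    n, downtimeMinutesPerYear(n));
        }
    }
}
```

Running this reproduces the table: roughly 5,256 minutes (about 88 hours) for two nines down to about 5.26 minutes for five nines.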
Availability figures feed into performance assessments of technology, operations, and related teams, and they influence architectural decisions. Rapid business growth can erode the availability that is practically achievable, prompting strategies such as scaling out or adding backend capacity.
Failures are often weighted (e.g., accident‑level weight 100, A‑class weight 20) and scored as: score = failure duration (minutes) × weight.
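That scoring rule is simple arithmetic; a minimal sketch follows (the weights come from the text, the class name is illustrative):

```java
// Illustrative weighted failure scoring: score = duration (minutes) x weight.
// Weights follow the text (accident-level 100, A-class 20); names are our own.
public class FailureScore {
    static int score(int durationMinutes, int weight) {
        return durationMinutes * weight;
    }

    public static void main(String[] args) {
        // A 30-minute accident-level failure scores far higher than the
        // same outage at A-class severity.
        System.out.println(score(30, 100)); // accident-level
        System.out.println(score(30, 20));  // A-class
    }
}
```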
2. How to Ensure System High Availability
A single point of failure is the biggest enemy of HA; the guiding principle is to eliminate single points through clustering and redundancy.
Redundancy Design
Distributed systems should avoid single‑point failures by deploying multiple instances, preferably in different physical locations. Redundancy improves throughput and enables rapid disaster recovery. Common patterns include master‑slave and peer‑to‑peer designs, with master‑slave further divided into one‑master‑multiple‑slaves or multi‑master configurations.
Redundancy raises consistency considerations: strong consistency versus eventual consistency. According to the CAP theorem, a system cannot simultaneously guarantee consistency, availability, and partition tolerance. Strong consistency often sacrifices HA, as illustrated by ZooKeeper: only the leader can accept writes, so a leader failure leaves the cluster unable to serve writes until a new leader is elected.
Peer‑to‑peer designs, such as Netflix’s open‑source Eureka service registry, achieve HA by deploying multiple Eureka servers that synchronize asynchronously, providing eventual consistency while maintaining high availability.
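As an illustration, a two-node peer setup is typically configured by pointing each Eureka server’s `defaultZone` at its peer; hostnames and ports below are placeholders, not values from the text:

```yaml
# application.yml on eureka-peer1 (hostnames/ports are placeholders)
spring:
  application:
    name: eureka-server
eureka:
  instance:
    hostname: eureka-peer1
  client:
    # register with, and replicate the registry from, the other peer
    serviceUrl:
      defaultZone: http://eureka-peer2:8761/eureka/
```

The second server mirrors this configuration with the hostnames swapped, so each peer replicates registrations from the other.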
Circuit‑Breaker Design
In distributed systems, a request may traverse multiple services; if a downstream service fails, it can cause upstream services to fail, leading to a cascade or “snowball” effect. Implementing a circuit‑breaker mechanism isolates upstream services when downstream services are overloaded or unavailable, preventing system‑wide outages.
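The core of a circuit breaker can be sketched as a small state machine. The following is an illustrative sketch, not a production library such as Hystrix or Resilience4j; all names and thresholds are assumptions:

```java
import java.time.Duration;
import java.time.Instant;

// Minimal circuit-breaker sketch: after a threshold of consecutive
// failures the breaker "opens" and fails fast instead of calling the
// downstream service, then allows a trial call after a cooldown.
public class CircuitBreaker {
    enum State { CLOSED, OPEN }

    private final int failureThreshold;
    private final Duration cooldown;
    private int consecutiveFailures = 0;
    private State state = State.CLOSED;
    private Instant openedAt;

    CircuitBreaker(int failureThreshold, Duration cooldown) {
        this.failureThreshold = failureThreshold;
        this.cooldown = cooldown;
    }

    synchronized boolean allowRequest() {
        if (state == State.CLOSED) return true;
        // OPEN: permit a trial call only once the cooldown has elapsed
        // (the "half-open" behavior of full implementations).
        return Duration.between(openedAt, Instant.now()).compareTo(cooldown) >= 0;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = Instant.now();
        }
    }

    public static void main(String[] args) {
        CircuitBreaker breaker = new CircuitBreaker(3, Duration.ofSeconds(30));
        breaker.recordFailure();
        breaker.recordFailure();
        breaker.recordFailure(); // third consecutive failure trips the breaker
        System.out.println(breaker.allowRequest()); // fails fast while open
    }
}
```

The upstream service checks `allowRequest()` before each downstream call; while the breaker is open, it returns a fallback immediately rather than piling more load onto the failing dependency.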
Redundancy alone is not enough: manual recovery still adds downtime, so automatic failover is essential.
3. Typical Internet Layered Architecture
The common layered architecture consists of:
Client layer (browsers or mobile apps).
Reverse‑proxy layer (system entry point).
Gateway layer (API authentication and routing).
Business‑logic layer (highly cohesive, loosely coupled services).
Persistence layer (data storage and caching).
High availability is achieved by applying redundancy and automatic failover at each layer.
4. Layer‑by‑Layer HA Practices
Client → Reverse‑Proxy : Deploy at least two reverse‑proxy instances (e.g., Nginx) with keepalived and a virtual IP. If one Nginx fails, keepalived detects it and shifts traffic to the backup transparently.
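A sketch of the keepalived side of this setup; the interface name, virtual IP, and script path are placeholders:

```conf
# /etc/keepalived/keepalived.conf on the MASTER node
vrrp_script chk_nginx {
    script "/usr/local/bin/check_nginx.sh"  # exits non-zero when Nginx is down
    interval 2
    weight -20                              # drop priority so the BACKUP takes over
}

vrrp_instance VI_1 {
    state MASTER            # the standby node uses "state BACKUP" and a lower priority
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        192.168.1.100       # the virtual IP that clients resolve to
    }
    track_script {
        chk_nginx
    }
}
```

Clients only ever see the virtual IP, so the failover is transparent to them.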
Reverse‑Proxy → Gateway : Configure Nginx to balance requests across multiple gateway services and perform health checks. Failed gateways are automatically bypassed.
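With open-source Nginx this is usually done with an `upstream` block and passive health checks via `max_fails`/`fail_timeout` (active health checks require NGINX Plus or a third-party module); the addresses below are placeholders:

```nginx
upstream gateway_cluster {
    server 10.0.0.11:8080 max_fails=3 fail_timeout=10s;
    server 10.0.0.12:8080 max_fails=3 fail_timeout=10s;
}

server {
    listen 80;
    location / {
        proxy_pass http://gateway_cluster;
        # retry the next gateway when one errors out or times out
        proxy_next_upstream error timeout http_502 http_503;
    }
}
```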
Gateway → Business Logic : Services register themselves with a service registry (e.g., Eureka) and send heartbeats. The gateway fetches the live service list and routes requests to healthy instances; failed instances are removed automatically.
Business Logic → Cache : Use cache redundancy (e.g., Redis master‑slave with Sentinel). Sentinel monitors master health and redirects clients to a new master upon failure.
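A minimal Sentinel configuration for this pattern might look like the following; the master name, address, and timeouts are placeholders:

```conf
# sentinel.conf on each of (at least) three Sentinel nodes
sentinel monitor mymaster 10.0.0.21 6379 2    # a quorum of 2 Sentinels must agree the master is down
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
```

When the quorum agrees the master is unreachable, the Sentinels promote a replica and inform clients of the new master address.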
Business Logic → Database : Deploy dual‑master MySQL with keepalived and a virtual IP. When the primary fails, keepalived switches the IP to the standby, ensuring transparent continuity.
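In a dual-master pair, each node replicates from the other, and staggered auto-increment settings keep the two masters from generating colliding primary keys. This is a sketch; server IDs and values are illustrative:

```ini
# my.cnf on node A (node B uses server-id = 2 and auto_increment_offset = 2)
[mysqld]
server-id                = 1
log-bin                  = mysql-bin
auto_increment_increment = 2   # both masters step IDs by 2 ...
auto_increment_offset    = 1   # ... node A takes odd IDs, node B even IDs
```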
5. Summary
High availability is a fundamental requirement for distributed systems, achieved through redundancy and automatic failover at every architectural layer. By designing each layer with redundant components and failover mechanisms—client, proxy, gateway, business logic, cache, and storage—organizations can meet desired “nines” of uptime and prevent cascading failures.
Architect's Alchemy Furnace
A comprehensive platform that combines Java development and architecture design, guaranteeing 100% original content. We explore the essence and philosophy of architecture and provide professional technical articles for aspiring architects.
