High Availability Design in Internet Architecture: Redundancy and Automatic Failover
This article explains the principles of high availability in internet systems, covering redundancy, automatic failover, availability metrics, and detailed HA designs for each architectural layer such as load balancers, microservices, middleware, and databases.
High availability (HA) aims to ensure continuous business service from the user's perspective by designing redundant and fault‑tolerant architectures. A layered approach splits a large IT system into application, middleware, and storage layers, each further divided into fine‑grained components that must all be HA‑designed.
Availability Levels
| Availability Level | System Uptime % | Downtime / Year | Downtime / Month | Downtime / Week | Downtime / Day |
|---|---|---|---|---|---|
| Unavailable | 90% | 36.5 days | 73 hours | 16.8 hours | 144 minutes |
| Basic | 99% | 87.6 hours | 7.3 hours | 1.68 hours | 14.4 minutes |
| Higher | 99.9% | 8.76 hours | 43.8 minutes | 10.1 minutes | 1.44 minutes |
| High | 99.99% | 52.56 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds |
| Very High | 99.999% | 5.26 minutes | 26.28 seconds | 6.05 seconds | 0.86 seconds |
Typical large‑scale internet services target at least four 9s (99.99% uptime), while mission‑critical systems may require five 9s.
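The table values follow directly from the uptime percentage. A small sketch of the arithmetic (using the conventional 730-hour month):

```python
# Downtime budget implied by an uptime percentage, for the periods in the
# table above. A "month" is conventionally 730 hours (365 * 24 / 12).
SECONDS_PER = {
    "year": 365 * 24 * 3600,
    "month": 730 * 3600,
    "week": 7 * 24 * 3600,
    "day": 24 * 3600,
}

def downtime_minutes(uptime_pct: float, period: str) -> float:
    """Maximum downtime (in minutes) allowed at a given uptime percentage."""
    return SECONDS_PER[period] * (1 - uptime_pct / 100) / 60

# Four 9s leaves about 52.56 minutes of downtime per year.
```

Running this for 99.99% and "year" reproduces the 52.56-minute budget in the table.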
Internet Architecture Overview
Most modern internet systems adopt a micro‑service architecture consisting of the following layers:
Access layer – usually F5 hardware or LVS software handling all inbound traffic.
Reverse‑proxy layer – Nginx for URL routing, rate limiting, etc.
Gateway – flow control, risk control, protocol conversion.
Site layer – aggregates basic services (member, promotion) and returns JSON to clients.
Base services – infrastructure‑level micro‑services used by upper layers.
Storage layer – databases such as MySQL, Oracle.
Middleware – Zookeeper, Redis, Elasticsearch, MQ, etc.
Each component must be made highly available.
Access & Reverse‑Proxy Layer
Both layers achieve HA through keepalived and LVS in a master‑backup configuration. The master holds the virtual IP (VIP); if it fails, keepalived detects the heartbeat loss and promotes the backup to master, causing the VIP to “float” to the backup node. Keepalived can also monitor Nginx health and remove failed instances from the LVS pool.
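A minimal keepalived VRRP instance for the master node might look like the fragment below. The interface name, router id, priority, and VIP are placeholders, not values from this article:

```
vrrp_instance VI_1 {
    state MASTER            # the backup node declares "state BACKUP"
    interface eth0          # NIC that carries the VIP (placeholder)
    virtual_router_id 51    # must match on master and backup
    priority 100            # backup uses a lower value, e.g. 90
    advert_int 1            # VRRP advertisement (heartbeat) interval, seconds
    virtual_ipaddress {
        10.0.0.100          # the VIP that floats to the backup on failover
    }
}
```

When the backup stops receiving advertisements from the master, it takes over the VIP; a `vrrp_script` block can additionally run an Nginx health check so a failed proxy triggers the same failover.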
Micro‑service Layer (Dubbo Example)
Dubbo providers register themselves to a registry (e.g., Zookeeper or Nacos). Consumers subscribe to the registry and obtain a list of available providers. If a provider becomes unavailable, the registry’s heartbeat mechanism removes it from the list, enabling automatic failover similar to keepalived.
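The consumer side of this pattern can be sketched as follows. This is a simplified illustration of a failover cluster strategy (Dubbo's default), not Dubbo's actual implementation; `providers` stands for the live list the registry pushed, and `call` for the RPC itself:

```python
import random

def invoke_with_failover(providers, call, retries=2):
    """Sketch of consumer-side failover: if a call to one provider fails,
    retry the same invocation on another provider from the registry list."""
    candidates = list(providers)
    random.shuffle(candidates)  # crude random load balancing
    last_err = None
    for provider in candidates[: retries + 1]:
        try:
            return call(provider)
        except ConnectionError as err:
            last_err = err  # this provider is down; try the next one
    raise RuntimeError("all providers failed") from last_err
```

Because the registry prunes dead providers via heartbeats, the list shrinks on its own; the retry loop only has to bridge the window before the registry notices.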
Middleware
Zookeeper
Zookeeper provides HA via a leader‑follower model. The single leader handles transaction ordering, while followers replicate data. If the leader fails, followers hold an election (using the ZAB protocol) to select a new leader, eliminating the single‑point‑of‑failure.
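The election's tie-breaking rule can be sketched in a few lines. This is a deliberately simplified view of ZAB's fast leader election, in which servers vote for the candidate with the most up-to-date state:

```python
def pick_leader(votes):
    """Simplified ZAB vote comparison: the candidate with the highest
    (epoch, zxid, server id) tuple wins, so the new leader is guaranteed
    to have seen the most recent committed transactions."""
    return max(votes, key=lambda v: (v["epoch"], v["zxid"], v["sid"]))
```

The ordering matters: comparing zxid before server id ensures no committed write is lost during failover.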
Redis
Redis HA can be deployed in master‑slave mode with Sentinel as the arbitrator. Sentinel clusters use gossip to detect master failures and Raft‑based elections to promote a slave to master. In cluster (sharding) mode, data is split into slots across multiple masters, each with its own replicas; Raft is used to elect a new master if one fails.
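The slot mapping in cluster mode is simple enough to show directly. Redis Cluster hashes each key with CRC16 (the XMODEM variant) modulo 16384; the sketch below ignores hash tags (`{...}`) for brevity:

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XMODEM), the checksum Redis Cluster hashes keys with."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    """Map a key to one of Redis Cluster's 16384 slots (real Redis hashes
    only the substring inside {...} when a hash tag is present)."""
    return crc16(key.encode()) % 16384
```

Each master owns a range of slots, so when one fails and a replica is promoted, only that slot range is affected.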
Elasticsearch
ES stores data in primary and replica shards spread across multiple nodes. A master node manages cluster state and shard allocation. If the master fails, the remaining master-eligible nodes elect a new one (older versions used a Bully-style Zen discovery algorithm; since 7.0, ES uses a quorum-based election). Any node can coordinate read/write requests, routing each write to the appropriate primary shard, which then replicates it to its replicas.
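The routing step can be sketched as below. ES actually uses murmur3 on the routing value (which defaults to the document `_id`); sha1 stands in here purely for illustration:

```python
import hashlib

def route_to_shard(routing: str, num_primary_shards: int) -> int:
    """Sketch of ES document routing: hash(routing) % number_of_primary_shards.
    ES uses murmur3 internally; sha1 is a stand-in for illustration."""
    digest = hashlib.sha1(routing.encode()).hexdigest()
    return int(digest, 16) % num_primary_shards
```

This formula is also why the number of primary shards is fixed at index creation: changing the modulus would remap every existing document.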
Message Queue (Kafka)
Kafka achieves HA by replicating each partition across multiple brokers. One replica acts as the leader and serves all reads and writes; the followers actively fetch from it, and those that keep up form the in-sync replica set (ISR). If the leader's broker crashes, a new leader is elected from the ISR, ensuring continuous message delivery.
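A simplified view of what the controller does on leader failure:

```python
def elect_partition_leader(replica_list, isr, alive_brokers):
    """Simplified Kafka partition leader election: promote the first replica
    (in assignment order) that is both in the in-sync replica set (ISR)
    and still alive."""
    for broker in replica_list:
        if broker in isr and broker in alive_brokers:
            return broker
    # No live in-sync replica left; proceeding would require an unclean
    # election (unclean.leader.election.enable=true) and risk data loss.
    return None
```

Restricting the choice to the ISR is what guarantees that acknowledged messages survive the failover.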
Storage Layer (MySQL Example)
MySQL HA follows the same master‑slave pattern, often protected by keepalived and a VIP. For large data volumes, sharding (multiple masters) is used, each with its own slaves, and the same HA mechanisms apply.
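One common sharding scheme is modulo routing on a shard key. A minimal sketch, with a hypothetical shard count and naming scheme:

```python
N_SHARDS = 4  # assumed number of MySQL masters (placeholder)

def shard_for(user_id: int) -> str:
    """Route a row to one of several MySQL masters by modulo sharding on
    the user id. Each master has its own slaves and keepalived-managed VIP,
    so a failure in one shard does not affect the others."""
    return f"mysql-master-{user_id % N_SHARDS}"
```

Since each shard is an independent master-slave group, the failover mechanism is the same as in the single-master case, just replicated per shard.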
Beyond HA – Operational Practices
Even with HA at the component level, systems must handle traffic spikes, DDoS attacks, code bugs, deployment issues, third‑party failures, and natural disasters. Practices such as isolation, rate limiting, circuit breaking, risk control, graceful degradation, comprehensive monitoring, automated alerts, unit testing, full‑link stress testing, and rapid rollback are essential.
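As one concrete example of the rate limiting mentioned above, here is a minimal token-bucket limiter; `rate` (tokens refilled per second) and `capacity` (burst size) are illustrative parameters:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: requests consume tokens, tokens
    refill at a steady rate, and bursts up to `capacity` are allowed."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Return True if the request may proceed, consuming one token."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Placed at the gateway or reverse-proxy layer, such a limiter sheds excess load before it can cascade into the services behind it.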
Conclusion
The core ideas of high availability are redundancy and automatic failover. Most components adopt a single‑master plus multiple slaves design because maintaining consistency across multiple masters is complex. Combining HA with robust operational safeguards yields truly reliable internet services.
IT Services Circle
Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.