Mastering High Availability: 4 Failover Patterns Explained
Designing for high availability means mastering replication and fail-over, balancing RTO against RPO, and choosing among four patterns (Active-Standby, Active-Active, Cold Standby, and Hot Standby), each with distinct synchronization, load-balancing, and cost trade-offs.
High‑availability (HA) architectures rely on two complementary mechanisms: data replication for redundancy and automatic fail‑over to switch services when a fault is detected. The effectiveness of a fail‑over solution is measured by:
RTO (Recovery Time Objective): the elapsed time from failure to service restoration.
RPO (Recovery Point Objective): the maximum amount of data that may be lost, usually expressed as a window of time; an RPO of five minutes means at most the last five minutes of writes can disappear.
Active‑Standby Mode
In this pattern a single active node processes all read/write requests, while a standby node remains idle but continuously synchronizes its state with the active node.
Roles
Active node: handles client traffic (e.g., INSERT/UPDATE in a database).
Standby node: maintains a real‑time “shadow” copy; it only receives traffic after a fail‑over.
Replication methods
Synchronous replication: the active node waits for an acknowledgment from the standby before confirming success to the client. Guarantees RPO = 0 but adds latency. Common in financial transaction systems.
Asynchronous replication: the active node returns to the client immediately; data is propagated in the background. Typical RPO is a few seconds (RPO < 5 s) with minimal performance impact.
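To make the trade-off concrete, here is a minimal in-process sketch of the two modes. The `Standby` class is a toy stand-in for a networked replica, not any real database API: synchronous mode blocks each write on the standby's acknowledgment, while asynchronous mode confirms immediately and drains a backlog in the background, and whatever sits in that backlog when the active node dies is exactly the potential data loss.

```python
import queue
import threading


class Standby:
    def __init__(self) -> None:
        self.data: dict[str, str] = {}

    def apply(self, key: str, value: str) -> bool:
        self.data[key] = value   # persist the replicated write
        return True              # acknowledgment back to the active node


class ActiveNode:
    def __init__(self, standby: Standby, synchronous: bool) -> None:
        self.standby = standby
        self.synchronous = synchronous
        self.data: dict[str, str] = {}
        self._backlog: queue.Queue = queue.Queue()
        if not synchronous:
            # Background drain: replication lag in this queue is the RPO.
            threading.Thread(target=self._drain, daemon=True).start()

    def write(self, key: str, value: str) -> None:
        self.data[key] = value
        if self.synchronous:
            # Block until the standby acknowledges -> RPO = 0,
            # at the cost of a replication round trip per write.
            assert self.standby.apply(key, value)
        else:
            # Confirm immediately; anything still queued here is lost
            # if the active node dies now (typical RPO: seconds).
            self._backlog.put((key, value))

    def _drain(self) -> None:
        while True:
            key, value = self._backlog.get()
            self.standby.apply(key, value)


primary = ActiveNode(Standby(), synchronous=True)
primary.write("balance", "100")  # returns only after the standby has the write
```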
Health monitoring: a heartbeat (often every second) checks the active node. Three consecutive missed heartbeats trigger a fail-over after a secondary verification step to avoid false alarms (sketched in code at the end of this section).
Fail-over orchestration: cluster managers such as ZooKeeper or load balancers like HAProxy execute the switch.
Typical implementations: MySQL master-slave replication, Oracle Data Guard, PostgreSQL streaming replication.
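The detection rule described under "Health monitoring" fits in a few lines. In this sketch, `probe`, `verify`, and `promote_standby` are hypothetical callables standing in for whatever checks and switch-over logic your cluster manager actually performs.

```python
import time

HEARTBEAT_INTERVAL = 1.0  # seconds between probes
MISS_THRESHOLD = 3        # consecutive misses before acting


def monitor(probe, verify, promote_standby) -> None:
    """Watch the active node; promote the standby on a confirmed failure."""
    misses = 0
    while True:
        if probe():       # did the active node answer this heartbeat?
            misses = 0
        else:
            misses += 1
        if misses >= MISS_THRESHOLD:
            # Secondary verification (e.g., a probe over a second
            # network path) guards against false alarms from a flaky link.
            if verify():
                misses = 0         # active node is fine; blame the network
            else:
                promote_standby()  # confirmed failure: execute the fail-over
                return
        time.sleep(HEARTBEAT_INTERVAL)
```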
Active‑Active Mode
All nodes are peers; each can accept read and write operations. Nodes are interconnected in a mesh topology, allowing capacity to scale roughly linearly with the number of nodes.
Load‑balancing strategies
Round-robin: cycles through nodes sequentially; simple but ignores current load.
Least connections: directs traffic to the node with the fewest active connections; suited for long-lived connections (e.g., WebSocket).
Consistent hashing: hashes request attributes to preserve session affinity; useful for stateful services, though pinning requests to fixed nodes can create hot spots under skewed traffic.
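As an illustration of the third strategy, here is a minimal consistent-hash ring with virtual nodes. Production balancers layer on weights and bounded loads, but the core lookup is just this.

```python
import bisect
import hashlib


class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 100) -> None:
        # Each physical node appears `vnodes` times on the ring,
        # which smooths out the key distribution.
        self._ring: list[tuple[int, str]] = []
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, request_key: str) -> str:
        # First virtual node clockwise from the key's hash; wrap at the end.
        i = bisect.bisect(self._keys, self._hash(request_key)) % len(self._ring)
        return self._ring[i][1]


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("session-42"))  # the same session always maps to the same node
```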
Data consistency
Strong consistency: achieved with distributed consensus protocols such as Raft or Paxos, requiring a majority (⌊N/2⌋+1) of nodes to acknowledge writes (see the sketch after this list).
Eventual consistency: relies on asynchronous replication and conflict-resolution logic; data may diverge temporarily but converges later.
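A sketch of the majority-write rule, with plain callables standing in for replica RPCs: the write commits only once ⌊N/2⌋+1 replicas acknowledge it, so a minority of failed replicas can neither block nor lose a committed write.

```python
def quorum_write(replicas, key: str, value: str) -> bool:
    """Commit a write once a majority of replicas acknowledge it."""
    majority = len(replicas) // 2 + 1  # ⌊N/2⌋ + 1
    acks = 0
    for replica in replicas:
        try:
            if replica(key, value):    # ack from this replica
                acks += 1
        except ConnectionError:
            pass                       # a minority of failures is tolerated
        if acks >= majority:
            return True                # committed: durable on a majority
    return False                       # no quorum -> the write must not commit


def up(key, value):
    return True


def down(key, value):
    raise ConnectionError


# 5 replicas, 2 of them down: the write still commits (3 acks >= 3 needed).
print(quorum_write([up, down, up, down, up], "x", "1"))  # True
```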
Design caveat: the system must tolerate "avalanche" effects. If a subset of nodes fails, the survivors must still absorb the full traffic load, so each node needs enough spare headroom; the back-of-envelope check below makes this concrete.
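When k of n equally loaded peers fail, each survivor's share of the traffic grows by a factor of n/(n − k), so steady-state utilization must leave at least that much headroom. A tiny helper makes the arithmetic explicit:

```python
def per_node_utilization(total_load: float, capacity_per_node: float,
                         nodes: int, failed: int = 0) -> float:
    """Fraction of one node's capacity consumed after `failed` peers drop out."""
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    return total_load / survivors / capacity_per_node


# 3 nodes at 60% utilization each: losing one pushes the survivors to 90%.
print(per_node_utilization(180, 100, nodes=3))            # 0.6
print(per_node_utilization(180, 100, nodes=3, failed=1))  # 0.9
```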
Cold Standby Mode
The standby server is powered off or left uninitialized until a failure occurs. Data is restored from periodic backups (e.g., daily snapshots).
Switching mechanism: can be automated with scripts or performed manually.
Recovery characteristics
RTO: minutes to several hours, depending on boot time and data restore speed.
RPO: equals the backup interval; if backups are daily, potential data loss can be up to 24 hours.
Use case: non-critical workloads with low update frequency, such as internal attendance systems or document repositories.
Hot Standby Mode
The backup node runs continuously and stays synchronized with the primary via incremental mechanisms (e.g., MySQL binary log, Redis AOF).
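A toy version of that incremental mechanism, with illustrative names rather than any real client API: the primary appends every mutation to an ordered change log (the role MySQL's binlog or Redis's AOF plays), and the standby replays only the entries past its last-applied offset.

```python
class Primary:
    def __init__(self) -> None:
        self.data: dict[str, str] = {}
        self.log: list[tuple[str, str]] = []   # append-only change log

    def set(self, key: str, value: str) -> None:
        self.data[key] = value
        self.log.append((key, value))


class HotStandby:
    def __init__(self) -> None:
        self.data: dict[str, str] = {}
        self.applied = 0                       # offset of the last replayed entry

    def catch_up(self, log: list[tuple[str, str]]) -> None:
        # Replay only the increment since the previous sync cycle.
        for key, value in log[self.applied:]:
            self.data[key] = value
        self.applied = len(log)


primary, standby = Primary(), HotStandby()
primary.set("a", "1")
primary.set("b", "2")
standby.catch_up(primary.log)                  # standby now mirrors the primary
assert standby.data == primary.data
```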
Replication options
Synchronous replication → RPO = 0 (zero data loss).
Asynchronous replication → RPO < 1 s (sub‑second loss).
Fail-over tools: Redis Sentinel, MySQL MHA, Keepalived, etc., typically achieve RTO < 5 s.
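As one concrete example, redis-py ships a Sentinel client that resolves the current primary at connection time, so an application keeps working across a fail-over. The sentinel addresses and the service name "mymaster" below are assumptions; replace them with the values from your sentinel.conf.

```python
from redis.sentinel import Sentinel

# Addresses of the sentinel processes (assumed hosts/ports for this sketch).
sentinel = Sentinel([("sentinel-1", 26379), ("sentinel-2", 26379)],
                    socket_timeout=0.5)

# Clients never hard-code the primary: they ask the sentinels who it
# currently is, so a fail-over is transparent on the next lookup.
master = sentinel.master_for("mymaster", socket_timeout=0.5)
replica = sentinel.slave_for("mymaster", socket_timeout=0.5)

master.set("key", "value")   # writes go to whichever node is primary right now
print(replica.get("key"))    # reads can be served by a replica
```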
Key properties
Real-time readiness: the standby continuously mirrors the primary's in-memory state and on-disk data; some designs also replicate session state so client connections survive the switch.
Seamless transition: users experience little to no interruption.
Zero or near‑zero data loss as described above.
Conclusion
Each fail‑over pattern emphasizes different trade‑offs:
Active‑Standby and Hot Standby prioritize data consistency and rapid recovery (low RTO, low RPO).
Active‑Active maximizes throughput and scalability at the cost of more complex consistency management.
Cold Standby minimizes cost but incurs longer RTO and higher RPO.
In practice, teams often combine these patterns and integrate them with orchestration platforms such as Kubernetes, adding health‑check verification and data‑integrity validation to ensure reliable fail‑over.
Ma Wei Says
Follow me for discussions of software architecture and development, AIGC, and AI Agents, plus occasional insights into life as an IT professional.
