Beyond Redundancy: The Real Secrets to Building Truly High‑Availability Systems
This article explains that high‑availability systems involve more than just redundant hardware or software; they require careful design, data consistency strategies, realistic SLA calculations, and disciplined engineering practices to achieve the coveted “nines” of uptime.
I often see discussions of high‑availability (HA) that focus only on a company’s technical stack, but HA is far more than a set of technologies; it also encompasses design principles, operational processes, and organizational culture.
Understanding High Availability
High Availability (HA) means keeping the entire computing environment, both software and hardware, available at all times. Typical design requirements include:
Redundant hardware and software to eliminate single points of failure; standby nodes take over when needed.
Failure detection and automatic failover to a backup node.
Reliable crossover: the switchover points themselves, such as DNS and load balancers, are hard to duplicate and can become single points of failure, so they must be highly reliable.
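Failure detection and automatic failover, as listed above, can be sketched with a simple heartbeat timeout. This is a minimal illustration, not a production design: the `Node` and `FailoverMonitor` names and the 3-second timeout are assumptions made for the example, and a real system would also need fencing to avoid two nodes acting as primary at once.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float  # timestamp (seconds) of the most recent heartbeat


@dataclass
class FailoverMonitor:
    primary: Node
    standbys: list
    timeout: float = 3.0  # hypothetical threshold; tune per environment

    def is_alive(self, node, now):
        # A node counts as alive if its last heartbeat is recent enough.
        return (now - node.last_heartbeat) <= self.timeout

    def check(self, now):
        """Return the current primary, promoting a live standby if needed."""
        if self.is_alive(self.primary, now):
            return self.primary
        for i, candidate in enumerate(self.standbys):
            if self.is_alive(candidate, now):
                failed = self.primary
                self.primary = self.standbys.pop(i)
                self.standbys.append(failed)  # old primary becomes a standby
                return self.primary
        raise RuntimeError("no live standby available")
```

Calling `check(now)` periodically is enough to drive the sketch; the monitor itself is still a single point of failure here, which is the crossover problem the list describes.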
While these concepts sound simple, the real challenge lies in stateful node data replication and ensuring consistency. Asynchronous replication can cause data divergence during failover, while synchronous replication may degrade performance as the number of replicas grows.
Consequently, HA solutions always involve trade‑offs that must align with business characteristics—for example, bank account balances require strong consistency, whereas order logs can tolerate eventual consistency.
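The divergence risk of asynchronous replication can be shown with a toy in-memory model. This is a sketch under simplifying assumptions: `ReplicatedStore` is a hypothetical class with one primary and one replica, and network latency is not modeled, so it shows only which acknowledged writes survive a failover, not the performance cost of the synchronous mode.

```python
class ReplicatedStore:
    """Toy primary/replica pair; illustrative only."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = []
        self.replica = []
        self.pending = []  # acknowledged but not yet shipped (async mode only)

    def write(self, record):
        self.primary.append(record)
        if self.synchronous:
            self.replica.append(record)  # acknowledge only after the replica has it
        else:
            self.pending.append(record)  # acknowledge immediately, ship later

    def ship(self):
        # Background replication step for asynchronous mode.
        self.replica.extend(self.pending)
        self.pending.clear()

    def failover(self):
        """Primary crashes and the replica is promoted.

        Returns the acknowledged writes that were lost.
        """
        lost = list(self.pending)
        self.primary = list(self.replica)
        self.pending.clear()
        return lost
```

In asynchronous mode, any write made after the last `ship()` is reported as lost by `failover()`; in synchronous mode the list is always empty, at the price of waiting for the replica on every write.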
Key Design Principles for HA
Persist data to avoid loss.
Provide standby replicas for both application and data nodes.
Address data‑consistency challenges that arise from replication.
Accept that 100 % availability is impossible; aim for a certain number of “9s” in the SLA.
Technical Solutions for High Availability
A diagram from the Google I/O 2009 talk "Transaction Across DataCenter", which I use as a reference, summarizes the foundational HA approaches:
These approaches include Master/Slave (M/S), Multi‑Master (MM), two‑phase commit (2PC), and Paxos. Each has drawbacks: 2PC suffers from poor performance, while Paxos is complex to implement.
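To make the performance cost of 2PC concrete, here is a minimal sketch of the protocol's happy path and abort path. The `Participant` class and `two_phase_commit` function are hypothetical names for illustration; coordinator crashes, timeouts, and durable logging, which are where real implementations get hard, are deliberately left out.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if the local work can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if the transaction committed on every participant."""
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only on a unanimous yes, otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

Every transaction needs two rounds of messages to all participants, and a single "no" vote or slow participant holds up everyone, which is one way to see why the approach is considered slow.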
Example: MySQL High‑Availability Solutions
The chart below (source: MySQL High Availability Solutions) shows the SLA “nines” each solution can achieve, with more nines implying greater complexity:
MySQL Replication (asynchronous or semi‑synchronous) – less than 2 nines.
MySQL Fabric (sharding with read/write split) – about 99 % (2 nines).
DRBD (disk‑level mirroring, similar to RAID‑1) – less than 3 nines.
Solaris Clustering/Oracle VM (full‑stack solution with heartbeat, SAN, virtualization) – close to 4 nines.
MySQL Cluster (NDB with fully synchronous replication) – close to 5 nines.
Defining SLA “nines”
The SLA is measured by the allowable downtime per year, month, week, or day. For example:
99 % (2 nines) – up to 3.65 days of downtime per year.
99.9 % (3 nines) – up to 8.76 hours per year.
99.99 % (4 nines) – up to 52.56 minutes per year.
99.999 % (5 nines) – up to 5.26 minutes per year.
Even a 3‑nine SLA allows only about 43 minutes of downtime per month (0.1 % of a 30‑day month), which many teams cannot achieve without automated tools and disciplined processes.
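The downtime figures above follow directly from the availability percentage. A small helper, with a function name chosen for this example and a 365-day year assumed by default, reproduces them:

```python
def allowed_downtime_seconds(availability_percent, period_days=365):
    """Downtime budget, in seconds, for a given availability over a period."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_days * 24 * 3600


# Yearly budgets for the four levels listed above.
for pct in (99, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_seconds(pct) / 60
    print(f"{pct}%: {minutes:.2f} minutes per year")

# Monthly budget for three nines, assuming a 30-day month.
print(allowed_downtime_seconds(99.9, period_days=30) / 60, "minutes per month")
```

Changing `period_days` gives the weekly or daily budget in the same way.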
Factors Influencing High Availability
Many variables affect a system’s SLA, including software design, hardware reliability, third‑party services, and even external factors like construction work. The SLA is therefore a contract between service provider and user, not just a technical metric.
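One way to see how dependencies such as third-party services pull the overall figure down is the standard composite-availability arithmetic. The sketch below assumes component failures are independent, which rarely holds exactly in practice, and the function names are made up for this example.

```python
from math import prod


def serial_availability(availabilities):
    """All components must be up: availabilities multiply."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """Any one redundant component suffices: unavailabilities multiply."""
    return 1 - prod(1 - a for a in availabilities)


# Two 99.9 % services in a chain are worse than either one alone.
print(serial_availability([0.999, 0.999]))

# Two redundant 99 % nodes are better than either one alone.
print(parallel_availability([0.99, 0.99]))
```

Under these assumptions, a chain of dependencies can never be more available than its weakest link, while redundancy helps only as far as the failures really are independent.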
Unplanned Downtime Causes
System‑level failures: host, OS, middleware, database, network, power, peripherals.
Data and media failures: human error, disk crashes, data corruption.
External events: natural disasters, sabotage, power outages.
Planned Downtime Causes
Routine tasks: backups, capacity planning, user/security management, batch jobs.
Maintenance: database, application, middleware, OS, network upkeep.
Upgrades: software, middleware, OS, network, hardware.
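Both kinds of downtime can be tallied against the SLA from an outage log. The sketch below is illustrative: the `(kind, minutes)` record format and the `count_planned` flag are assumptions, since whether planned maintenance counts against the SLA depends on how the agreement is written.

```python
def achieved_availability(outages, period_minutes, count_planned=True):
    """Availability in percent over a period.

    outages: list of (kind, minutes), where kind is "planned" or "unplanned".
    """
    down = sum(
        minutes
        for kind, minutes in outages
        if count_planned or kind == "unplanned"
    )
    return 100 * (1 - down / period_minutes)


# Hypothetical log for one 30-day month (43,200 minutes).
log = [("planned", 30), ("unplanned", 13.2)]
print(achieved_availability(log, 30 * 24 * 60))
print(achieved_availability(log, 30 * 24 * 60, count_planned=False))
```

With this made-up log, a single 30-minute maintenance window uses most of a three-nines monthly budget, which is why planned downtime deserves as much attention as failures.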
What Truly Determines HA Success
Achieving high availability demands rigorous engineering management, advanced automation, skilled engineers, and robust operational processes. It also requires a culture that respects engineering as a science, with leadership that values it.
Software design, coding, testing, deployment, and configuration management.
Engineer skill levels.
Operations management and technical capability.
Data‑center operational excellence.
Management of third‑party service dependencies.
Beyond technical tactics, the deeper factors are attitudes toward technology, engineering culture, and leadership support.
So, when someone claims their system is highly available, evaluate not only the architecture but also the organization’s engineering discipline and respect for the science of reliability.
Source: https://coolshell.cn/articles/17459.html
ITFLY8 Architecture Home
