Understanding High‑Availability Systems: Design Principles, Technical Solutions, and SLA Measurement
This article explains the comprehensive concept of high‑availability systems, covering redundancy, failover, consistency challenges, various technical solutions, SLA definitions, and the organizational and engineering practices required to achieve multiple “9s” of availability.
I have often seen discussions about high‑availability systems that focus only on the technical solutions of individual companies, but a truly high‑availability system involves much more than technology; it also requires proper design, management, and operational practices.
Understanding High‑Availability Systems
High Availability (HA) means keeping the computing environment—including software and hardware—available full‑time. Designing HA typically involves:
Redundancy of hardware and software to eliminate single points of failure, with standby components ready to take over.
Failure detection and recovery (failover) so that backup nodes can assume the role of a failed component.
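Reliable crossover: the points where traffic switches between redundant nodes, such as DNS or load balancers, are hard to make redundant themselves and can become single points of failure.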
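The failure detection and failover step above can be sketched as a heartbeat monitor. This is a minimal illustration, not a production design: the `Node` and `FailoverMonitor` names and the 3-second timeout are assumptions made for this example, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float  # timestamp (seconds) of the most recent heartbeat


@dataclass
class FailoverMonitor:
    """Promotes the standby when the active node stops sending heartbeats."""
    active: Node
    standby: Node
    timeout: float = 3.0  # hypothetical threshold, in seconds

    def heartbeat(self, node_name: str, now: float) -> None:
        """Record a heartbeat received from the named node."""
        for node in (self.active, self.standby):
            if node.name == node_name:
                node.last_heartbeat = now

    def check(self, now: float) -> str:
        """Return the name of the node that should serve traffic at `now`."""
        active_dead = now - self.active.last_heartbeat > self.timeout
        standby_alive = now - self.standby.last_heartbeat <= self.timeout
        if active_dead and standby_alive:
            # Failover: the standby takes over and the old active becomes standby.
            self.active, self.standby = self.standby, self.active
        return self.active.name


monitor = FailoverMonitor(Node("a", 0.0), Node("b", 0.0))
monitor.heartbeat("a", 1.0)
monitor.heartbeat("b", 1.0)
print(monitor.check(2.0))   # both nodes healthy, "a" keeps serving
monitor.heartbeat("b", 4.0)  # "a" has gone silent since t=1.0
print(monitor.check(5.0))   # "a" timed out, "b" is promoted
```

Note that the monitor itself is a crossover point: if it runs on a single machine, it becomes the single point of failure that the design was meant to remove.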
The real difficulty lies in stateful components: replicating data and guaranteeing consistency across redundant nodes is the most challenging aspect of HA.
If data replication to redundant nodes is asynchronous, a failover can cause data divergence.
If replication is synchronous, performance degrades as the number of redundant nodes grows.
Consequently, HA designs must balance consistency and performance based on business characteristics—for example, bank account balances require strong consistency, while order logs can tolerate eventual consistency.
Data loss prevention requires persistence.
Service HA requires standby replicas for both application and data nodes.
Replication inevitably introduces consistency challenges.
Absolute 100% availability is impossible; we aim for several “9s” of SLA.
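The trade-off between asynchronous and synchronous replication described above can be made concrete with a toy model. This is a sketch under simplifying assumptions: one primary, one replica, no network, and a hypothetical `ReplicatedStore` class invented for this example. It only shows where acknowledged writes can be lost.

```python
from typing import List


class ReplicatedStore:
    """Toy primary with a single replica; not a real replication protocol."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.primary: List[str] = []
        self.replica: List[str] = []
        self.pending: List[str] = []  # acknowledged but not yet on the replica

    def write(self, record: str) -> None:
        self.primary.append(record)
        if self.synchronous:
            # Synchronous: acknowledge only once the replica also has the record.
            self.replica.append(record)
        else:
            # Asynchronous: acknowledge immediately, ship to the replica later.
            self.pending.append(record)

    def flush(self) -> None:
        """Background replication step: ship queued writes to the replica."""
        self.replica.extend(self.pending)
        self.pending.clear()

    def failover(self) -> List[str]:
        """The primary dies and the replica takes over. Returns the lost writes."""
        lost = list(self.pending)
        self.primary = list(self.replica)
        self.pending.clear()
        return lost


async_store = ReplicatedStore(synchronous=False)
async_store.write("order-1")
async_store.flush()
async_store.write("order-2")        # acknowledged, not yet replicated
print(async_store.failover())       # ['order-2'] was acknowledged but is gone

sync_store = ReplicatedStore(synchronous=True)
sync_store.write("order-1")
sync_store.write("order-2")
print(sync_store.failover())        # [] because every write reached the replica
```

In a real system the synchronous path pays for this safety with a network round trip to every replica on each write, which is the performance cost the text refers to.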
Technical Solutions for High Availability
The diagram below (originally from a 2009 Google I/O talk) illustrates the basic building blocks of HA solutions, including Master/Slave (M/S) and Multi‑Master (MM) architectures. While M/S and MM are straightforward, they bring issues such as the performance‑heavy two‑phase commit (2PC) and the complexity of Paxos.
Key drawbacks of common HA approaches:
Eventual consistency can lead to data divergence during outages.
Strong consistency often relies on slow XA two‑phase commit or complex Paxos protocols.
Open‑source software (caches, message queues, databases) usually provides persistence and replication, but its availability guarantees are typically lower than those of commercial solutions.
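To see why two‑phase commit is considered slow, it helps to look at its shape. The following is a minimal in-memory sketch, assuming hypothetical `Participant` objects; it omits the timeouts, durable logging, and coordinator crash recovery that a real XA implementation needs.

```python
class Participant:
    """A resource (for example a database) taking part in the transaction."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        # Phase 1: vote yes only if the local work can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants) -> bool:
    """Return True if the transaction committed on every participant."""
    # Phase 1 (prepare): the coordinator must collect a vote from everyone.
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2 (commit): only reached after a unanimous yes.
        for p in participants:
            p.commit()
        return True
    # Any single "no" vote aborts the transaction everywhere.
    for p in participants:
        p.rollback()
    return False


ok = two_phase_commit([Participant("db"), Participant("queue")])
failed = two_phase_commit([Participant("db"), Participant("queue", can_commit=False)])
print(ok, failed)  # True False
```

Every transaction costs two rounds of messages to all participants, and the slowest participant sets the pace, so latency grows as more nodes are added.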
High‑Availability Solutions Example: MySQL
The following chart shows the SLA (number of “9s”) for various MySQL HA designs, with more “9s” indicating higher complexity.
MySQL Replication (asynchronous or semi‑synchronous) provides less than 2 “9s”.
MySQL Fabric (sharding with read/write split) reaches about 99% (2 “9s”).
DRBD (disk‑level mirroring) offers up to 3 “9s”.
Solaris Clustering/Oracle VM with heartbeat and SAN delivers around 4 “9s”.
MySQL Cluster with fully synchronous NDB replication approaches 5 “9s”.
Defining HA SLA
Service Level Agreement (SLA) quantifies availability as a percentage, commonly expressed as the number of “9s”. The table below shows the corresponding downtime per year, month, week, and day.
| Availability % | Downtime / year | Downtime / month (30 days) | Downtime / week | Downtime / day |
|---|---|---|---|---|
| 90% (1 “9”) | 36.5 days | 72 hours | 16.8 hours | 2.4 hours |
| 99% (2 “9s”) | 3.65 days | 7.20 hours | 1.68 hours | 14.4 minutes |
| 99.9% (3 “9s”) | 8.76 hours | 43.2 minutes | 10.1 minutes | 1.44 minutes |
| 99.99% (4 “9s”) | 52.56 minutes | 4.32 minutes | 1.01 minutes | 8.64 seconds |
| 99.999% (5 “9s”) | 5.26 minutes | 25.9 seconds | 6.05 seconds | 0.86 seconds |
Even a 3 “9” SLA allows only about 40 minutes of downtime per month; relying on manual, slow incident response cannot meet such guarantees.
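The downtime budget for any availability target follows from one line of arithmetic: allowed downtime = period length × (1 − availability). The sketch below assumes a 365-day year and a 30-day month; tables that use an average-length month (about 30.4 days) show slightly larger monthly figures, for example 43.8 instead of 43.2 minutes at 99.9%. The function and constant names are chosen for this example.

```python
PERIOD_SECONDS = {
    "year": 365 * 24 * 3600,
    "month": 30 * 24 * 3600,  # assumption: a 30-day month
    "week": 7 * 24 * 3600,
    "day": 24 * 3600,
}


def allowed_downtime_seconds(availability_percent: float, period: str) -> float:
    """Maximum downtime, in seconds, that still meets the availability target."""
    return PERIOD_SECONDS[period] * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    per_year_minutes = allowed_downtime_seconds(target, "year") / 60
    per_month_minutes = allowed_downtime_seconds(target, "month") / 60
    print(f"{target}%: {per_year_minutes:.2f} min/year, {per_month_minutes:.2f} min/month")
```

Running the loop makes the tenfold tightening per added “9” visible: each extra “9” divides the budget by ten.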
Factors Influencing High Availability
Availability is affected by software design, hardware reliability, third‑party services, and even external factors like power outages or construction work. An SLA is therefore a contract between provider and user that covers both planned and unplanned downtime.
Unplanned Downtime Causes
System‑level failures: hardware, OS, middleware, DB, network, power.
Data or middleware failures: human error, disk failure, data corruption.
Natural disasters or sabotage.
Planned Downtime Causes
Routine tasks: backups, capacity planning, user/security management, batch jobs.
Maintenance: DB, application, middleware, OS, network upkeep.
Upgrades: software, middleware, OS, network, hardware.
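Because the SLA counts planned and unplanned downtime alike, measuring achieved availability means summing both kinds of outage over the reporting period. The following is a small sketch; the `Outage` record and the sample durations are hypothetical and only illustrate the bookkeeping.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Outage:
    minutes: float
    planned: bool  # True for maintenance windows, False for incidents


def availability_percent(outages: Iterable[Outage], period_minutes: float) -> float:
    """Achieved availability over the period, counting every outage."""
    down = sum(o.minutes for o in outages)
    return 100 * (1 - down / period_minutes)


def downtime_by_kind(outages: Iterable[Outage]) -> dict:
    """Split total downtime into planned and unplanned minutes."""
    totals = {"planned": 0.0, "unplanned": 0.0}
    for o in outages:
        totals["planned" if o.planned else "unplanned"] += o.minutes
    return totals


# Hypothetical month: one maintenance window and one incident.
month_minutes = 30 * 24 * 60
outages = [Outage(minutes=30, planned=True), Outage(minutes=13.2, planned=False)]
print(round(availability_percent(outages, month_minutes), 3))  # 99.9
print(downtime_by_kind(outages))
```

In this example the maintenance window alone uses most of a 3 “9” monthly budget, which is why planned work has to be designed to run without taking the service down.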
What Truly Determines High Availability
Achieving a 5 “9” SLA (roughly 5 minutes of downtime per year) requires disciplined engineering management, advanced automation tools, and highly skilled teams, not just clever technical designs. The decisive factors include:
Software design, coding, testing, deployment, and configuration management.
Engineer skill levels.
Operations management and tooling.
Data‑center operational excellence.
Management of third‑party service dependencies.
Beyond technical practices, a company must respect engineering as a discipline, fostering a culture that values rigorous processes, code reviews, CI/CD, and automated testing.
In summary, when evaluating a system’s claimed high availability, examine not only the architecture but also the organization’s engineering culture, processes, and commitment to reliability.