
Understanding High‑Availability Systems: Design Principles, Technical Solutions, and SLA Measurement

This article explains the comprehensive concept of high‑availability systems, covering redundancy, failover, consistency challenges, various technical solutions, SLA definitions, and the organizational and engineering practices required to achieve multiple “9s” of availability.

Architecture Digest

Aug 22, 2016

I have often seen discussions about high‑availability systems that focus only on the technical solutions of individual companies, but a truly high‑availability system involves much more than technology; it also requires proper design, management, and operational practices.

Understanding High‑Availability Systems

High Availability (HA) means keeping the computing environment, both software and hardware, continuously available. Designing for HA typically involves:

Redundancy of hardware and software to eliminate single points of failure, with standby components ready to take over.

Failure detection and recovery (failover) so that backup nodes can assume the role of a failed component.

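Reliable crossover points: the components that redirect traffic to the surviving node, such as DNS or load balancers, are themselves difficult to make redundant and must not become a single point of failure.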
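The failure detection and failover step can be illustrated with a small heartbeat monitor. This is only a sketch under assumptions: the `Node` and `FailoverMonitor` names, the 3-second timeout, and the use of plain timestamps are hypothetical choices for illustration, not part of any particular HA product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time (in seconds) of the most recent heartbeat


@dataclass
class FailoverMonitor:
    """Promotes the first healthy standby when the primary's heartbeat is stale."""
    primary: Node
    standbys: List[Node]
    timeout: float = 3.0  # assumed threshold, in seconds

    def is_alive(self, node: Node, now: float) -> bool:
        return now - node.last_heartbeat <= self.timeout

    def check(self, now: float) -> Node:
        """Return the current primary, failing over first if it looks dead."""
        if self.is_alive(self.primary, now):
            return self.primary
        for candidate in self.standbys:
            if self.is_alive(candidate, now):
                self.standbys.remove(candidate)
                self.standbys.append(self.primary)  # demote the old primary
                self.primary = candidate
                return self.primary
        raise RuntimeError("no healthy node available")


monitor = FailoverMonitor(Node("db-1", 0.0), [Node("db-2", 9.5)], timeout=3.0)
print(monitor.check(now=10.0).name)  # db-1 has been silent for 10 s, so this prints db-2
```

Note that the monitor itself is a crossover point: in a real deployment it would also need to be redundant, otherwise it simply moves the single point of failure.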

The real difficulty lies in stateful components: replicating data and guaranteeing consistency across redundant nodes is the most challenging aspect of HA.

If data replication to redundant nodes is asynchronous, a failover can cause data divergence.

If replication is synchronous, performance degrades as the number of redundant nodes grows.

Consequently, HA designs must balance consistency and performance based on business characteristics—for example, bank account balances require strong consistency, while order logs can tolerate eventual consistency.
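The trade-off between the two replication modes can be made concrete with a toy in-memory model. It is a sketch, not a real replication protocol: the `Primary` and `Replica` classes and the `ship_pending` method are hypothetical names, and network delay is not simulated.

```python
class Replica:
    def __init__(self):
        self.log = []


class Primary:
    def __init__(self, replicas, synchronous):
        self.log = []
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending = []  # acknowledged writes not yet shipped (async mode only)

    def write(self, record):
        self.log.append(record)
        if self.synchronous:
            # Acknowledge only after every replica has the write, so each
            # additional replica adds to the latency of every write.
            for replica in self.replicas:
                replica.log.append(record)
        else:
            # Acknowledge immediately; replicas catch up later.
            self.pending.append(record)

    def ship_pending(self):
        for record in self.pending:
            for replica in self.replicas:
                replica.log.append(record)
        self.pending.clear()


def lost_on_failover(synchronous):
    """Records the primary acknowledged that the replica never received."""
    replica = Replica()
    primary = Primary([replica], synchronous)
    primary.write("debit 100")
    primary.ship_pending()
    primary.write("credit 100")  # the primary crashes before shipping this one
    return [r for r in primary.log if r not in replica.log]


print(lost_on_failover(synchronous=False))  # ['credit 100']
print(lost_on_failover(synchronous=True))   # []
```

In asynchronous mode the promoted replica is missing an acknowledged write, which is exactly the divergence described above; in synchronous mode nothing is lost, at the cost of waiting for every replica on every write.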

Data loss prevention requires persistence.

Service HA requires standby replicas for both application and data nodes.

Replication inevitably introduces consistency challenges.

Absolute 100% availability is impossible; we aim for several “9s” of SLA.

Technical Solutions for High Availability

The diagram below (originally from a 2009 Google I/O talk) illustrates the basic building blocks of HA solutions, including Master/Slave (M/S) and Multi‑Master (MM) architectures. M/S and MM are simple to reason about, but keeping their replicas strongly consistent leads to two‑phase commit (2PC), which is costly in performance, or to Paxos, which is complex to implement.

Key drawbacks of common HA approaches:

Eventual consistency can lead to data divergence during outages.

Strong consistency often relies on slow XA two‑phase commit or complex Paxos protocols.

Open‑source software (caches, message queues, databases) usually provides persistence and replication, but the availability it guarantees is typically lower than that of commercial solutions.
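To see why two‑phase commit is considered expensive, it helps to look at its shape. The following is a minimal in-process sketch, assuming hypothetical `Participant` objects; a real implementation (for example an XA transaction manager) would also write durable logs and handle timeouts and coordinator failure.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote. A real participant would durably log its vote first.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if all voted yes, otherwise roll everyone back.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


nodes = [Participant("orders"), Participant("inventory", can_commit=False)]
print(two_phase_commit(nodes), [p.state for p in nodes])
# False ['aborted', 'aborted']
```

Every transaction needs two rounds of messages to all participants, and each participant holds its resources between the two phases, so throughput is bounded by the slowest node.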

High‑Availability Solutions Example: MySQL

The following chart shows the SLA (number of “9s”) for various MySQL HA designs. The designs that reach more “9s” are also the more complex ones.

MySQL Replication (asynchronous or semi‑synchronous) provides less than 2 “9s”.

MySQL Fabric (sharding with read/write split) reaches about 99% (2 “9s”).

DRBD (disk‑level mirroring) offers up to 3 “9s”.

Solaris Clustering/Oracle VM with heartbeat and SAN delivers around 4 “9s”.

MySQL Cluster with fully synchronous NDB replication approaches 5 “9s”.

Defining HA SLA

A Service Level Agreement (SLA) quantifies availability as a percentage, commonly expressed as the number of “9s”. The table below shows the corresponding downtime per year, month, week, and day.

| Availability % | Downtime / year | Downtime / month | Downtime / week | Downtime / day |
| --- | --- | --- | --- | --- |
| 90% (1 9) | 36.5 days | 72 hours | 16.8 hours | 2.4 hours |
| 99% (2 9) | 3.65 days | 7.20 hours | 1.68 hours | 14.4 minutes |
| 99.9% (3 9) | 8.76 hours | 43.2 minutes | 10.1 minutes | 1.44 minutes |
| 99.99% (4 9) | 52.56 minutes | 4.32 minutes | 1.01 minutes | 8.64 seconds |
| 99.999% (5 9) | 5.26 minutes | 25.9 seconds | 6.05 seconds | 0.86 seconds |

Figures assume a 365‑day year and a 30‑day month.

Even a 3 “9” SLA allows less than 45 minutes of downtime per month, so slow, manual incident response cannot meet such a guarantee.
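The downtime budget behind each “9” is simple arithmetic, and computing it shows how little room manual response leaves. The sketch below assumes a 365‑day year and a 30‑day month (figures based on an average‑length month differ slightly); the function names and the 30‑minute incident duration are illustrative assumptions.

```python
PERIOD_SECONDS = {
    "year": 365 * 24 * 3600,
    "month": 30 * 24 * 3600,  # assumption: 30-day month
    "week": 7 * 24 * 3600,
    "day": 24 * 3600,
}


def downtime_seconds(availability_percent, period):
    """Allowed downtime, in seconds, for the given availability and period."""
    return (1 - availability_percent / 100) * PERIOD_SECONDS[period]


def incidents_within_budget(availability_percent, period, minutes_per_incident):
    """How many incidents of a given length fit into the downtime budget."""
    budget = downtime_seconds(availability_percent, period)
    return int(budget // (minutes_per_incident * 60))


print(round(downtime_seconds(99.9, "month") / 60, 1))   # 43.2 (minutes)
print(incidents_within_budget(99.9, "month", 30))       # 1
print(incidents_within_budget(99.99, "month", 30))      # 0
```

With a 3 “9” target, a single 30‑minute incident uses most of the monthly budget; with a 4 “9” target, one such incident already breaks it.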

Factors Influencing High Availability

Availability is affected by software design, hardware reliability, third‑party services, and even external factors like power outages or construction work. An SLA is therefore a contract between provider and user, and it must account for both planned and unplanned downtime.
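How these factors combine can be estimated with two standard formulas: components that must all be up multiply their availabilities, while redundant copies fail only if every copy fails. The sketch below assumes failures are independent, which real systems rarely satisfy exactly, and the function names and figures are illustrative.

```python
from math import prod


def serial_availability(*components):
    """All components must be up, so their availabilities multiply."""
    return prod(components)


def redundant_availability(single, copies):
    """At least one of `copies` independent replicas must be up."""
    return 1 - (1 - single) ** copies


# A service at 99.9% that depends on a 99.9% database and a 99.95% third-party API:
print(round(serial_availability(0.999, 0.999, 0.9995), 4))  # 0.9975

# Two independent 99% nodes behind a perfectly reliable failover mechanism:
print(round(redundant_availability(0.99, 2), 6))            # 0.9999
```

The first result shows why dependencies matter: a chain of components is less available than its weakest link. The second shows why redundancy is the basic tool for adding “9s”, provided the failover path itself is reliable.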

Unplanned Downtime Causes

System‑level failures: hardware, OS, middleware, DB, network, power.

Data or middleware failures: human error, disk failure, data corruption.

Natural disasters or sabotage.

Planned Downtime Causes

Routine tasks: backups, capacity planning, user/security management, batch jobs.

Maintenance: DB, application, middleware, OS, network upkeep.

Upgrades: software, middleware, OS, network, hardware.

What Truly Determines High Availability

Achieving a 5 “9” SLA (roughly 5 minutes of annual downtime) requires disciplined engineering management, advanced automation tools, and highly skilled teams, not just clever technical designs. The determining factors include:

Software design, coding, testing, deployment, and configuration management.

Engineer skill levels.

Operations management and tooling.

Data‑center operational excellence.

Management of third‑party service dependencies.

Beyond technical practices, a company must respect engineering as a discipline, fostering a culture that values rigorous processes, code reviews, CI/CD, and automated testing.

In summary, when evaluating a system’s claimed high availability, examine not only the architecture but also the organization’s engineering culture, processes, and commitment to reliability.
