Understanding High Availability: Sources of Complexity and Decision Strategies
The article explains high availability as a source of system complexity, describing how redundancy, hardware and software failures, external disasters, and state‑decision mechanisms such as dictatorial, negotiated, and democratic approaches affect both compute and storage layers, and discusses trade‑offs like the CAP theorem.
Today we discuss the second source of complexity: high availability.
According to Wikipedia, high availability is a characteristic of a system that aims to ensure it can perform its functions without interruption; it is a key design principle for systems that must stay continuously operational.
The crucial point is "without interruption," which is hard to guarantee: hardware ages and eventually fails, and software grows ever more complex, so bugs and faults are essentially unavoidable.
Beyond inherent hardware and software limits, external factors such as power outages, floods, or earthquakes can also cause unavailability, often unpredictably.
High‑availability solutions rely on redundancy: multiple machines, duplicate data centers, or multiple network paths. While high performance adds machines to scale processing, high availability adds machines to provide redundant processing units.
Redundancy improves availability but introduces additional complexity, which will be analyzed for different scenarios.
Compute High Availability – Computation is repeatable: given the same algorithm and the same input, any machine produces the same result, so migrating computation between machines does not affect business results. However, achieving compute HA still adds complexity: a task dispatcher must be introduced, its connections and interactions with the compute units managed, and an allocation algorithm chosen (active‑passive, active‑active, with cold, warm, or hot standby).
More complex HA clusters may use configurations like 1 master + 3 standby, 2 master + 2 standby, etc., with examples such as ZooKeeper (1 master with many standbys) and Memcached (all masters).
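The active‑passive scheme above can be sketched in a few lines. This is a minimal illustration, not a production design; the class and method names (`Node`, `ActivePassiveDispatcher`, `dispatch`) are hypothetical, and real dispatchers detect failure via heartbeats or timeouts rather than catching a single exception.

```python
class Node:
    """A compute node; `healthy` simulates whether it is up."""
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, task):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} processed {task}"

class ActivePassiveDispatcher:
    """Routes every task to the current active node; when the active
    node fails, it is demoted and the next standby is promoted."""
    def __init__(self, nodes):
        self.nodes = list(nodes)  # nodes[0] is active, the rest standbys

    def dispatch(self, task):
        while self.nodes:
            active = self.nodes[0]
            try:
                return active.handle(task)
            except ConnectionError:
                # Failover: drop the failed active, promote next standby.
                self.nodes.pop(0)
        raise RuntimeError("no healthy nodes left")

primary, standby = Node("primary"), Node("standby")
d = ActivePassiveDispatcher([primary, standby])
print(d.dispatch("job-1"))   # handled by primary
primary.healthy = False
print(d.dispatch("job-2"))   # failover: handled by standby
```

Even this toy version shows where the extra complexity lives: the dispatcher itself is now a component that can fail, which is exactly the state‑decision problem discussed later.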
Storage High Availability – Data transfer latency between machines (milliseconds within a data center, tens to hundreds of milliseconds across regions) leads to temporary data inconsistency, which can cause business errors (e.g., a bank balance appearing unchanged after a deposit). Transmission lines can also fail, causing prolonged outages, as seen in the 2015 Alipay cable cut and the 2016 trans‑Pacific cable outage.
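The bank‑balance anomaly comes from asynchronous replication: a write lands on the primary immediately but reaches the replica only after some delay. A minimal sketch (hypothetical `ReplicatedStore` class, with replication modeled as an explicit step rather than a real network):

```python
import collections

class ReplicatedStore:
    """Toy primary/replica pair: writes go to the primary and are
    replicated asynchronously, so replica reads can be stale."""
    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = collections.deque()  # writes not yet replicated

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))   # not yet on the replica

    def read_replica(self, key):
        return self.replica.get(key)        # may lag behind the primary

    def replicate_one(self):
        """Simulate the network delivering one pending write."""
        if self.pending:
            key, value = self.pending.popleft()
            self.replica[key] = value

store = ReplicatedStore()
store.write("balance", 100)           # deposit lands on the primary
print(store.read_replica("balance"))  # None: replica not caught up yet
print(store.primary["balance"])       # 100: primary already has it
store.replicate_one()
print(store.read_replica("balance"))  # 100: consistent after replication
```

The window between `write` and `replicate_one` is the inconsistency window; in a real system it is bounded by network latency within a data center (milliseconds) or across regions (tens to hundreds of milliseconds), and unbounded if the link is cut.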
The CAP theorem shows that a distributed storage system cannot simultaneously guarantee consistency, availability, and partition tolerance; when a partition occurs, it can preserve at most two of the three, so designers must decide which guarantee to sacrifice based on business needs.
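The trade‑off becomes concrete when a replica loses contact with its peer: it must either refuse writes (choosing consistency, a CP system) or accept them locally and risk divergence (choosing availability, an AP system). A toy illustration with a hypothetical `Replica` class:

```python
class Replica:
    """During a partition, a replica must pick one behavior:
    'CP' = refuse writes to stay consistent with its peer;
    'AP' = accept writes and risk temporary divergence."""
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.data = {}
        self.partitioned = False

    def write(self, key, value):
        if self.partitioned and self.mode == "CP":
            # CP: reject the write rather than let replicas diverge.
            raise RuntimeError("unavailable during partition")
        # AP (or healthy network): accept the write locally;
        # replicas may disagree until the partition heals.
        self.data[key] = value
        return "ok"

cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True
print(ap.write("k", 1))   # "ok": available, possibly inconsistent
try:
    cp.write("k", 1)
except RuntimeError as e:
    print(e)              # consistent, but not available
```

A bank would typically pick the CP behavior for balances; a product‑page cache can happily pick AP.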
High‑Availability State Decision – Systems must detect normal versus abnormal states and act accordingly, but redundancy makes perfect state decision impossible. Three common decision styles are discussed:
Dictatorial: a single decision maker collects status reports from all other nodes and alone decides the system state; if the decision maker itself fails, the system can no longer make an accurate state decision.
Negotiated: two nodes exchange information to elect a master; connection loss can cause split‑brain (two masters) or no master.
Democratic: multiple nodes vote (e.g., ZooKeeper leader election); complexity increases, and split‑brain can occur unless a majority rule is enforced, which may reduce overall availability.
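The democratic approach's split‑brain defense is the majority rule: a candidate wins only with votes from more than half of the whole cluster, not just the reachable nodes, so two partitioned minorities can never both elect a leader. A minimal sketch (hypothetical `elect_leader` function; real systems like ZooKeeper add terms, tie‑breaking, and multiple rounds):

```python
def elect_leader(votes, cluster_size):
    """Majority-rule election: a candidate becomes leader only with a
    quorum of the *entire* cluster, which prevents split-brain at the
    cost of availability when no majority can be formed."""
    tally = {}
    for candidate in votes:
        tally[candidate] = tally.get(candidate, 0) + 1
    quorum = cluster_size // 2 + 1
    for candidate, count in tally.items():
        if count >= quorum:
            return candidate
    return None  # no majority: cluster stays leaderless (reduced availability)

# A 5-node cluster split 3/2 by a partition: only the majority side elects.
print(elect_leader(["A", "A", "A"], 5))  # "A": 3 of 5 is a quorum
print(elect_leader(["B", "B"], 5))       # None: minority cannot elect
```

The `None` case is exactly the availability cost mentioned above: enforcing a quorum trades leaderless downtime for the guarantee that there is never more than one master.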
In summary, high availability adds significant complexity to both compute and storage layers, and choosing an appropriate state‑decision mechanism involves careful analysis of trade‑offs.