Fundamentals 26 min read

Why Distributed Consensus Is So Hard: From CAP to Byzantine Fault Tolerance

Distributed systems rely on consensus to ensure consistent results, but achieving it faces fundamental challenges such as network unreliability, node failures, and trade‑offs captured by the CAP theorem, FLP impossibility, and various algorithms like Paxos, Raft, and Byzantine Fault Tolerance, each balancing consistency, availability, and safety.

21CTO
21CTO
21CTO
Why Distributed Consensus Is So Hard: From CAP to Byzantine Fault Tolerance

Introduction

As Moore's law reaches its limits, many systems must use distributed clusters to handle massive data and provide scalable compute power. Blockchains are a type of distributed system, and the first problem when moving from a centralized to a distributed architecture is guaranteeing consistency.

If a cluster cannot ensure that all nodes produce the same result, any application built on top of it will fail.

Consistency Problem

In a distributed system, consistency means that multiple service nodes, under a protocol (often a consensus algorithm), try to reach a common view of the result of a series of operations.

When consistency is achieved, the system can appear as a single "virtual" processing node with better performance and stability.

Example: two cinemas sell tickets for the same movie. To avoid overselling, they must coordinate the remaining ticket count.

Note: Consistency does not guarantee correctness of the result, only that the external state is the same across nodes.

Challenges

Network communication between nodes is unreliable (arbitrary delays, message loss).

Nodes may process incorrectly or crash.

Synchronous calls hurt scalability.

Simple ideas such as calling the other cinema before each sale or alternating selling hours illustrate how serializing concurrent operations can solve the problem, but real systems need algorithmic guarantees.

Requirements

Termination: a consistent result must be reached in finite time.

Consensus: all nodes must decide on the same value.

Validity: the decided value must be one of the proposals submitted by nodes.

Constrained Consistency

Strong consistency (e.g., sequential consistency, linearizability) provides strict ordering guarantees but is expensive. Sequential consistency ensures a global total order that respects each node’s local order. Linearizability adds real‑time ordering, requiring a single global order equivalent to a sequential execution.

Practical systems often relax consistency to achieve better performance, leading to eventual consistency.

Consensus Algorithms

To achieve a chosen level of consistency, systems employ consensus algorithms that let nodes agree on a proposal.

Note: Clients often need to query multiple nodes to verify that a decision has been reached.

Problem Challenges

In an ideal fault‑free, low‑latency environment, consensus is trivial. Real systems face delays, network partitions, node crashes, and even malicious (Byzantine) behavior.

Common Algorithms

For non‑Byzantine failures, Paxos, Raft, and their variants are widely used. For Byzantine failures, PBFT, PoW, and other BFT algorithms are employed.

Theoretical Limits

The FLP impossibility theorem states that in an asynchronous system with even a single possible crash, no deterministic algorithm can guarantee consensus.

The CAP theorem (Consistency, Availability, Partition tolerance) shows that a system cannot simultaneously guarantee all three; designers must sacrifice one.

CAP Theorem

CAP asserts that a distributed system can at most provide two of the three guarantees:

Consistency: all nodes see the same data at the same time (strong consistency).

Availability: every non‑failed node responds to requests.

Partition tolerance: the system continues operating despite network partitions.

Typical design choices weaken consistency (eventual consistency) or availability (reject requests during partitions).

ACID vs. BASE

ACID (Atomicity, Consistency, Isolation, Durability) describes strict transactional guarantees, often at the cost of availability. BASE (Basic Availability, Soft state, Eventual Consistency) relaxes consistency to improve availability.

Paxos and Raft

Paxos, introduced by Leslie Lamport in 1990, is a two‑phase commit‑based consensus algorithm with roles: proposer, acceptor, learner. It guarantees safety (only one value chosen) and liveness (eventual decision) under majority quorum.

Raft simplifies Paxos by using leader election and log replication, making the algorithm easier to understand and implement.

Byzantine Fault Tolerance

The Byzantine Generals Problem models systems where some nodes may act maliciously. Solutions exist only if the number of faulty nodes F satisfies N > 3F (i.e., at least four nodes to tolerate one Byzantine fault).

Practical Byzantine Fault Tolerant (PBFT) uses three phases (pre‑prepare, prepare, commit) to reach agreement when up to one‑third of nodes are faulty.

Blockchain systems like Bitcoin use Proof‑of‑Work to make proposals costly and achieve probabilistic finality, turning economic penalties into a security mechanism.

Reliability Metrics

Service reliability is often expressed as “nines” (e.g., 99.9% = three nines). Higher nines mean less allowable downtime per year. Achieving more than four nines typically requires eliminating single points of failure through replication and distributed designs.

Conclusion

Distributed systems are a crucial area of computer science. Consistency is an ancient yet still vital problem; ideal solutions do not exist, and practical designs must trade off consistency, availability, and performance based on specific requirements.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsCAP theoremRaftByzantine Fault TolerancePaxosconsensus algorithms
21CTO
Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.