Consistency, Consensus, and Reliability in Distributed Systems
This article explains the core challenges of achieving consistency in distributed systems, describes consensus algorithms such as Paxos and Raft, discusses theoretical limits like the FLP impossibility and CAP theorem, and shows how trade‑offs among consistency, availability, and partition tolerance shape practical system design.
Consistency Issues
In distributed systems, consistency means that multiple service nodes reach a common state for a series of operations, usually ensured by a consensus algorithm.
An example of two cinemas selling the same tickets illustrates the need for coordinated decisions to avoid overselling.
Note: Consistency does not guarantee correctness; all nodes may agree on a failure state.
Challenges
Unreliable network communication (delays, faults).
Node failures or crashes.
Synchronous calls hinder scalability.
Simple coordination ideas—phone checks, time‑slot selling, third‑party ticket pool—illustrate the principle of serialising conflicting operations.
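The "third-party ticket pool" idea above can be sketched in a few lines: all cinemas route conflicting sell operations through one coordinator, which serialises them so no seat is sold twice. This is an illustrative sketch (the class and seat names are hypothetical), not a real booking system.

```python
import threading

class TicketPool:
    """Hypothetical third-party ticket pool: a single coordinator
    that serialises conflicting sell operations with a lock."""

    def __init__(self, seats):
        self._remaining = set(seats)
        self._lock = threading.Lock()

    def sell(self, seat):
        # The lock imposes a total order on concurrent sell requests,
        # so at most one cinema can sell a given seat.
        with self._lock:
            if seat in self._remaining:
                self._remaining.remove(seat)
                return True   # sale confirmed
            return False      # seat already sold elsewhere

pool = TicketPool({"A1", "A2"})
results = [pool.sell("A1"), pool.sell("A1")]  # two cinemas, same seat
```

Only one of the two competing sales succeeds; the same serialisation principle underlies the phone-check and time-slot schemes, just with a different mechanism.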
Requirements
Termination – a consistent result must be reached in finite time.
Agreement – all non-faulty nodes eventually decide on the same value.
Validity – the decision must be one of the proposals submitted.
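The safety requirements above (agreement and validity) can be checked mechanically on the outcome of a single round. The helper below is a hypothetical illustration, not part of any real protocol; note that termination is a liveness property and cannot be verified from one snapshot.

```python
def check_consensus(proposals, decisions):
    """Check the safety conditions on one consensus round:
    - agreement: every node decided the same value;
    - validity: the decided value was actually proposed.
    `proposals` is a set of proposed values; `decisions` maps
    node id -> decided value."""
    agreement = len(set(decisions.values())) == 1
    decided = next(iter(decisions.values()))
    validity = decided in proposals
    return agreement and validity

ok = check_consensus({"v1", "v2"}, {"n1": "v1", "n2": "v1", "n3": "v1"})
bad = check_consensus({"v1", "v2"}, {"n1": "v1", "n2": "v2", "n3": "v1"})
```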
Constrained Consistency
Strong consistency is expensive; weaker models such as eventual consistency are often sufficient for web services.
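One common way to implement eventual consistency is last-write-wins merging: replicas accept writes independently and reconcile later. The sketch below is a simplification (real stores add vector clocks, tombstones, and anti-entropy protocols); replica contents and timestamps are made up for illustration.

```python
def lww_merge(replica_a, replica_b):
    """Last-write-wins merge for eventually consistent replicas.
    Each replica maps key -> (timestamp, value); for each key
    the entry with the higher timestamp wins."""
    merged = dict(replica_a)
    for key, (ts, val) in replica_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, val)
    return merged

a = {"x": (1, "old"), "y": (5, "kept")}
b = {"x": (3, "new")}
state = lww_merge(a, b)  # both replicas converge to the same state
```

Because the merge is commutative, replicas converge to the same state regardless of the order in which they exchange updates, which is exactly the guarantee eventual consistency offers.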
Consensus Algorithms
Consensus algorithms achieve agreement on a proposal among nodes. Depending on the fault model, they must tolerate non-Byzantine faults (crashes, message loss) or Byzantine faults (arbitrary or malicious behavior).
Problem Challenges
Real‑world systems face network partitions, node crashes, and malicious nodes, making consensus non‑trivial.
Common Algorithms
For non‑Byzantine faults: Paxos, Raft and variants. For Byzantine faults: PBFT, Proof‑of‑Work, etc.
Theoretical Limits
The FLP impossibility result states that in a fully asynchronous system where even a single node may crash, no deterministic consensus algorithm can guarantee termination in all executions.
FLP Impossibility Principle
Proved by Fischer, Lynch, and Paterson (1985), it shows that no consensus algorithm can solve the problem for all cases in an asynchronous system; practical systems therefore rely on timeouts, randomization, or partial-synchrony assumptions.
CAP Theorem
In a distributed system you cannot simultaneously guarantee Consistency, Availability, and Partition tolerance; since network partitions cannot be ruled out in practice, during a partition you must choose between consistency and availability.
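The CAP trade-off shows up concretely in how a replica behaves when it is cut off from its peers. The toy replica below (names are illustrative, not a real system) either refuses writes to stay consistent (CP) or keeps accepting them and risks divergence (AP).

```python
class Replica:
    """Toy replica showing the CAP choice during a partition.
    Mode "CP": refuse writes while partitioned, staying consistent
    at the cost of availability. Mode "AP": stay available and
    accept writes locally, reconciling after the partition heals."""

    def __init__(self, mode):
        self.mode = mode
        self.partitioned = False
        self.value = None

    def write(self, value):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot reach quorum")
        self.value = value  # AP: accept locally, reconcile later
        return "ok"

cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True
ap.write("v1")  # AP replica stays available during the partition
# cp.write("v1") would raise: the CP replica sacrifices availability
```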
ACID Principles
Atomicity, Consistency, Isolation, Durability describe strict database guarantees, often relaxed to BASE (Basically Available, Soft state, Eventual consistency) for scalability.
Paxos and Raft
Paxos (devised by Lamport around 1990 and published in 1998) reaches consensus through a two-phase prepare/accept protocol among proposer, acceptor, and learner roles; Raft provides the same guarantees with a design intended to be easier to understand, built on leader election and log replication.
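The two phases of single-decree Paxos can be sketched compactly. This is a teaching simplification (in-process method calls instead of messages, no learners, no persistence, integer ballot numbers), not a production implementation.

```python
class Acceptor:
    """Single-decree Paxos acceptor: tracks the highest ballot
    promised and the last (ballot, value) pair accepted."""

    def __init__(self):
        self.promised = -1      # highest ballot promised so far
        self.accepted = None    # (ballot, value) accepted, if any

    def prepare(self, ballot):
        # Phase 1: promise to ignore smaller ballots and report
        # any previously accepted value back to the proposer.
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, ballot, value):
        # Phase 2: accept unless a higher ballot has been promised.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return "accepted"
        return "rejected"

def propose(acceptors, ballot, value):
    """Proposer: run phase 1 on all acceptors; with a majority of
    promises, run phase 2, reusing the value of the highest-ballot
    accepted proposal if one exists (this preserves chosen values)."""
    promises = [a.prepare(ballot) for a in acceptors]
    granted = [p for p in promises if p[0] == "promise"]
    if len(granted) <= len(acceptors) // 2:
        return None  # no majority promised
    prior = [p[1] for p in granted if p[1] is not None]
    if prior:
        value = max(prior)[1]  # highest-ballot accepted value wins
    votes = [a.accept(ballot, value) for a in acceptors]
    if votes.count("accepted") > len(acceptors) // 2:
        return value  # value is chosen
    return None
```

The key safety property is visible in `propose`: once a value has been chosen by a majority, any later proposer with a higher ballot learns it during phase 1 and must re-propose it, so competing proposals cannot overwrite a chosen value.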
Byzantine Fault Tolerance
PBFT (Castro & Liskov 1999) tolerates up to f faulty nodes in a system of 3f+1 nodes, using pre‑prepare, prepare, and commit phases.
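The replica and quorum counts from PBFT follow directly from the formulas above: n = 3f + 1 replicas, with quorums of 2f + 1 matching messages so that any two quorums intersect in at least one correct replica.

```python
def pbft_sizes(f):
    """Quorum arithmetic for PBFT: to tolerate f Byzantine
    replicas, the system needs n = 3f + 1 replicas, and each
    prepare/commit quorum needs 2f + 1 matching messages."""
    n = 3 * f + 1
    quorum = 2 * f + 1
    return n, quorum

n, q = pbft_sizes(1)  # f = 1 -> 4 replicas, quorums of 3
```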
Reliability Metrics
Service reliability is often expressed as “nines” (e.g., 99.99% uptime); higher nines require more redundancy and cost.
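The cost of each extra nine is easy to see by converting an availability target into an annual downtime budget (a back-of-envelope calculation assuming a 365-day year):

```python
def max_downtime_minutes_per_year(availability):
    """Allowed downtime per year for a given availability
    target, e.g. 0.9999 for 'four nines'."""
    return (1 - availability) * 365 * 24 * 60

three_nines = max_downtime_minutes_per_year(0.999)   # ~525.6 min/year
four_nines = max_downtime_minutes_per_year(0.9999)   # ~52.6 min/year
```

Each additional nine shrinks the allowed downtime tenfold, which is why the redundancy and operational cost grow so steeply.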
Conclusion
Distributed systems are a core area of computer science; achieving consistency involves trade‑offs among performance, availability, and fault tolerance.