Fundamentals of Distributed Systems: Models, Replication, Consistency, and Protocols
This article introduces the core concepts of distributed systems: node and replica models, consistency levels, data-distribution strategies, lease and quorum mechanisms, replica control protocols such as primary‑secondary, two‑phase commit, MVCC, and Paxos, and the CAP theorem, giving architects a comprehensive overview.
1. Concepts
In a distributed system, a node is typically an OS process. The model treats each node as an indivisible whole; conversely, when finer granularity is useful, a single process may be modeled as several logical nodes.
Typical exceptions include machine crashes, network failures (message loss and reordering, which even TCP cannot fully mask end to end), the three‑state result of an RPC (success, failure, or timeout, where a timeout means the true outcome is unknown), and storage data loss. The golden rule of exception handling is that any failure scenario considered during design will eventually occur in production, and many unforeseen failures will appear as well.
A replica (copy) provides redundancy for data or services. Data replicas persist the same data on different nodes to survive loss; service replicas provide identical functionality without relying on local storage.
Replica Consistency
Consistency models range from strong consistency (every read sees the latest write) to monotonic, session, eventual, and weak consistency, each offering different trade‑offs between freshness and availability.
Metrics for Distributed Systems
Key indicators are performance (throughput, latency, concurrency), availability (uptime vs. downtime), scalability (linear performance growth with added nodes), and consistency (how uniformly replicas reflect updates).
2. Distributed System Principles
Data Distribution Methods
Data can be partitioned by hash (simple modulo mapping, suffers from poor scalability and data skew), by range (splitting the key space into intervals, often managed by metadata servers), by size/chunk (fixed‑size blocks independent of key values), or by consistent hashing (nodes placed on a ring, optionally using virtual nodes to improve balance and scalability).
Choosing a distribution strategy often involves combining methods to mitigate their individual drawbacks, such as adding chunk‑based rebalancing to a hash‑based layout to address data skew.
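The consistent‑hashing scheme above can be sketched as follows. This is a minimal illustrative implementation, not code from the article; the class and method names are assumptions, and a production ring would use a stronger hash function than `String.hashCode`.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring with virtual nodes (illustrative sketch).
public class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int vnodesPerNode;

    public ConsistentHashRing(int vnodesPerNode) {
        this.vnodesPerNode = vnodesPerNode;
    }

    // Deterministic non-negative hash; a real ring would use a stronger function.
    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff;
    }

    // Each physical node occupies several positions on the ring via virtual nodes,
    // which smooths out load imbalance.
    public void addNode(String node) {
        for (int i = 0; i < vnodesPerNode; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < vnodesPerNode; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    // A key maps to the first virtual node clockwise from its hash position.
    public String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }
}
```

Note the key property: removing a node only remaps the keys that were assigned to it; keys owned by other nodes keep their placement, which is exactly the scalability advantage over simple modulo hashing.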
Replica Control Protocols
Centralized protocols use a single coordinator (e.g., primary‑secondary) to order updates, perform concurrency control, and enforce consistency. The primary handles all writes, propagates updates to secondaries, and may employ a relay chain to avoid bandwidth bottlenecks.
Decentralized protocols treat all nodes as peers, achieving consensus without a single point of failure but at the cost of higher complexity and lower performance.
Primary‑Secondary Protocol
The primary receives external updates, orders them, and forwards them to secondaries; secondaries may be marked unavailable if they fail to sync.
Lease Mechanism
Leases grant a time‑bounded guarantee that the issuer will not modify a piece of data while the lease is valid; clients cache data for the lease period, improving read performance while preserving correctness even under network partitions or node crashes, since the issuer can simply wait for outstanding leases to expire before modifying the data.
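The lease bookkeeping on the issuer side can be sketched like this. The names and structure are assumptions for illustration; real systems also account for clock drift between issuer and holder.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative lease table kept by the data issuer (sketch, not from the article).
public class LeaseTable {
    // Latest lease expiry granted per data item, in millis since epoch.
    private final Map<String, Long> expiryByItem = new HashMap<>();

    // Grant a lease: the issuer promises not to modify `item` before the expiry.
    public long grant(String item, long nowMillis, long durationMillis) {
        long expiry = nowMillis + durationMillis;
        // Keep the furthest-out expiry if several clients hold leases.
        expiryByItem.merge(item, expiry, Math::max);
        return expiry;
    }

    // The issuer may modify the item only after all outstanding leases expire;
    // if a client crashes or is partitioned away, this simply means waiting.
    public boolean mayModify(String item, long nowMillis) {
        Long expiry = expiryByItem.get(item);
        return expiry == null || nowMillis >= expiry;
    }
}
```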
Quorum Mechanism
Writes must succeed on at least W of N replicas; reads query at least R replicas. With W + R > N, every read quorum intersects every write quorum, so at least one queried replica holds the latest committed write (version numbers are needed to identify which one). Variants range from write‑all‑read‑one (WARO, where W = N and R = 1) to more balanced configurations.
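The two quorum conditions can be captured in a few lines. This is a sketch of the arithmetic only, with hypothetical names; a real quorum system also handles repair of stale replicas.

```java
// Quorum arithmetic sketch (illustrative helper, names are assumptions).
public class Quorum {
    // W + R > N guarantees every read quorum overlaps every write quorum.
    public static boolean intersects(int n, int w, int r) {
        return w + r > n;
    }

    // Given version numbers returned by an R-sized read quorum, the highest
    // version is the latest committed write (assuming W + R > N held for writes).
    public static int latestVersion(int[] versionsFromReadQuorum) {
        int max = Integer.MIN_VALUE;
        for (int v : versionsFromReadQuorum) {
            max = Math.max(max, v);
        }
        return max;
    }
}
```

For example, N = 3, W = 2, R = 2 satisfies the overlap condition, while W = 1, R = 2 does not, so a read could miss the only replica that saw the write.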
Log Techniques
Redo logs record the result of each update for crash recovery; checkpoints periodically dump in‑memory state to reduce replay time. The 0/1 directory approach achieves atomic batch updates by toggling a master record between two directory versions.
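Redo‑log recovery as described above can be sketched in memory: restore the last checkpoint, then replay only the log entries recorded after it. This is an illustrative model with assumed names, not a durable implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Redo-log recovery sketch: checkpoint plus replay of later entries.
public class RedoRecovery {
    // A redo-log entry records the *result* of an update, keyed by sequence number.
    public record LogEntry(long seq, String key, String value) {}

    public static Map<String, String> recover(Map<String, String> checkpoint,
                                              long checkpointSeq,
                                              List<LogEntry> log) {
        // Start from the checkpointed in-memory state...
        Map<String, String> state = new HashMap<>(checkpoint);
        // ...and redo only the updates logged after the checkpoint,
        // which is what keeps replay time bounded.
        for (LogEntry e : log) {
            if (e.seq() > checkpointSeq) {
                state.put(e.key(), e.value());
            }
        }
        return state;
    }
}
```

Because each entry stores the final result of its update, replaying an entry twice is harmless, which is why recovery can safely restart from the checkpoint.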
Two‑Phase Commit (2PC)
A classic centralized protocol with a coordinator and participants. Phase 1 (prepare) gathers votes; Phase 2 (commit/abort) finalizes the transaction. 2PC offers strong consistency but suffers from poor availability and performance: a coordinator crash at the wrong moment can leave participants blocked, holding locks until it recovers.
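The coordinator's decision rule is simple to state in code: commit only if every participant voted yes, and treat anything else (including a timeout) as a no. A minimal sketch, with assumed names:

```java
import java.util.List;

// Two-phase commit decision logic (sketch of the coordinator's rule).
public class TwoPhaseCommit {
    public enum Vote { YES, NO }          // a prepare timeout is treated as NO
    public enum Decision { COMMIT, ABORT }

    // Phase 2 decision: unanimous YES commits; any other outcome aborts.
    public static Decision decide(List<Vote> votes) {
        for (Vote v : votes) {
            if (v != Vote.YES) {
                return Decision.ABORT;
            }
        }
        return Decision.COMMIT;
    }
}
```

The unanimity requirement is the source of 2PC's availability problem: one slow or failed participant forces an abort, and a failed coordinator leaves everyone waiting.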
Multi‑Version Concurrency Control (MVCC)
Each transaction creates a new data version rather than overwriting in place; reads select the appropriate version for their snapshot, enabling high concurrency between readers and writers while preserving a history of changes.
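The core of MVCC fits in a small sketch: keep versions sorted by commit timestamp, and let a read at snapshot time t return the newest version committed at or before t. Names and structure are illustrative assumptions, not a real engine.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// MVCC sketch: per-key version chains ordered by commit timestamp.
public class MvccStore {
    private final Map<String, TreeMap<Long, String>> versions = new HashMap<>();

    // A committing transaction appends a new version instead of overwriting.
    public void write(String key, long commitTs, String value) {
        versions.computeIfAbsent(key, k -> new TreeMap<>()).put(commitTs, value);
    }

    // A read at snapshot ts sees the newest version with commitTs <= ts,
    // so readers never block writers and vice versa.
    public String readAt(String key, long snapshotTs) {
        TreeMap<Long, String> chain = versions.get(key);
        if (chain == null) {
            return null;
        }
        Map.Entry<Long, String> e = chain.floorEntry(snapshotTs);
        return e == null ? null : e.getValue();
    }
}
```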
Paxos Consensus
Nodes act as proposers, acceptors, and learners. A value is chosen when a majority of acceptors accept it. Paxos provides strong consistency with good availability and partition tolerance, essentially a quorum‑based protocol.
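The acceptor's side of single‑decree Paxos can be sketched as two rules: never go back on a promise, and never accept a proposal numbered below the highest promise. This is a simplified illustration (one acceptor, in memory); a value is chosen only once a majority of acceptors have accepted it.

```java
// Single-decree Paxos acceptor sketch (illustrative, names are assumptions).
public class PaxosAcceptor {
    private long promised = -1;    // highest proposal number promised so far
    private String acceptedValue;  // value accepted, if any

    // Phase 1b: promise to ignore proposals numbered at or below `promised`.
    public boolean promise(long n) {
        if (n <= promised) {
            return false; // stale proposal number; reject
        }
        promised = n;
        return true;
    }

    // Phase 2b: accept unless a higher-numbered promise was made in the meantime.
    public boolean accept(long n, String value) {
        if (n < promised) {
            return false;
        }
        promised = n;
        acceptedValue = value;
        return true;
    }

    public String acceptedValue() {
        return acceptedValue;
    }
}
```

The quorum flavor of the protocol shows up in how these rules compose: because any two majorities of acceptors intersect, a later proposer contacting a majority in phase 1 is guaranteed to learn of any value already chosen.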
CAP Theorem
The CAP theorem states that a distributed system can simultaneously provide at most two of three guarantees: Consistency, Availability, and Partition tolerance. Different protocols make different trade‑offs: lease favors C + P, quorum configurations balance all three, 2PC prioritizes C at the cost of A and P, while Paxos achieves C with good A and P.
Overall, the article offers a comprehensive overview of distributed‑system fundamentals, guiding architects in selecting appropriate models, replication strategies, consistency guarantees, and consensus protocols.
Java Architect Essentials
Committed to sharing quality articles and tutorials to help Java programmers progress from junior to mid-level to senior architect. We curate high-quality learning resources, interview questions, videos, and projects from across the internet to help you systematically improve your Java architecture skills. Follow and reply '1024' to get Java programming resources. Learn together, grow together.