Key Distributed System Design Patterns Explained by a Senior Architect
This article presents a concise overview of essential distributed‑system design patterns—including Bloom filters, consistent hashing, quorum, leader/follower, heartbeat, fencing, high‑water mark, leases, gossip protocol, PACELC theorem, hinted handoff, read‑repair, and Merkle trees—explaining their purpose, operation, and typical usage in large‑scale storage and coordination services.
Hello everyone, I am a senior architect.
1. Bloom Filter
A Bloom filter is a space‑efficient probabilistic data structure used to test whether an element is a member of a set. It is useful when we only need to know if an element belongs to a collection.
In BigTable (and Cassandra), every read must fetch data from the SSTable that composes a tablet. If the SSTable is not in memory, many disk accesses may be required. To reduce disk I/O, BigTable uses Bloom filters.
2. Consistent Hashing
Consistent hashing enables easy scaling by allowing data to be replicated efficiently, improving availability and fault tolerance. The key is hashed to a position on a ring; moving clockwise, the first node with a position greater than the key stores the data.
The main advantage is incremental stability: only the immediate neighbors of a departing or joining node are affected.
3. Quorum
In a distributed environment, a quorum is the minimum number of servers that must successfully complete an operation before it is considered successful.
For example, Cassandra can be configured to succeed a write only after it has been written to at least one quorum (or majority) of replica nodes.
Chubby uses Paxos for leader election, which also relies on quorum to guarantee strong consistency.
Dynamo uses a “rough” quorum for writes, unlike the strict majority quorum of Paxos. All reads/writes are performed on the first healthy node in the preference list, which may not be the first node encountered when traversing the consistent‑hash ring.
4. Leader and Follower
To achieve fault tolerance, data is replicated across multiple servers. One server is elected as the leader, which makes decisions for the cluster and propagates them to followers.
In a cluster of three to five nodes, leader election occurs at server startup; the system does not accept client requests until a leader is elected.
5. Heartbeat
The heartbeat mechanism detects whether the current leader has failed, allowing a new leader election to be triggered.
6. Fencing
In a leader‑follower model, when a leader fails it may still be active due to network partitions. Fencing places a “fence” around the previously active leader, preventing it from accessing cluster resources and serving any read/write requests.
Two techniques are used:
Resource fencing – the system blocks the former leader from accessing essential resources.
Node fencing – the system cuts off the former leader’s power or resets the node.
7. Write‑Ahead Log (WAL)
WAL records operations before they are applied to the actual storage, inspired by database systems. In case of a crash, the system can replay the log to recover to a consistent state.
8. Segmented Log
Instead of a single large log file, the log is split into multiple smaller files (segments) to avoid performance bottlenecks and simplify cleanup.
9. High‑Water Mark
The high‑water mark tracks the last log entry on the leader that has been successfully replicated to a quorum of followers. Only entries up to this index are exposed to clients.
Kafka uses the high‑water mark to ensure that consumers only see messages that have been replicated.
10. Lease
A lease works like a lock that remains valid for a limited time even if the client disconnects. The client can renew the lease before it expires.
Chubby clients hold a time‑bounded session lease with the leader; the leader guarantees not to terminate the session unilaterally during this interval.
11. Gossip Protocol
Gossip is a peer‑to‑peer communication mechanism where nodes periodically exchange state information about themselves and other nodes.
Each node initiates a gossip round every second, selecting a random peer to exchange status.
12. PACELC Theorem
PACELC extends CAP by adding a latency‑consistency trade‑off when there is no partition. It states:
If a partition occurs (P), the system chooses between Availability (A) and Consistency (C).
Else (E), the system chooses between Latency (L) and Consistency (C).
Thus, PACELC captures both the partition‑time trade‑off (CAP) and the normal‑operation trade‑off (ELC).
13. Hinted Handoff
If a node is down, the system stores hints (or annotations) for the missed requests. When the node recovers, the leader forwards the stored hints to it.
The leader writes these hints to a local text file; once the target node is back online, the leader replays the writes.
14. Read‑Repair
In replicated systems, some replicas may hold stale data. During a read, the system compares data from multiple replicas, identifies outdated nodes, and pushes the latest version to them.
Cassandra and Dynamo use read‑repair to synchronize stale replicas.
15. Merkle Trees
When a replica is significantly behind, comparing whole data sets is inefficient. Merkle trees allow efficient comparison of large data ranges by hashing sub‑ranges.
A Merkle tree is a binary hash tree where each internal node is the hash of its two children, and leaf nodes are hashes of data blocks.
Comparing two Merkle trees proceeds as follows:
Compare the root hashes.
If they match, stop – the data sets are identical.
If they differ, recursively compare left and right children.
Dynamo uses Merkle trees for anti‑entropy and conflict resolution.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
