Data Replication in Distributed Systems – Part 1: Models, Challenges, and Design Considerations
The article surveys three data‑replication models—master‑slave, multi‑master, and leaderless—explains how they enable scalability and fault‑tolerance, and examines core distributed‑system challenges such as partial node failures, unreliable networks, and unsynchronized clocks, while stressing safety‑liveness trade‑offs and design techniques like quorum and timeouts.
Introduction
Designing distributed systems is complex, especially the data replication and consistency aspects. Without a solid theoretical foundation, replication algorithms are likely to fail.
This article series follows the book Designing Data‑Intensive Applications (DDIA) , adding practical insights and mapping concepts to Kafka.
Table of Contents
1. Introduction
2. Data Replication Patterns
3. Challenges of Distributed Systems
4. Summary
1. Introduction
Replication is used to achieve scalability, fault‑tolerance, and a uniform user experience by distributing data across multiple machines.
Two storage approaches exist: sharding (partial data per node) and full replication (complete copies, called replicas in Kafka).
2. Data Replication Patterns
2.1 Master‑Slave (Primary‑Secondary)
Writes go to the primary; replicas (secondaries) receive logs. Reads can be served by either primary or secondaries, improving scalability. Consistency depends on synchronous vs. asynchronous replication.
Kafka provides acks settings to choose between async and sync replication and uses ISR (in‑sync replica) to tolerate some replica failures.
2.2 Multi‑Master Replication
Multiple primaries handle writes, increasing throughput and allowing geographic distribution. Conflict resolution becomes essential; strategies include avoiding conflicts, converging to a consistent state, or letting users resolve conflicts.
2.3 Leaderless Replication
Any replica can accept writes, or a coordinator handles writes without enforcing order. Quorum writes and reads ensure that a majority of replicas overlap, providing eventual consistency.
3. Challenges of Distributed Systems
3.1 Partial Failures
Nodes may fail independently, networks may drop packets, and clocks are not synchronized, making it hard to guarantee consistency.
3.2 Unreliable Network
Network issues include lost requests, delayed acknowledgments, and node unavailability. Application‑level problems (GC pauses, heavy I/O, page faults) exacerbate these issues.
3.3 Unreliable Clocks
Wall‑clock time can drift, causing problems for timestamp‑based ordering (e.g., last‑write‑wins). Monotonic clocks are preferred for measuring durations. TrueTime (Google) provides bounded timestamps for stronger guarantees.
3.4 Designing Under Uncertainty
Use timeouts to detect failures, employ quorum protocols, and balance safety (no incorrect behavior) with liveness (progress). System models include synchronous, semi‑synchronous, and asynchronous; failure models include crash‑stop and crash‑recover.
Safety and Liveness
Safety ensures nothing bad happens; liveness ensures desired actions eventually occur. Distributed algorithms must satisfy safety and, under reasonable assumptions, liveness.
4. Summary
The article introduced three replication models (master‑slave, multi‑master, leaderless) and highlighted two fundamental challenges: partial node failures and unsynchronized clocks. The next article will cover transactions, consistency, and consensus.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
