Databases 37 min read

Data Replication in Distributed Systems – Part 1: Models, Challenges, and Design Considerations

The article surveys three data‑replication models—master‑slave, multi‑master, and leaderless—explains how they enable scalability and fault‑tolerance, and examines core distributed‑system challenges such as partial node failures, unreliable networks, and unsynchronized clocks, while stressing safety‑liveness trade‑offs and design techniques like quorum and timeouts.

Meituan Technology Team

Aug 25, 2022

Data Replication in Distributed Systems – Part 1: Models, Challenges, and Design Considerations

Introduction

Designing distributed systems is complex, especially the data replication and consistency aspects. Without a solid theoretical foundation, replication algorithms are likely to fail.

This article series follows the book Designing Data‑Intensive Applications (DDIA) , adding practical insights and mapping concepts to Kafka.

1. Introduction

2. Data Replication Patterns

3. Challenges of Distributed Systems

4. Summary

1. Introduction

Replication is used to achieve scalability, fault‑tolerance, and a uniform user experience by distributing data across multiple machines.

Two storage approaches exist: sharding (partial data per node) and full replication (complete copies, called replicas in Kafka).

2. Data Replication Patterns

2.1 Master‑Slave (Primary‑Secondary)

Writes go to the primary; replicas (secondaries) receive logs. Reads can be served by either primary or secondaries, improving scalability. Consistency depends on synchronous vs. asynchronous replication.

Kafka provides acks settings to choose between async and sync replication and uses ISR (in‑sync replica) to tolerate some replica failures.

2.2 Multi‑Master Replication

Multiple primaries handle writes, increasing throughput and allowing geographic distribution. Conflict resolution becomes essential; strategies include avoiding conflicts, converging to a consistent state, or letting users resolve conflicts.

2.3 Leaderless Replication

Any replica can accept writes, or a coordinator handles writes without enforcing order. Quorum writes and reads ensure that a majority of replicas overlap, providing eventual consistency.

3. Challenges of Distributed Systems

3.1 Partial Failures

Nodes may fail independently, networks may drop packets, and clocks are not synchronized, making it hard to guarantee consistency.

3.2 Unreliable Network

Network issues include lost requests, delayed acknowledgments, and node unavailability. Application‑level problems (GC pauses, heavy I/O, page faults) exacerbate these issues.

3.3 Unreliable Clocks

Wall‑clock time can drift, causing problems for timestamp‑based ordering (e.g., last‑write‑wins). Monotonic clocks are preferred for measuring durations. TrueTime (Google) provides bounded timestamps for stronger guarantees.

3.4 Designing Under Uncertainty

Use timeouts to detect failures, employ quorum protocols, and balance safety (no incorrect behavior) with liveness (progress). System models include synchronous, semi‑synchronous, and asynchronous; failure models include crash‑stop and crash‑recover.

Safety and Liveness

Safety ensures nothing bad happens; liveness ensures desired actions eventually occur. Distributed algorithms must satisfy safety and, under reasonable assumptions, liveness.

4. Summary

The article introduced three replication models (master‑slave, multi‑master, leaderless) and highlighted two fundamental challenges: partial node failures and unsynchronized clocks. The next article will cover transactions, consistency, and consensus.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

distributed systems System Design kafka Data Replication consistency quorum replication models

Written by

Meituan Technology Team

Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

Introduction

Table of Contents

1. Introduction

2. Data Replication Patterns

2.1 Master‑Slave (Primary‑Secondary)

2.2 Multi‑Master Replication

2.3 Leaderless Replication

3. Challenges of Distributed Systems

3.1 Partial Failures

3.2 Unreliable Network

3.3 Unreliable Clocks

3.4 Designing Under Uncertainty

Safety and Liveness

4. Summary

Meituan Technology Team

How this landed with the community

Was this worth your time?

0 Comments