
How Kafka Guarantees High Reliability – Architecture, Replication & Benchmarks

This article explains Kafka's distributed architecture, topic‑partition model, replication and ISR mechanisms, data durability settings, delivery guarantees, deduplication strategies, and presents benchmark results that illustrate how configuration choices affect throughput and latency in real‑world deployments.

21CTO

Jun 11, 2017


Overview

Kafka, originally developed by LinkedIn and now part of Apache, is a Scala‑based distributed messaging system known for horizontal scalability and high throughput. It is widely adopted by internet companies, including Vipshop, as a core message engine.

Kafka Architecture

A typical Kafka cluster consists of multiple producers, a set of brokers, consumer groups, and a ZooKeeper ensemble that manages cluster metadata, leader election, and ISR (in‑sync replica) tracking.

Topic & Partition

Each topic is divided into partitions, which are append‑only log files. Messages are written sequentially, giving high write efficiency. Partitioning distributes load across brokers, enabling horizontal scaling.

# Default number of log partitions per topic
num.partitions=3
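How a keyed message might be routed to one of those partitions can be sketched as follows. This is a minimal illustration, not Kafka's own code: `choose_partition` and `append` are hypothetical helpers, and CRC32 is an assumed stand-in for the hash function (the Java client's default partitioner uses murmur2).

```python
import zlib
from typing import List, Tuple


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index.

    Assumption: CRC32 stands in for the real hash function.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key) % num_partitions


def append(partitions: List[List[bytes]], key: bytes, value: bytes) -> Tuple[int, int]:
    """Append value to the partition chosen for key; return (partition, offset)."""
    p = choose_partition(key, len(partitions))
    partitions[p].append(value)          # append-only: new messages go at the end
    return p, len(partitions[p]) - 1     # offset is the position within the partition
```

Because the same key always hashes to the same partition, messages for one key keep their relative order, while different keys spread across brokers.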

High‑Reliability Storage Analysis

Kafka stores data in a hierarchy of directories: topic -> partition -> segment. Each segment consists of an .index file, which maps message offsets to byte positions in the log, and a .log file, which holds the actual messages. Offsets uniquely identify messages within a partition, and a binary search over the index files enables fast retrieval.
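The two-step lookup can be sketched as below, assuming each segment is identified by the offset of its first message and the index holds a sparse, sorted list of (relative offset, byte position) pairs. The function names and the sample numbers in the usage note are illustrative, not taken from a real cluster.

```python
from bisect import bisect_right
from typing import List, Tuple


def find_segment(base_offsets: List[int], target: int) -> int:
    """Return the base offset of the segment that holds `target`.

    `base_offsets` must be sorted in ascending order.
    """
    i = bisect_right(base_offsets, target) - 1
    if i < 0:
        raise KeyError(f"offset {target} precedes the first segment")
    return base_offsets[i]


def find_position(index: List[Tuple[int, int]], base: int, target: int) -> int:
    """Return the byte position in the .log file to start scanning from.

    The nearest index entry at or below the target offset is chosen;
    the caller then scans forward in the log to the exact message.
    """
    rel = target - base
    i = bisect_right(index, (rel, float("inf"))) - 1
    return index[i][1] if i >= 0 else 0
```

For example, with segments starting at offsets 0, 368769 and 737337, offset 368776 falls in the second segment; with index entries `[(1, 0), (3, 497), (6, 1407), (8, 1686)]`, the scan would start at byte position 1407.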

Message physical structure includes offset, size, CRC, magic byte, attributes, key length, key, and payload, allowing precise determination of message boundaries.
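To show why these fields are enough to find message boundaries, here is a simplified encoder and decoder. It is a sketch under assumptions, not a byte-exact reproduction of any Kafka format version: the field widths are chosen for illustration, and a payload length field is added as an assumption so the payload can be delimited.

```python
import struct
import zlib
from typing import Optional, Tuple

HEADER = struct.Struct(">qi")  # offset (int64), size of the rest of the record (int32)


def encode(offset: int, key: Optional[bytes], payload: bytes) -> bytes:
    """Serialize one record: offset, size, CRC, magic, attributes, key, payload."""
    key_len = -1 if key is None else len(key)       # -1 marks a missing key
    body = struct.pack(">bbi", 0, 0, key_len) + (key or b"")
    body += struct.pack(">i", len(payload)) + payload
    crc = zlib.crc32(body) & 0xFFFFFFFF             # checksum over everything after the CRC
    message = struct.pack(">I", crc) + body
    return HEADER.pack(offset, len(message)) + message


def decode(buf: bytes, pos: int = 0) -> Tuple[int, Optional[bytes], bytes, int]:
    """Return (offset, key, payload, next_pos) for the record starting at `pos`."""
    offset, size = HEADER.unpack_from(buf, pos)
    start = pos + HEADER.size
    end = start + size                              # boundary of this record
    (crc,) = struct.unpack_from(">I", buf, start)
    body = buf[start + 4:end]
    if zlib.crc32(body) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch: record is corrupt")
    _magic, _attributes, key_len = struct.unpack_from(">bbi", body, 0)
    cursor = 6                                      # magic (1) + attributes (1) + key length (4)
    key = None
    if key_len >= 0:
        key = body[cursor:cursor + key_len]
        cursor += key_len
    (payload_len,) = struct.unpack_from(">i", body, cursor)
    cursor += 4
    return offset, key, body[cursor:cursor + payload_len], end
```

The size field lets a reader jump from one record to the next without parsing the body, and the CRC lets it detect a truncated or corrupted record.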

Replication Principle and Synchronization

Each partition has N replicas (N ≥ 1; the default replication factor is 1). One replica acts as the leader, handling all reads and writes, while followers replicate the leader’s log. The leader updates the High Watermark (HW) based on the smallest Log End Offset (LEO) among ISR members, ensuring that a message is considered committed only when all ISR replicas have written it to their logs.
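The HW rule can be written down in a few lines. This is a minimal sketch with hypothetical broker names; it assumes the LEO is the offset of the next message a replica will write, so a message is committed when its offset is below the HW.

```python
from typing import Dict, Set


def high_watermark(leo: Dict[str, int], isr: Set[str]) -> int:
    """Return the smallest log end offset among in-sync replicas."""
    if not isr:
        raise ValueError("ISR must contain at least the leader")
    return min(leo[replica] for replica in isr)


def is_committed(offset: int, hw: int) -> bool:
    """A message is committed once every ISR member has it, i.e. offset < HW."""
    return offset < hw
```

With LEOs of 10, 8 and 5 on brokers b1, b2 and b3 and an ISR of {b1, b2}, the HW is 8: the lagging b3 is outside the ISR and does not hold the HW back.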

ISR (In‑Sync Replicas)

ISR is a subset of assigned replicas that are fully caught up with the leader. Replicas falling behind beyond configured thresholds are removed from ISR, affecting leader election and write availability.
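A time-based version of this rule, in the spirit of the broker setting replica.lag.time.max.ms, might look like the sketch below. The function names, the broker names and the bookkeeping of "last caught up" timestamps are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set


def shrink_isr(isr: Set[str], leader: str, last_caught_up_ms: Dict[str, int],
               now_ms: int, max_lag_ms: int) -> Set[str]:
    """Drop followers that have not caught up with the leader within max_lag_ms."""
    return {r for r in isr
            if r == leader or now_ms - last_caught_up_ms[r] <= max_lag_ms}


def elect_leader(assigned: List[str], isr: Set[str], alive: Set[str]) -> Optional[str]:
    """Pick the first assigned replica that is both alive and in the ISR.

    Returns None when no such replica exists, which models a partition
    that stays offline rather than electing an out-of-sync replica.
    """
    for replica in assigned:
        if replica in isr and replica in alive:
            return replica
    return None
```

A smaller ISR narrows the set of candidates for leader election, which is how ISR membership ends up affecting availability.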

Data Reliability and Persistence Guarantees

Producers control durability via request.required.acks (named acks in the newer Java producer):

1 (default): leader acknowledgment only; data may be lost if the leader fails.

0: fire‑and‑forget; highest throughput, lowest reliability.

-1 (all): all ISR replicas must acknowledge; highest reliability.

When acks=-1, setting min.insync.replicas ensures that writes are rejected if the number of in‑sync replicas falls below the threshold.
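The interaction between the acknowledgment level and min.insync.replicas can be summarized as a small decision function. This is a sketch of the rules described above, not broker code; `handle_produce` and its return strings are hypothetical.

```python
def handle_produce(acks: int, isr_size: int, min_insync_replicas: int) -> str:
    """Describe how a write is treated for a given acks level and ISR size."""
    if acks not in (0, 1, -1):
        raise ValueError("acks must be 0, 1 or -1")
    # min.insync.replicas is only enforced when all ISR replicas must acknowledge.
    if acks == -1 and isr_size < min_insync_replicas:
        return "rejected"
    if acks == 0:
        return "no-ack"                    # fire-and-forget
    if acks == 1:
        return "ack-after-leader-write"    # lost if the leader fails before replication
    return "ack-after-isr-write"           # every in-sync replica has the message
```

Note that with acks set to 0 or 1 a write is still accepted when the ISR has shrunk to the leader alone, which is why the threshold only adds protection in combination with acks=-1.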

Delivery Guarantees

Kafka can provide three delivery semantics:

At most once – possible loss, no duplication.

At least once – no loss, possible duplication.

Exactly once – requires additional deduplication logic.
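On the consumer side, the difference between the first two semantics comes down to whether the offset is committed before or after a message is processed. The simulation below is a toy model with one injected crash, not a real consumer; `consume` and its parameters are hypothetical.

```python
from typing import List


def consume(log: List[str], commit_first: bool, crash_at: int) -> List[str]:
    """Process `log` with one simulated crash and restart; return what was processed.

    The crash happens at offset `crash_at`, between the commit step and
    the process step for that message. After the crash the consumer
    resumes from the last committed offset.
    """
    processed: List[str] = []
    committed = 0        # next offset to read after a restart
    crashed = False
    pos = 0
    while pos < len(log):
        if commit_first:                     # at most once
            committed = pos + 1
            if pos == crash_at and not crashed:
                crashed, pos = True, committed
                continue                     # message was committed but never processed
            processed.append(log[pos])
        else:                                # at least once
            processed.append(log[pos])
            if pos == crash_at and not crashed:
                crashed, pos = True, committed
                continue                     # message was processed but never committed
            committed = pos + 1
        pos += 1
    return processed
```

For a log of a, b, c with a crash at offset 1, committing first yields a, c (b is lost), while processing first yields a, b, b, c (b is duplicated).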

Message Deduplication

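Kafka versions before 0.11 do not natively support deduplication. Applications can attach a globally unique identifier (GUID) to each message and record processed identifiers in an external store such as Redis, so that processing a redelivered message a second time has no effect (idempotence).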
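A consumer-side version of this idea can be sketched as follows. The in-memory set is an assumed stand-in for an external store such as Redis, and `DedupConsumer`, `new_message` and the message shape are hypothetical.

```python
import uuid
from typing import Callable, Dict, Set


def new_message(payload: str) -> Dict[str, str]:
    """Producer side: attach a globally unique identifier to each message."""
    return {"id": str(uuid.uuid4()), "payload": payload}


class DedupConsumer:
    """Consumer side: skip any message whose identifier was already handled."""

    def __init__(self, handler: Callable[[Dict[str, str]], None]) -> None:
        self._seen: Set[str] = set()   # stand-in for an external store
        self._handler = handler

    def on_message(self, message: Dict[str, str]) -> bool:
        """Handle the message once; return False for a duplicate."""
        if message["id"] in self._seen:
            return False
        self._handler(message)
        self._seen.add(message["id"])  # record only after the handler succeeded
        return True
```

In this sketch the handler and the bookkeeping are not atomic: a crash between the two can still cause one repeat, so the handler itself should tolerate being run twice, or both steps should share a transaction in the external store.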

High‑Reliability Configuration

Topic: replication.factor ≥ 3, 2 ≤ min.insync.replicas ≤ replication.factor

Broker: unclean.leader.election.enable=false

Producer: request.required.acks=-1, producer.type=sync
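These recommendations can be turned into a small checker. The sketch below assumes, for simplicity, that topic, broker and producer settings are merged into one dictionary keyed by the property names used above; `check_reliability` is a hypothetical helper, and a missing setting is treated as not meeting the recommendation.

```python
from typing import Any, Dict, List


def check_reliability(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of deviations from the high-reliability recommendations."""
    problems: List[str] = []
    replication_factor = int(cfg.get("replication.factor", 0))
    min_isr = int(cfg.get("min.insync.replicas", 0))
    if replication_factor < 3:
        problems.append("replication.factor should be at least 3")
    if not 2 <= min_isr <= replication_factor:
        problems.append("min.insync.replicas should be between 2 and replication.factor")
    if cfg.get("unclean.leader.election.enable") is not False:
        problems.append("unclean.leader.election.enable should be false")
    if cfg.get("request.required.acks") != -1:
        problems.append("request.required.acks should be -1")
    if cfg.get("producer.type") != "sync":
        problems.append("producer.type should be sync")
    return problems
```

An empty result means the configuration matches all five recommendations; each string in the result names one setting to revisit.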

Benchmark Tests

Vipshop’s Kafka cluster (≈2000 topics, billions of daily requests) was evaluated under various scenarios, varying replica count, min.insync.replicas, acks, partition count, and broker failures.

Key findings:

Higher replica counts reduce TPS; the acks setting has the biggest impact, with throughput ordered acks=0 > acks=1 > acks=-1.

Increasing partitions improves throughput up to a point, after which performance plateaus or degrades.

When brokers fail, the system continues operating if the remaining ISR satisfies min.insync.replicas; otherwise writes are rejected.

With acks=-1 and proper min.insync.replicas, all successfully returned messages are persisted.
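The broker-failure finding reduces to simple arithmetic for a single partition. The sketch below assumes that every replica was in the ISR before the failure, that each failed broker hosted exactly one replica of the partition, and that the producer uses acks=-1; the function names are hypothetical.

```python
def writes_available(replication_factor: int, min_insync_replicas: int,
                     failed_brokers: int) -> bool:
    """True if the remaining in-sync replicas still satisfy min.insync.replicas."""
    remaining_isr = max(replication_factor - failed_brokers, 0)
    return remaining_isr >= min_insync_replicas


def tolerated_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Largest number of broker failures that still leaves writes accepted."""
    return max(replication_factor - min_insync_replicas, 0)
```

With replication.factor=3 and min.insync.replicas=2, one broker can fail without interrupting writes; a second failure causes writes to be rejected until a replica rejoins the ISR.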

Conclusion

Kafka achieves high reliability through a combination of log‑structured storage, configurable replication, ISR management, and flexible acknowledgment settings, allowing operators to balance durability against performance based on workload requirements.

Author: Vipshop Messaging Middleware Team (VMS), Infrastructure Department.
