Big Data 34 min read

Kafka High‑Reliability Architecture, Storage Mechanisms, Replication, and Benchmark Analysis

This article explains Kafka's distributed architecture, its topic‑partition storage model, replication and synchronization mechanisms, reliability guarantees such as ISR and high‑watermark, and presents benchmark results that illustrate how replication factor, acks settings, and partition count affect throughput and latency.

Architecture Digest
Architecture Digest
Architecture Digest
Kafka High‑Reliability Architecture, Storage Mechanisms, Replication, and Benchmark Analysis

1 Overview

Kafka, originally developed by LinkedIn and later donated to the Apache Software Foundation, is a distributed messaging system written in Scala. It is valued for its horizontal scalability and high throughput and is widely integrated with open‑source processing frameworks such as Cloudera, Storm, and Spark. Companies like Vipshop use Kafka as a core internal message engine.

Because Kafka is a commercial‑grade middleware, message reliability is crucial. This article first introduces Kafka’s architecture and basic principles, then analyses its storage mechanism, replication, synchronization, reliability and durability, and finally provides benchmark results to deepen understanding of Kafka’s high‑reliability characteristics.

2 Kafka Architecture

A typical Kafka cluster consists of multiple producers, brokers, consumer groups, and a ZooKeeper ensemble. ZooKeeper manages cluster configuration, leader election, and consumer‑group rebalancing. Producers push messages to brokers, while consumers pull messages from brokers.

2.1 Topic & Partition

Each topic is divided into several partitions; a partition is an append‑only log file. Every message appended to a partition receives a unique long‑type offset. Because writes are sequential, disk I/O is highly efficient.

When a message is sent, the producer can specify a key; the partition is chosen based on the key and the configured partitioner class (which must implement kafka.producer.Partitioner).

# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=3

3 High‑Reliability Storage Analysis

Kafka achieves high reliability through a robust replication strategy. The replication factor is configured in server.properties (e.g., default.replication.factor).

3.1 Kafka File Storage Mechanism

Messages are stored per topic, per partition, and each partition is further split into segments. A segment consists of an .index file (metadata) and a .log file (actual messages). Offsets are used to locate messages within segments, and binary search on the index files enables fast retrieval.

3.2 Replication Principle and Synchronization

Each partition has one leader replica and one or more follower replicas. The leader handles all read/write requests; followers replicate the leader’s log asynchronously. The leader maintains the Log End Offset (LEO) and the High Watermark (HW), which is the smallest LEO among all in‑sync replicas (ISR). A message is considered committed only after it is replicated to all ISR members.

3.3 ISR (In‑Sync Replicas)

ISR is the subset of replicas that are fully caught up with the leader. Replicas that fall behind a configurable lag threshold are removed from ISR and placed into the Out‑of‑Sync Replicas (OSR) list. The ISR list is stored in ZooKeeper and is used for leader election.

3.4 Data Reliability and Durability Guarantees

Producers control reliability via the request.required.acks setting:

1 (default): the leader acknowledges receipt; if the leader crashes before followers replicate, data may be lost.

0: no acknowledgment; highest throughput but no reliability.

-1 (all): the leader waits for all ISR replicas to acknowledge; provides the strongest durability guarantee.

When acks=-1, the min.insync.replicas parameter defines the minimum number of ISR members that must acknowledge a write; otherwise the producer receives an exception.

4 High‑Reliability Usage Analysis

4.1 Message Delivery Guarantees

Kafka can provide at‑most‑once, at‑least‑once, or exactly‑once delivery semantics depending on producer acknowledgment settings and consumer commit behavior.

4.2 Message De‑duplication

Because at‑least‑once delivery can cause duplicates, applications often implement de‑duplication using globally unique identifiers (GUIDs) or external stores such as Redis.

4.3 High‑Reliability Configuration

Topic: replication.factor ≥ 3, 1 ≤ min.insync.replicas ≤ replication.factor Broker: unclean.leader.election.enable=false Producer: request.required.acks=-1,

producer.type=sync

5 Benchmark

5.1 Test Environment

Four Kafka brokers (24‑core CPUs, 62 GB RAM, 1 TB disks) running Kafka 0.10.1.0, with a separate client machine (24‑core CPU, 3 GB RAM).

5.2 Scenario 1 – Impact of Replication Factor, min.insync.replicas and acks on TPS

Increasing the acks level reduces TPS ( acks=0 > acks=1 > acks=-1). Higher replication factors also lower TPS, while min.insync.replicas has no effect when acks is not -1.

5.3 Scenario 2 – Fixed Single Partition, Varying Replication and min.insync.replicas

TPS decreases with more replicas; the impact of min.insync.replicas remains negligible.

5.4 Scenario 3 – Fixed Single Partition, Varying acks and Replication

Again, higher acks values reduce TPS, and more replicas further lower TPS.

5.5 Scenario 4 – Effect of Partition Count

Increasing the number of partitions raises TPS up to a point, after which additional partitions cause a slight TPS drop due to overhead.

5.6 Scenario 5 – Broker Failures

With two brokers down, the client can still send messages but TPS drops. With three brokers down, the client cannot send until enough brokers recover to satisfy min.insync.replicas. Retries cause duplicate writes.

5.7 Scenario 6 – Latency Measurements

When acks=-1, TPS is limited by the replication factor; higher replication leads to higher latency and lower throughput.

Overall, Kafka provides high reliability when configured with sufficient replication, appropriate acks, and careful tuning of min.insync.replicas, while still delivering acceptable performance for large‑scale streaming workloads.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

KafkaReliabilityBenchmark
Architecture Digest
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.