
Kafka Core Concepts: Architecture, Replication, Controllers, High Watermark, and Message Production/Consumption Mechanisms

This article provides a comprehensive overview of Kafka's internal architecture, including its controller role, replication strategy, high‑watermark mechanism, message serialization, partitioning, batching, consumer pull model, log storage format, and retention policies, supplemented by a detailed mind‑map illustration.


Previously we have shared several Kafka articles; this piece consolidates core Kafka kernel knowledge, covering architecture, replication, controller, high‑watermark, log storage, and message production and consumption mechanisms, with a reference mind‑map at the end.

1. Architecture Summary

Kafka follows a master-slave style architecture in which one broker is elected as the Controller to coordinate the entire cluster. Common Kafka terms include broker, topic, partition, segment, producer, and consumer.

2. Message Sending Mechanism

1) Serializer: converts message objects to byte arrays for network transmission.

2) Partitioner: determines the target partition for each message; if a partition is explicitly specified, the partitioner is bypassed.

3) Message Buffer Pool: client‑side buffer (default 32 MB, controlled by buffer.memory).

4) Batch Sending: messages in the buffer are sent in batches (default batch size 16 KB, controlled by batch.size).
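The partitioning rules above can be sketched in a few lines. This is an illustrative stand-in, not the real Kafka client: the function name is invented, and crc32 substitutes for the murmur2 hash the Java client actually uses.

```python
import itertools
import zlib

# Round-robin counter for keyless messages (real clients use round-robin
# or, in newer versions, a "sticky" strategy to fill batches faster).
_round_robin = itertools.count()

def choose_partition(key, num_partitions, explicit_partition=None):
    """Pick a target partition following the producer rules above."""
    if explicit_partition is not None:
        return explicit_partition                  # partitioner is bypassed
    if key is None:
        return next(_round_robin) % num_partitions
    # Keyed messages: hash the key so the same key always lands on the
    # same partition (Kafka uses murmur2; crc32 stands in here).
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

The key property to notice is determinism for keyed messages: all messages with the same key map to the same partition, which is what gives Kafka per-key ordering.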

3. Replication Mechanism

1) The default replication factor is 1 (default.replication.factor).

2) Replication provides redundancy for reliability and high availability through leader election; it does not offer read/write separation.

3) Leader election is handled by the Controller, selecting the first alive replica in the ISR (in‑sync replica) set.

4) Unclean leader election is controlled by unclean.leader.election.enable (default false); when enabled, a non‑ISR replica may become leader if the ISR is empty.
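The election rules in points 3) and 4) can be condensed into a small sketch. This is a simplified model of the Controller's decision, with invented names, not actual broker code:

```python
def elect_leader(isr, alive_brokers, all_replicas, unclean_enabled=False):
    """Return the new leader's broker id, or None if the partition must
    go offline. First alive replica in the ISR wins; an out-of-sync
    replica is considered only when unclean election is enabled."""
    for broker in isr:
        if broker in alive_brokers:
            return broker
    if unclean_enabled:
        for broker in all_replicas:
            if broker in alive_brokers:
                return broker  # may lose messages not yet replicated
    return None
```

The trade-off is visible in the fallback branch: unclean election restores availability when the whole ISR is gone, at the cost of possibly losing committed messages.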

4. Controller Overview

Roles include managing topics, partition reassignment, leader election, metadata management, and broker membership (handling broker failures or additions). Controller election and failover rely on ZooKeeper's znode model and watch mechanisms.

5. High Watermark (HW) Mechanism

HW marks the offset up to which all in-sync replicas have replicated the log, i.e. the minimum LEO across the ISR; only messages at offsets below the HW are visible to consumers. This lets followers synchronize asynchronously without exposing uncommitted data.

LEO (Log End Offset) indicates the position where the next message will be written.
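The relationship between HW and LEO reduces to a single expression. A minimal sketch, assuming we model each in-sync replica only by its LEO:

```python
def high_watermark(isr_leos):
    """HW = the smallest LEO among in-sync replicas; offsets strictly
    below it are committed and visible to consumers."""
    return min(isr_leos)

# Example: the leader has written up to offset 10 while followers have
# caught up to 8 and 9, so consumers may read offsets 0..7 only.
```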

6. Message Consumption Mechanism

Producers push messages, while consumers pull them. Pull advantages: consumers control read speed and volume; disadvantages: consumers must poll repeatedly because they cannot know if new data is available.

Delivery semantics: Kafka guarantees at‑least‑once delivery by default; at‑most‑once can be implemented by users, and exactly‑once depends on the downstream storage system.

Partition assignment strategies include RangeAssignor (default), RoundRobinAssignor, and StickyAssignor (introduced in Kafka 0.11 for more balanced distribution).
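The default RangeAssignor can be sketched as follows; this is a simplified single-topic model, not the actual consumer-group code:

```python
def range_assign(consumers, partitions):
    """RangeAssignor sketch: consumers are sorted, each gets a
    contiguous block of partitions; the first (partitions % consumers)
    members receive one extra partition."""
    members = sorted(consumers)
    per, extra = divmod(len(partitions), len(members))
    assignment, start = {}, 0
    for i, member in enumerate(members):
        count = per + (1 if i < extra else 0)
        assignment[member] = partitions[start:start + count]
        start += count
    return assignment
```

Because the first consumers in sorted order always take the extra partitions, range assignment can skew load when many topics each have a small remainder, which is one motivation for RoundRobinAssignor and StickyAssignor.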

7. Log Storage Mechanism

Logs are stored in segments; each segment file is controlled by log.segment.bytes (default 1 GB). Index files use a sparse format, written every log.index.interval.bytes bytes (default 4 KB).

$ ll
-rw-r--r-- 1 kafka kafka    1002496 Apr 25 17:08 00000000000051402174.index
-rw-r--r-- 1 kafka kafka 1073741338 Apr 25 17:08 00000000000051402174.log
-rw-r--r-- 1 kafka kafka   10485760 Apr 26 15:03 00000000000051638285.index
-rw-r--r-- 1 kafka kafka  219984088 Apr 26 15:04 00000000000051638285.log
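A sparse index trades lookup precision for size: to find an offset, the broker binary-searches the index for the nearest preceding entry, then scans the .log file forward from that position. A minimal sketch of that lookup (illustrative names, in-memory index):

```python
import bisect

def locate(sparse_index, target_offset):
    """sparse_index: sorted (message_offset, file_position) pairs, one
    entry per ~log.index.interval.bytes of appended data. Returns the
    byte position from which to scan forward for target_offset."""
    offsets = [offset for offset, _position in sparse_index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    return sparse_index[i][1] if i >= 0 else 0
```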

Log retention policies are checked periodically (every 5 minutes by default, via log.retention.check.interval.ms) and can be based on size (log.retention.bytes, disabled by default), time (log.retention.hours, default 7 days), or log start offset (introduced in Kafka 0.11 to delete already-processed intermediate messages). Time-based retention is the most commonly used.
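Time-based retention can be sketched as a periodic sweep over closed segments. This is a simplified model with invented names, assuming each segment is described only by its base offset and last-modified time:

```python
def expired_segments(segments, retention_ms, now_ms):
    """segments: (base_offset, last_modified_ms) pairs. The active
    (newest) segment is never eligible for deletion; older segments
    whose data has aged past the retention window are returned."""
    ordered = sorted(segments)
    return [base for base, mtime in ordered[:-1]
            if now_ms - mtime > retention_ms]
```

Note that retention operates on whole segments, which is why a smaller log.segment.bytes makes retention more fine-grained.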

To further deepen Kafka knowledge, the author provides a high‑resolution mind‑map for download.

Follow the public account and reply with kafka2020 to receive the PDF of the Kafka kernel mind‑map.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies