Mastering Kafka: Core Concepts, Architecture, and Reliability Guarantees

This comprehensive guide covers Kafka's definition, publish/subscribe model, key components, storage mechanisms, producer and consumer strategies, and reliability features such as ACK levels, ISR, and exactly‑once semantics, providing a solid foundation for real‑time big‑data processing.

macrozheng

May 21, 2020

This article explains what Kafka is, its architecture, workflow, storage mechanism, and the roles of producers and consumers.

Definition

Kafka is a distributed publish/subscribe message queue, primarily used in real‑time big‑data processing.

Message Queue Benefits

Decoupling: Allows independent scaling or modification of processing on both sides of the queue.

Recoverability: Messages remain in the queue and can be processed after system recovery.

Buffering: Helps handle mismatched production and consumption speeds.

Flexibility & Peak Throughput: Prevents total collapse under sudden overload, enabling critical components to handle spikes.

Asynchronous Communication: Producers can place messages without immediate processing.

Publish/Subscribe Model

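Producers publish messages to a Topic, and multiple consumers can subscribe to that Topic. Unlike a point-to-point queue, a message is not deleted once it has been consumed: every subscriber can read it, and the broker keeps it until its retention policy expires it.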

Architecture

Kafka stores messages from producers in Topics, which are divided into Partitions. Each Partition is an ordered log stored on disk. A Kafka cluster consists of one or more brokers, and Partitions can be distributed across cluster nodes.
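
As a rough mental model of this layout, the Python sketch below treats a Topic as a fixed number of Partitions, each an append-only list. The class and method names (Topic, append, read) are invented for illustration and are not Kafka APIs; brokers, replication, and disk storage are left out.

```python
class Topic:
    """Toy in-memory model: a topic is a fixed number of append-only partition logs."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, value):
        """Append a record to one partition and return its offset."""
        log = self.partitions[partition]
        log.append(value)
        return len(log) - 1

    def read(self, partition, offset, max_records=10):
        """Read records starting at `offset`; reading does not delete anything."""
        return self.partitions[partition][offset:offset + max_records]


topic = Topic("first", num_partitions=3)
o0 = topic.append(0, "a")   # first record in partition 0
o1 = topic.append(0, "b")   # second record in partition 0
o2 = topic.append(1, "c")   # offsets restart from zero in every partition
```

Offsets are assigned per Partition, so ordering in this model (as in Kafka) holds only within a single Partition, not across the whole Topic.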

Key Concepts

Producer: Client that sends messages to a Kafka broker.

Consumer: Client that reads messages from a Kafka broker.

Consumer Group: Set of consumers sharing the load; each partition is consumed by only one consumer in the group.

Broker: A Kafka server; a cluster consists of one or more brokers.

Topic: Logical queue that categorizes messages.

Partition: Physical subdivision of a Topic, each being an ordered log.

Replica: Copies of a partition for fault tolerance, consisting of a Leader and Followers.

Leader: The primary replica handling reads and writes for a partition.

Follower: Replicas that synchronize data from the Leader.

Offset: Sequential position of a record within a partition; a consumer tracks the offset it has read up to.

Zookeeper: Service that stores and manages cluster metadata.

Workflow

Kafka stores records in Topics; each record contains a key, value, and timestamp.

Storage Mechanism

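Messages are appended to the end of a log file. To keep any single file from growing too large, Kafka splits each Partition into Segments. Each Segment has a .log file that holds the messages and an .index file that maps offsets to positions in that .log file; both are named after the offset of the first message in the Segment. The files live in a directory named after the topic and partition number, for example first-0: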

# ls /root/data/kafka/first-0
00000000000000009014.index
00000000000000009014.log
00000000000000009014.timeindex
00000000000000009014.snapshot
leader-epoch-checkpoint
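
Because Segment files are named after their first offset, finding the Segment that holds a given offset reduces to a binary search over those names. The sketch below illustrates the idea only. The function name and the list of base offsets are hypothetical (only 9014 comes from the listing above), and the 20-digit zero padding is assumed from the file names shown.

```python
from bisect import bisect_right


def segment_file_for(offset, base_offsets):
    """Return the .log file name of the segment that should contain `offset`.

    `base_offsets` is the sorted list of segment base offsets, i.e. the
    numbers that appear in the segment file names.
    """
    idx = bisect_right(base_offsets, offset) - 1
    if idx < 0:
        raise ValueError(f"offset {offset} is below the first segment")
    return f"{base_offsets[idx]:020d}.log"


# Hypothetical partition with three segments; only 9014 is taken from the listing.
bases = [0, 9014, 20000]
name = segment_file_for(9500, bases)   # falls in the segment that starts at 9014
```

Within the chosen Segment, the .index file would then narrow the search to a byte position in the .log file; that second step is not modeled here.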

Producer

Data is sent as a ProducerRecord. The record carries a topic (required) and a value, plus an optional partition, timestamp, key, and headers.

Partition assignment rules:

If a partition is specified, the given value is used.

If no partition but a key is present, the key's hash modulo the number of partitions determines the partition.

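If neither is provided, the producer spreads records across partitions round-robin: the first call picks a random integer, each later call increments it, and the result modulo the number of available partitions gives the partition. (Since Kafka 2.4 the default partitioner instead sticks to one partition until a batch is full, then moves on.)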
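
The three rules can be sketched as a single function. This is a simplified illustration, not the client's partitioner: the function name is hypothetical, zlib.crc32 stands in for the hash the real client uses, and the keyless case starts its counter at zero rather than at a random integer.

```python
import itertools
import zlib

_counter = itertools.count()  # shared counter for keyless records


def choose_partition(num_partitions, partition=None, key=None):
    """Pick a partition: explicit value, then key hash, then round-robin."""
    if partition is not None:
        if not 0 <= partition < num_partitions:
            raise ValueError("partition out of range")
        return partition
    if key is not None:
        # Stand-in hash: crc32 is used only because it is deterministic and
        # in the standard library; it is not the hash Kafka clients use.
        return zlib.crc32(key) % num_partitions
    return next(_counter) % num_partitions


explicit = choose_partition(3, partition=2)
by_key_a = choose_partition(3, key=b"user-1")
by_key_b = choose_partition(3, key=b"user-1")  # same key, same partition
```

The practical consequence of the second rule is that all records with the same key land in the same Partition, which is what gives per-key ordering.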

Data Reliability Guarantees

After a Producer sends data, the broker acknowledges receipt (ACK). The Producer proceeds only after receiving an ACK; otherwise it retries.

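The Leader maintains an ISR (In‑Sync Replica) set: the replicas that are currently keeping up with it. When the Producer asks for full acknowledgement, the Leader sends the ACK once every Follower in the ISR has replicated the data. A Follower that falls too far behind is removed from the ISR, and if the Leader fails, the new Leader is elected from the ISR.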

ACK levels:

0: Producer does not wait for any ACK; lowest latency but possible data loss.

1: Leader acknowledges after writing to its log; data may be lost if Leader fails before Followers sync.

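-1 (all): Leader sends the ACK only after every replica in the ISR has written the record. This is the strongest durability setting, but it is only as strong as the ISR is large: if the ISR has shrunk to the Leader alone, it behaves like level 1. Duplicates are possible if the Leader fails after the Followers have replicated the record but before the ACK is sent, because the Producer then retries.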
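
The difference between the three levels comes down to what must have happened before the Producer may treat a send as complete. The function below is a minimal sketch of that decision, with hypothetical names and a deliberately simplified view of replication state; it does not reflect how a broker is implemented.

```python
def leader_can_ack(acks, leader_written, isr_followers_written):
    """Decide whether the producer may treat the send as acknowledged.

    acks: 0, 1 or -1
    leader_written: True once the leader has appended the record
    isr_followers_written: dict of follower id -> True/False for ISR members
    """
    if acks == 0:
        return True                      # producer never waits
    if acks == 1:
        return leader_written            # leader append is enough
    if acks == -1:
        # every follower currently in the ISR must have the record too
        return leader_written and all(isr_followers_written.values())
    raise ValueError("acks must be 0, 1 or -1")


lagging = {"f1": True, "f2": False}      # f2 has not replicated yet
caught_up = {"f1": True, "f2": True}
```

Note that with an empty follower dict the -1 case reduces to the leader check alone, which mirrors the situation where the ISR contains only the Leader.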

Exactly‑Once Semantics

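Setting the producer’s enable.idempotence to true makes the producer idempotent: however many times it resends the same message, the broker persists it only once.

At Least Once + Idempotence = Exactly Once

An idempotent producer is assigned a PID when it initializes, and every message it sends to a given partition carries a sequence number. The broker caches <PID, Partition, SeqNumber> and persists only one message for each such triple. Because the PID changes when the producer restarts, and the sequence is tracked per partition, this guarantee holds only within a single producer session and a single partition.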
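
The deduplication rule can be illustrated with a toy log for one partition. The class below is a sketch under simplifying assumptions: it keeps only the highest sequence number per producer ID and silently drops anything at or below it, whereas a real broker also rejects gaps in the sequence. All names are hypothetical.

```python
class DedupingPartitionLog:
    """Toy broker-side log for one partition that drops retried duplicates."""

    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer id -> highest sequence number accepted

    def append(self, pid, seq, value):
        """Return True if the record was stored, False if it was a duplicate."""
        if seq <= self.last_seq.get(pid, -1):
            return False  # already persisted; this is a retry after a lost ACK
        self.records.append(value)
        self.last_seq[pid] = seq
        return True


log = DedupingPartitionLog()
first = log.append(pid=7, seq=0, value="order-1")
retry = log.append(pid=7, seq=0, value="order-1")    # ACK was lost, producer retries
second = log.append(pid=7, seq=1, value="order-2")
restarted = log.append(pid=8, seq=0, value="order-1")  # new PID after a restart
```

The last call shows the limit of the scheme: a restarted producer gets a new PID, so a message it resends is not recognized as a duplicate.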

Consumer

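A messaging system can deliver data by Pull (the consumer polls the broker) or Push (the broker sends to the consumer). Kafka consumers use Pull. Push makes it hard to serve consumers with different processing speeds, because the broker sets the rate and a slow consumer can be overwhelmed. Pull lets each consumer read at its own pace; its drawback is that a consumer may spin on empty responses when there is no data, which Kafka mitigates by letting the consumer pass a timeout to poll.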

Consumers maintain their current Offset to resume after failures. Offsets were stored in Zookeeper before version 0.9; now they are stored in the internal __consumer_offsets topic.
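
Resuming from a committed Offset can be sketched with a plain list standing in for a partition. The function and its crash_after parameter are hypothetical, and the commit is modeled as simply returning the new offset rather than writing to __consumer_offsets.

```python
def consume(log, committed_offset, batch_size, crash_after=None):
    """Pull records in batches starting from the committed offset.

    Returns (processed_records, new_committed_offset). If `crash_after`
    is given, the consumer stops after that many batches, as if it had failed.
    """
    processed = []
    offset = committed_offset
    batches = 0
    while offset < len(log):
        if crash_after is not None and batches == crash_after:
            break
        batch = log[offset:offset + batch_size]
        processed.extend(batch)
        offset += len(batch)      # advance the offset only after processing
        batches += 1
    return processed, offset


log = ["m0", "m1", "m2", "m3", "m4"]
# First run fails after one batch; the second run resumes where it left off.
done, committed = consume(log, committed_offset=0, batch_size=2, crash_after=1)
rest, committed2 = consume(log, committed_offset=committed, batch_size=2)
```

Advancing the offset only after a batch is processed means a crash in the middle of a batch leads to reprocessing rather than loss, which is the at-least-once side of the trade-off.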

Partition Assignment Strategies

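Two strategies are covered here:

RoundRobin: Treats the partitions of all subscribed topics as one list and deals them out to the consumers in turn. The result is evenly balanced, but it works as intended only when every consumer in the group subscribes to the same topics.

Range: Works topic by topic. The partitions of each topic are divided into contiguous ranges, one per consumer, and when the division is uneven the first consumers get one extra partition. Because the same consumers get the extra partition for every topic, the imbalance grows with the number of topics subscribed.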
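
The two strategies can be compared side by side with a small sketch. Both functions below are simplified illustrations with hypothetical names, and both assume that every consumer in the group subscribes to every topic; they are not the assignor classes shipped with Kafka.

```python
def range_assign(consumers, topics):
    """Range: per topic, give each consumer a contiguous block of partitions.

    topics: dict of topic name -> partition count.
    """
    members = sorted(consumers)
    result = {c: [] for c in members}
    for topic in sorted(topics):
        base, extra = divmod(topics[topic], len(members))
        start = 0
        for i, c in enumerate(members):
            size = base + (1 if i < extra else 0)  # first `extra` consumers get one more
            result[c] += [(topic, p) for p in range(start, start + size)]
            start += size
    return result


def round_robin_assign(consumers, topics):
    """RoundRobin: deal all partitions of all topics out one at a time."""
    members = sorted(consumers)
    result = {c: [] for c in members}
    all_partitions = [(t, p) for t in sorted(topics) for p in range(topics[t])]
    for i, tp in enumerate(all_partitions):
        result[members[i % len(members)]].append(tp)
    return result


topics = {"t1": 3, "t2": 3}
by_range = range_assign(["c1", "c2"], topics)
by_round_robin = round_robin_assign(["c1", "c2"], topics)
```

With two consumers and two three-partition topics, the Range sketch gives c1 four partitions and c2 two, while the RoundRobin sketch gives each consumer three.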

Summary

The article provides an in‑depth theoretical overview of Kafka’s architecture, covering its core concepts, storage design, producer and consumer mechanics, partitioning strategies, and reliability guarantees, laying the groundwork for further practical exploration of Kafka APIs, transactions, interceptors, and monitoring.

