Apache Kafka Overview, Architecture, and Sample Producer/Consumer Code
This article provides a comprehensive overview of Apache Kafka, comparing it with ActiveMQ, explaining its distributed architecture, topics, partitions, consumption models, high‑availability mechanisms, exactly‑once semantics, and includes detailed Java producer and consumer code examples for practical implementation.
1. Overview
Apache Kafka, originally open‑sourced by LinkedIn and now an Apache project, is one of the most widely used distributed messaging systems, marketed as a "distributed streaming platform" since version 0.9.
Key differences from traditional message queues:
Kafka is a distributed system that scales easily.
It provides high throughput for both publishing and subscribing.
Supports multiple consumers with automatic load‑balancing on failures.
Messages are persisted.
Comparison with ActiveMQ:

Feature                  | Kafka                                                                                  | ActiveMQ
Background               | High‑performance, distributed log‑based system for log collection, stream processing, and message distribution. | JMS‑compliant enterprise messaging middleware.
Development Language     | Java, Scala                                                                            | Java
Protocol Support         | Custom protocol                                                                        | JMS
Persistence              | Supported                                                                              | Supported
Transaction Support      | Supported since 0.11.0                                                                 | Supported
Producer Fault Tolerance | Configurable ack levels (0, 1, -1), with possible duplicate data.                      | Retry on failure with an ack model.
Throughput               | High, using batch processing, zero‑copy, and O(1) sequential disk access.              | -
Load Balancing           | Managed via ZooKeeper; producers discover brokers and can target specific partitions.  | -
2. Getting Started
2.1 Producer
Java example that creates a KafkaProducer with typical configuration properties and continuously sends numbered messages.
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class UserKafkaProducer extends Thread {

    private final KafkaProducer<String, String> producer;
    private final String topic;
    private final Properties props = new Properties();

    public UserKafkaProducer(String topic) {
        props.put("bootstrap.servers", "localhost:9092");
        props.put("retries", 0);               // do not retry failed sends
        props.put("batch.size", 16384);        // batch up to 16 KB per partition
        props.put("linger.ms", 1);             // wait up to 1 ms to fill a batch
        props.put("buffer.memory", 33554432);  // 32 MB send buffer
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producer = new KafkaProducer<>(props);
        this.topic = topic;
    }

    @Override
    public void run() {
        int messageNo = 1;
        while (true) {
            String messageStr = "Message_" + messageNo;
            System.out.println("Send: " + messageStr);
            producer.send(new ProducerRecord<>(topic, messageStr));
            messageNo++;
            try {
                Thread.sleep(3000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }
}

2.2 Consumer
Java example that configures a KafkaConsumer, subscribes to topics, polls for records, and prints each record's offset, key, and value.
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
props.put("enable.auto.commit", "true");      // commit offsets automatically
props.put("auto.commit.interval.ms", "1000"); // every second
props.put("session.timeout.ms", "30000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("offset = %d, key = %s, value = %s\n",
                record.offset(), record.key(), record.value());
    }
}

3. Kafka Architecture Principles
Key questions addressed include how topics and partitions are stored, advantages of Kafka's consumption model, and how distributed storage and retrieval are achieved.
3.1 Architecture Diagram
3.2 Terminology
Name
Explanation
Broker
Message‑handling node; a Kafka cluster consists of one or more brokers.
Topic
Logical channel for categorising messages; each message must specify a topic.
Producer
Client that publishes messages to brokers.
Consumer
Client that reads messages from brokers.
ConsumerGroup
Set of consumers sharing the same group id; only one consumer in a group processes a given message.
Partition
Physical subdivision of a topic; each partition maintains an ordered log.
3.3 Topic and Partition
Each message belongs to a topic, which can have multiple partitions. Partitions store logs in append‑only files; offsets provide a unique, monotonically increasing identifier within a partition, guaranteeing order only at the partition level.
Message routing to partitions:
No key → round‑robin distribution.
With key → hash(key) % number_of_partitions, ensuring the same key always lands in the same partition.
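The routing rules above can be sketched as a small Java class. This is an illustrative stand‑in, not Kafka's actual DefaultPartitioner (which hashes keys with murmur2 rather than String.hashCode):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SimplePartitioner {
    private final AtomicInteger counter = new AtomicInteger(0);

    // No key: spread messages round-robin across partitions.
    // With key: hash the key so the same key always maps to the same partition.
    public int partitionFor(String key, int numPartitions) {
        if (key == null) {
            return counter.getAndIncrement() % numPartitions;
        }
        // Mask off the sign bit so the modulo result is never negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

Because the mapping for keyed messages is deterministic, all messages with the same key land in the same partition and therefore keep their relative order.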
3.4 Consumption Model
Kafka uses a pull‑based model where consumers control the read rate and can seek to arbitrary offsets, unlike push‑based systems that may lose messages on consumer failure.
3.5 Network Model
3.5.1 KafkaClient – Single‑Threaded Selector
Suitable for low‑concurrency scenarios.
3.5.2 Kafka Server – Multi‑Threaded Selector
The server uses an acceptor thread plus thread pools for read/write, allowing high concurrency.
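The acceptor‑plus‑worker‑pool pattern can be sketched with plain Java sockets. This is a simplified illustration of the threading model, not Kafka's actual NIO selector code:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AcceptorPoolDemo {

    // One acceptor thread takes new connections and hands each one to a
    // worker pool, so slow clients never block the accept loop.
    static int start(ExecutorService workers) throws IOException {
        ServerSocket acceptor = new ServerSocket(0); // ephemeral port
        Thread acceptLoop = new Thread(() -> {
            while (!acceptor.isClosed()) {
                try {
                    Socket s = acceptor.accept();
                    workers.submit(() -> handle(s));
                } catch (IOException e) {
                    return; // socket closed, stop accepting
                }
            }
        });
        acceptLoop.setDaemon(true);
        acceptLoop.start();
        return acceptor.getLocalPort();
    }

    // Worker: echo a single line back to the client.
    private static void handle(Socket s) {
        try (Socket sock = s;
             BufferedReader in = new BufferedReader(new InputStreamReader(sock.getInputStream()));
             PrintWriter out = new PrintWriter(sock.getOutputStream(), true)) {
            out.println("echo:" + in.readLine());
        } catch (IOException ignored) {
        }
    }

    public static void main(String[] args) throws Exception {
        int port = start(Executors.newFixedThreadPool(4));
        try (Socket client = new Socket("localhost", port);
             PrintWriter out = new PrintWriter(client.getOutputStream(), true);
             BufferedReader in = new BufferedReader(new InputStreamReader(client.getInputStream()))) {
            out.println("hello");
            System.out.println(in.readLine());
        }
    }
}
```

Separating accept from read/write is what lets the server keep admitting new connections while worker threads are busy with existing ones.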
3.6 High‑Reliability Distributed Storage
Kafka achieves durability through log segmentation and sparse indexing, storing each partition as a series of .log and .index files. Reads use binary search on the index to locate the correct segment, then scan the log.
3.6.1 Log Segments
Segments are sized (e.g., 100 messages) and named by the offset of the first message they contain. Sparse indexes keep memory usage low while enabling fast lookups.
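Finding the segment that holds a given offset is a floor search over the sorted segment base offsets; a minimal sketch with illustrative offsets:

```java
import java.util.Arrays;

public class SegmentIndex {
    // Each segment file is named by the offset of its first message,
    // e.g. 00000000.log, 00000100.log, 00000200.log.
    private final long[] baseOffsets;

    public SegmentIndex(long[] baseOffsets) {
        this.baseOffsets = baseOffsets; // must be sorted ascending
    }

    // Binary search for the segment whose base offset is the largest
    // value <= the target offset (the "floor" segment).
    public long segmentFor(long offset) {
        int i = Arrays.binarySearch(baseOffsets, offset);
        if (i >= 0) {
            return baseOffsets[i];       // exact hit on a segment boundary
        }
        int insertion = -i - 1;          // first base offset > target
        return baseOffsets[insertion - 1]; // previous segment holds the offset
    }
}
```

Once the segment is found, the sparse .index file narrows the position further before a short sequential scan of the .log file.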
3.6.2 Replication Mechanism
Each partition has a leader replica and zero or more follower replicas. The ISR (In‑Sync Replicas) set contains replicas that are up‑to‑date and connected to ZooKeeper; only these can become the new leader on failure.
4. High‑Availability Model and Idempotence
Kafka supports three delivery semantics:
at‑least‑once: May deliver duplicates; requires idempotent processing on the consumer side.
at‑most‑once: May lose messages, e.g. if the producer does not retry failed sends.
exactly‑once: Guarantees a single delivery using transactions and sequence numbers (available from 0.11.0).
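On the producer side these semantics map onto a few configuration properties; a sketch of the settings involved (the broker address is a placeholder):

```java
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // placeholder address
// acks=0   -> at-most-once: fire and forget, messages can be lost
// acks=1   -> leader acknowledgement only; lost if the leader fails before replication
// acks=all -> wait for all in-sync replicas before acknowledging
props.put("acks", "all");
props.put("retries", 3); // retries alone give at-least-once (possible duplicates)
// Kafka >= 0.11.0: broker-side deduplication via PID + sequence numbers
props.put("enable.idempotence", "true");
```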
4.1 Implementing Exactly‑Once
4.1.1 Single Producer, Single Topic
Each producer gets a unique PID; messages carry a monotonically increasing sequence number. The broker validates the sequence, rejecting out‑of‑order or duplicate messages.
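The broker‑side check can be sketched as follows; this is an illustrative simplification (real Kafka tracks sequences per PID per partition and keeps more state):

```java
import java.util.HashMap;
import java.util.Map;

public class SequenceValidator {
    // Last sequence number accepted per producer ID (PID); a single
    // partition's view, for brevity.
    private final Map<Long, Integer> lastSeq = new HashMap<>();

    // Accept a message only if its sequence number is exactly one greater
    // than the last one seen from this PID; otherwise it is a duplicate
    // (sequence too low) or out of order (sequence gap).
    public boolean tryAppend(long pid, int seq) {
        int last = lastSeq.getOrDefault(pid, -1);
        if (seq != last + 1) {
            return false; // reject duplicate or gap
        }
        lastSeq.put(pid, seq);
        return true;
    }
}
```

Because a retried send reuses its original sequence number, the broker silently drops the duplicate instead of appending it twice.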
4.1.2 Transactions
Producers provide a Transaction ID; the Transaction Coordinator persists transaction state in an internal topic. Transactions ensure atomicity across multiple partitions and enable recovery after failures.
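With the transactional API (Kafka 0.11.0+), a producer registers its Transaction ID once and then groups sends into atomic units. A minimal sketch; the broker address, topic names, and transactional.id are placeholders, and it requires a running cluster:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TransactionalSend {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("transactional.id", "my-tx-id");        // enables transactions
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions(); // register transactional.id with the coordinator
        try {
            producer.beginTransaction();
            // Writes to multiple topics/partitions commit or abort atomically.
            producer.send(new ProducerRecord<>("foo", "k1", "v1"));
            producer.send(new ProducerRecord<>("bar", "k2", "v2"));
            producer.commitTransaction();
        } catch (Exception e) {
            producer.abortTransaction(); // none of the messages become visible
        } finally {
            producer.close();
        }
    }
}
```

Setting transactional.id implicitly enables idempotence, so the sequence‑number deduplication described above applies inside the transaction as well.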
5. Interview Questions Summary
Typical interview questions derived from the article include why use a message queue, Kafka's storage model, consumption advantages, partitioning rationale, log segmentation, high‑availability mechanisms, and how to design a custom message queue.
Source: https://mp.weixin.qq.com/s/vhwUCdimvpBt5Z38pRX5xw
Architecture Digest