Kafka Overview, Architecture, Installation, and Operational Guide
This article provides a comprehensive introduction to Kafka, covering its definition, message queue concepts, architecture components, installation steps, configuration details, startup procedures, operational commands, producer and consumer mechanisms, reliability guarantees, partition strategies, offset management, and performance optimizations.
1. Kafka Overview
Kafka is a distributed, publish/subscribe based message queue primarily used for real‑time processing in big‑data scenarios.
1.1 Definition
Kafka is a distributed message queue that follows the publish/subscribe model.
1.2 Message Queue
1.2.1 Traditional vs. Modern Queues
Traditional queues require the entire downstream process (e.g., sending an SMS after user registration) to complete before responding to the client. Modern queues allow the system to return a response immediately after persisting data, while subsequent processes run asynchronously.
1.2.2 Benefits of Using a Message Queue
Decoupling
Recoverability
Buffering
Flexibility & peak‑handling capacity
Asynchronous communication
1.2.3 Queue Models
Point‑to‑point: a producer sends a message to a queue; a single consumer retrieves and processes it, and each message is consumed by at most one consumer.
Publish/Subscribe: a producer publishes to a topic, and multiple consumers (subscribers) can receive the same message. Kafka follows this model. In a publish/subscribe system, consumers either pull messages at their own pace or have messages pushed to them; Kafka uses pull.
1.3 Kafka Basic Architecture
The core components are brokers, producers, consumer groups, and ZooKeeper.
Producer – sends messages.
Broker – buffers messages, hosts topics, partitions, and replication.
Consumer group – processes messages; consumers in the same group share partitions.
ZooKeeper – stores cluster metadata; before version 0.9 it also stored consumer offsets.
From version 0.9 onward, offsets are stored in an internal Kafka topic (__consumer_offsets) rather than in ZooKeeper.
1.4 Kafka Installation
A. Install by extracting the tarball:
tar -zxvf kafka_2.11-2.1.1.tgz -C /usr/local/
B. View configuration files:
[root@es1 config]# pwd
/usr/local/kafka/config
[root@es1 config]# ll
... (list of *.properties files) ...
C. Edit server.properties to set broker.id (unique per broker).
D. Set the data storage path (must contain only Kafka data).
E. Configure whether topics can be deleted (default: not allowed).
F. Set data retention time (default 7 days).
G. Set maximum log file size (e.g., 1 GB).
H. Configure ZooKeeper connection address and Kafka timeout.
I. Set default number of partitions.
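Steps C through I map to keys in server.properties; a minimal sketch, assuming the install path and hostnames from the session above (values are examples to adjust per environment):

```
# server.properties – key settings
broker.id=0                                     # step C: unique per broker
log.dirs=/usr/local/kafka/data                  # step D: directory holding only Kafka data
delete.topic.enable=true                        # step E: allow topic deletion (default false)
log.retention.hours=168                         # step F: retain data 7 days
log.segment.bytes=1073741824                    # step G: 1 GB max per log segment
zookeeper.connect=es1:2181,es2:2181,es3:2181    # step H: ZooKeeper ensemble
zookeeper.connection.timeout.ms=6000            # step H: ZooKeeper connect timeout
num.partitions=1                                # step I: default partitions for new topics
```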
1.5 Starting Kafka
A. Foreground (blocking) start: the process occupies the terminal, and each broker must be started manually.
B. Daemon start (recommended): pass the -daemon flag so the broker runs in the background.
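Assuming Kafka is installed under /usr/local/kafka as above, the two start modes look like this; run them on every broker:

```shell
cd /usr/local/kafka

# A. Foreground (blocking) start – occupies the terminal
bin/kafka-server-start.sh config/server.properties

# B. Daemon start (recommended) – broker runs in the background
bin/kafka-server-start.sh -daemon config/server.properties

# Graceful shutdown
bin/kafka-server-stop.sh
```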
1.6 Kafka Operations
A. List existing topics (connects to ZooKeeper).
B. Create a topic with specified partitions and replication factor.
C. Delete a topic.
D. View topic details.
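For Kafka 2.1.x the topic tooling still connects to ZooKeeper; a sketch of the four operations, with `es1` and the topic name `test` as example values:

```shell
cd /usr/local/kafka

# A. List existing topics
bin/kafka-topics.sh --zookeeper es1:2181 --list

# B. Create a topic with 3 partitions and replication factor 2
bin/kafka-topics.sh --zookeeper es1:2181 --create --topic test \
  --partitions 3 --replication-factor 2

# C. Delete a topic (takes effect only if delete.topic.enable=true)
bin/kafka-topics.sh --zookeeper es1:2181 --delete --topic test

# D. View topic details: leader, replicas, and ISR per partition
bin/kafka-topics.sh --zookeeper es1:2181 --describe --topic test
```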
1.7 Producing and Consuming Messages
A. Start a producer (connects to port 9092).
B. Start a consumer (connects to 9092; pre‑0.9 used 2181).
Multiple consumers can be started to test parallel consumption.
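A sketch of the console producer and consumer, assuming broker `es1` and topic `test`:

```shell
cd /usr/local/kafka

# A. Console producer – connects to a broker on port 9092
bin/kafka-console-producer.sh --broker-list es1:9092 --topic test

# B. Console consumer – bootstrap-server on 9092
#    (pre-0.9 consumers instead used --zookeeper es1:2181)
bin/kafka-console-consumer.sh --bootstrap-server es1:9092 \
  --topic test --from-beginning
```

Run the consumer command in several terminals to test parallel consumption.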
2. Deep Dive into Kafka Architecture
2.1 Kafka Workflow
Messages are categorized by topics; producers write to topics, consumers read from topics. A topic is a logical concept; partitions are physical storage units.
Each partition has replicas and is backed by a log file where messages are appended with an offset. Consumers track the offset they have processed.
2.2 Kafka Internals
To avoid large log files, each partition is split into segments, each consisting of an index file and a log file. The index maps offsets to physical positions, enabling fast binary search.
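This layout is visible on disk in each partition's directory; a sketch assuming topic `test`, partition 0, and the data path configured earlier (the file names below are illustrative):

```shell
# Segment files are named after the first offset the segment contains,
# e.g. a second segment starting at some later base offset:
ls /usr/local/kafka/data/test-0
# example: 00000000000000000000.index  00000000000000000000.log
#          00000000000000170410.index  00000000000000170410.log

# Dump a segment's index and messages for inspection
bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
  --files /usr/local/kafka/data/test-0/00000000000000000000.log
```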
3. Producers and Consumers
3.1 Producers
Partitions improve concurrency; producers can specify a partition or use round‑robin distribution.
3.2 Reliability via ACKs
Producers receive an acknowledgment (ACK) from the partition leader; when the ACK is sent determines the durability guarantee. Three ACK levels:
acks=0 – fire‑and‑forget: the producer does not wait for any ACK (highest loss risk).
acks=1 – the leader ACKs after writing the message to its own log; data can be lost if the leader fails before followers replicate it.
acks=-1 (all) – leader waits for all in‑sync replicas (ISR) to sync before ack.
ISR (in‑sync replica) list is dynamic; lagging replicas are removed from ISR after a timeout.
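These levels are set in the producer's configuration; a minimal sketch (min.insync.replicas is a broker/topic-level setting that pairs with acks=all):

```
# producer configuration
acks=all       # equivalent to acks=-1: wait for all in-sync replicas
retries=3      # retry transient send failures

# broker/topic configuration: acks=all only guarantees durability
# while the ISR cannot shrink below this size
min.insync.replicas=2
```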
3.3 Consumer Consistency (HW)
HW (high water mark) is the smallest LEO (log end offset) among ISR replicas; consumers can only read up to HW, ensuring they never see data that might be lost if the leader fails.
3.4 Consumer Mechanics
3.4.1 Consumption Model
Kafka uses pull: consumers request data at their own pace, which avoids overwhelming slow consumers. When no data is available, the consumer's poll blocks for a configurable timeout instead of busy‑waiting.
3.4.2 Partition Assignment Strategies
Two strategies:
RoundRobin – distributes all partitions across all consumers one by one; it balances well but assumes every consumer in the group subscribes to the same set of topics.
Range (default) – assigns contiguous ranges of partitions per topic, which can lead to imbalance when a consumer group subscribes to multiple topics.
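The strategy is chosen per consumer via partition.assignment.strategy; a sketch switching from the default Range assignor to RoundRobin:

```
# consumer configuration
group.id=my-group
partition.assignment.strategy=org.apache.kafka.clients.consumer.RoundRobinAssignor
# default: org.apache.kafka.clients.consumer.RangeAssignor
```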
3.4.3 Offset Management
Offsets are stored per consumer group, topic, and partition. Offsets can be persisted in ZooKeeper (legacy) or in an internal Kafka topic.
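Committed offsets can be inspected with the consumer-groups tool; a sketch with `es1` and the group name `my-group` as example values:

```shell
# Per partition: committed offset, log end offset, and lag
bin/kafka-consumer-groups.sh --bootstrap-server es1:9092 \
  --describe --group my-group

# Legacy (pre-0.9) layout: offsets lived in ZooKeeper under
# /consumers/<group>/offsets/<topic>/<partition>
```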
3.4.4 Consumer Group Example
Changing the consumer‑group.id, starting multiple consumers, and observing each consumer processing its own partition.
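A sketch of the experiment: start two consumers with the same group.id in separate terminals; the group's partitions are split between them, so each message is delivered to exactly one of the two:

```shell
# Terminal 1 and Terminal 2 – same group, so partitions are divided
bin/kafka-console-consumer.sh --bootstrap-server es1:9092 \
  --topic test --group my-group

# Terminal 3 – produce a few messages and watch each one appear
# in only one of the two consumers
bin/kafka-console-producer.sh --broker-list es1:9092 --topic test
```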
4. High‑Performance Read/Write Mechanisms
4.1 Distributed Deployment
Multiple nodes operate in parallel.
4.2 Sequential Disk Writes
Kafka appends data to the end of log files, achieving high sequential write throughput (≈600 MB/s) compared to random writes.
4.3 Zero‑Copy Transfer
Kafka uses the operating system's zero‑copy mechanism (sendfile): data moves directly from the page cache to the network socket in kernel space, without being copied into user space, greatly improving transfer performance.
5. Role of ZooKeeper in Kafka
ZooKeeper elects a controller broker that manages broker registration, topic partition assignment, and leader election.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.