
Kafka Overview, Architecture, Installation, and Operational Guide

This article provides a comprehensive introduction to Kafka, covering its definition, message queue concepts, architecture components, installation steps, configuration details, startup procedures, operational commands, producer and consumer mechanisms, reliability guarantees, partition strategies, offset management, and performance optimizations.

Top Architect

1. Kafka Overview

Kafka is a distributed, publish/subscribe based message queue primarily used for real‑time processing in big‑data scenarios.

1.1 Definition

Kafka is a distributed message queue that follows the publish/subscribe model.

1.2 Message Queue

1.2.1 Traditional vs. Modern Queues

Traditional queues require the entire downstream process (e.g., sending an SMS after user registration) to complete before responding to the client. Modern queues allow the system to return a response immediately after persisting data, while subsequent processes run asynchronously.

1.2.2 Benefits of Using a Message Queue

Decoupling

Recoverability

Buffering

Flexibility & peak‑handling capacity

Asynchronous communication

1.2.3 Queue Models

Point‑to‑point: a producer sends a message to a queue, a single consumer retrieves and processes it. Only one consumer can consume a given message.

Publish/Subscribe: a producer publishes to a topic, and every subscribing consumer receives the same message. Kafka uses this model. Subscribers can either pull messages or have them pushed by the broker; Kafka consumers pull.
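The two models can be sketched in a few lines of Python. This is an in-memory illustration of the delivery semantics only, not how Kafka stores anything:

```python
from collections import defaultdict, deque

class PointToPointQueue:
    """Point-to-point: each message is delivered to exactly one consumer."""
    def __init__(self):
        self.q = deque()
    def send(self, msg):
        self.q.append(msg)
    def receive(self):
        return self.q.popleft() if self.q else None

class PubSubTopic:
    """Publish/subscribe: every subscriber gets its own copy of each message."""
    def __init__(self):
        self.subscribers = defaultdict(deque)
    def subscribe(self, name):
        self.subscribers[name]  # create an empty inbox for this subscriber
    def publish(self, msg):
        for inbox in self.subscribers.values():
            inbox.append(msg)
    def poll(self, name):  # pull-style delivery, as in Kafka
        inbox = self.subscribers[name]
        return inbox.popleft() if inbox else None

q = PointToPointQueue()
q.send("m1")
print(q.receive())  # m1 -- consumed once, then gone
print(q.receive())  # None

t = PubSubTopic()
t.subscribe("c1")
t.subscribe("c2")
t.publish("m1")
print(t.poll("c1"), t.poll("c2"))  # both subscribers see m1
```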

1.3 Kafka Basic Architecture

The core components are brokers, producers, consumer groups, and ZooKeeper.

Producer – sends messages.

Broker – buffers messages, hosts topics, partitions, and replication.

Consumer group – processes messages; consumers in the same group share partitions.

ZooKeeper – stores cluster metadata and consumer offsets (pre‑0.9).

Before version 0.9, offsets were stored in ZooKeeper; from 0.9 onward they are stored in an internal Kafka topic, __consumer_offsets.

1.4 Kafka Installation

A. Install by extracting the tarball:

tar -zxvf kafka_2.11-2.1.1.tgz -C /usr/local/

B. View configuration files:

[root@es1 config]# pwd
/usr/local/kafka/config
[root@es1 config]# ll
... (list of *.properties files) ...

C. Edit server.properties to set broker.id (unique per broker).

D. Set the data storage path (must contain only Kafka data).

E. Configure whether topics can be deleted (delete.topic.enable; older releases default to not allowing deletion).

F. Set data retention time (default 7 days).

G. Set maximum log file size (e.g., 1 GB).

H. Configure ZooKeeper connection address and Kafka timeout.

I. Set default number of partitions.
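Putting steps C through I together, a minimal server.properties might look like the following. The property names are Kafka's; the hostnames and values are illustrative, not recommendations:

```properties
broker.id=0                                   # C: unique per broker
log.dirs=/data/kafka-logs                     # D: directory holding only Kafka data
delete.topic.enable=true                      # E: allow topic deletion
log.retention.hours=168                       # F: keep data for 7 days
log.segment.bytes=1073741824                  # G: 1 GB per segment file
zookeeper.connect=es1:2181,es2:2181,es3:2181  # H: ZooKeeper ensemble
zookeeper.connection.timeout.ms=6000          # H: connection timeout
num.partitions=1                              # I: default partitions for new topics
```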

1.5 Starting Kafka

A. Foreground (blocking) start: bin/kafka-server-start.sh config/server.properties occupies the terminal, and each broker must be started by hand.

B. Daemon start (recommended): bin/kafka-server-start.sh -daemon config/server.properties runs the broker in the background; a short script can start every broker in the cluster.

1.6 Kafka Operations

A. List existing topics (in this version, kafka-topics.sh connects to ZooKeeper):

bin/kafka-topics.sh --zookeeper es1:2181 --list

B. Create a topic with a given partition count and replication factor:

bin/kafka-topics.sh --zookeeper es1:2181 --create --topic test --partitions 3 --replication-factor 2

C. Delete a topic:

bin/kafka-topics.sh --zookeeper es1:2181 --delete --topic test

D. View topic details:

bin/kafka-topics.sh --zookeeper es1:2181 --describe --topic test

1.7 Producing and Consuming Messages

A. Start a console producer, connecting to a broker on port 9092:

bin/kafka-console-producer.sh --broker-list es1:9092 --topic test

B. Start a console consumer, also on port 9092 (before 0.9, consumers connected to ZooKeeper on port 2181 instead):

bin/kafka-console-consumer.sh --bootstrap-server es1:9092 --topic test

Start several consumers in the same group to test parallel consumption.

2. Deep Dive into Kafka Architecture

2.1 Kafka Workflow

Messages are categorized by topics; producers write to topics, consumers read from topics. A topic is a logical concept; partitions are physical storage units.

Each partition has replicas and is backed by a log file where messages are appended with an offset. Consumers track the offset they have processed.

2.2 Kafka Internals

To avoid unbounded log files, each partition's log is split into segments, each consisting of an index file (.index) and a data file (.log) named after the segment's first offset. The sparse index maps offsets to physical positions in the log file, so a lookup is a binary search over the index followed by a short sequential scan.
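A toy version of the segment-index lookup, assuming a sparse index of (relative offset, byte position) pairs. The entry format is simplified; real .index entries are binary:

```python
import bisect

# Sparse index for one segment: (relative_offset, byte_position) pairs,
# sorted by offset -- analogous to entries in a Kafka .index file.
index = [(0, 0), (50, 4096), (100, 8192), (150, 12288)]

def locate(target_offset, base_offset=0):
    """Return the byte position at which to start scanning for target_offset."""
    rel = target_offset - base_offset
    # Binary search for the last index entry at or before rel.
    i = bisect.bisect_right([off for off, _ in index], rel) - 1
    if i < 0:
        raise ValueError("offset before segment start")
    entry_offset, position = index[i]
    return position  # scan the .log file forward from here

print(locate(120))  # 8192: jump to entry (100, 8192), then scan forward
```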

3. Producers and Consumers

3.1 Producers

Partitions improve concurrency; producers can specify a partition or use round‑robin distribution.
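A sketch of the partition-selection rule. Kafka's real default partitioner uses murmur2 hashing rather than Python's hash; the keyed-versus-unkeyed logic is what matters here:

```python
def choose_partition(key, num_partitions, counter=[0]):
    """Sketch of producer partition selection: keyed messages hash to a
    stable partition; unkeyed messages are spread round-robin."""
    if key is not None:
        return hash(key) % num_partitions  # same key -> same partition
    counter[0] += 1
    return (counter[0] - 1) % num_partitions

# Keyed messages always land on the same partition:
assert choose_partition("user-42", 3) == choose_partition("user-42", 3)
# Unkeyed messages rotate across partitions:
print([choose_partition(None, 3) for _ in range(6)])  # [0, 1, 2, 0, 1, 2]
```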

3.2 Reliability via ACKs

Producers receive an acknowledgment from the broker once a message is written; how much replication must complete before the ack depends on the acks setting. Three ACK levels:

acks=0 – fire‑and‑forget (high loss risk).

acks=1 – the leader acknowledges after writing the message to its own log (data can be lost if the leader fails before followers replicate it).

acks=-1 (all) – leader waits for all in‑sync replicas (ISR) to sync before ack.

The ISR (in-sync replica) list is dynamic: a follower that falls behind for longer than replica.lag.time.max.ms is removed from the ISR until it catches up.
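The three levels can be summarized as a decision rule. This is an illustration of the semantics above, not Kafka source code, and the broker/replica names are made up:

```python
def should_ack(acks, leader_written, isr_synced, isr):
    """When does the leader respond to the producer, per acks level?"""
    if acks == 0:
        return True                       # producer never waits for a response
    if acks == 1:
        return leader_written             # leader's own log write is enough
    if acks in (-1, "all"):
        # every ISR member must have caught up
        return leader_written and isr_synced == set(isr)
    raise ValueError(f"unknown acks setting: {acks}")

print(should_ack(1, True, set(), ["b1", "b2"]))          # True
print(should_ack(-1, True, {"b1"}, ["b1", "b2"]))        # False: b2 is lagging
print(should_ack(-1, True, {"b1", "b2"}, ["b1", "b2"]))  # True
```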

3.3 Consumer Consistency (HW)

HW (high water mark) is the smallest LEO (log end offset) among ISR replicas; consumers can only read up to HW, ensuring they never see data that might be lost if the leader fails.
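In code form, the HW rule is just a minimum over the ISR members' LEOs (the replica names and offsets below are illustrative):

```python
def high_watermark(leo_by_replica, isr):
    """HW = the smallest log-end-offset (LEO) among in-sync replicas.
    Consumers may only read up to this offset."""
    return min(leo_by_replica[r] for r in isr)

leo = {"leader": 10, "follower1": 8, "follower2": 6}
print(high_watermark(leo, isr=["leader", "follower1", "follower2"]))  # 6
# If follower2 drops out of the ISR, the HW can advance:
print(high_watermark(leo, isr=["leader", "follower1"]))               # 8
```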

3.4 Consumer Mechanics

3.4.1 Consumption Model

Kafka uses pull: consumers request data at their own pace, so a slow consumer is never overwhelmed. To avoid busy-looping when a partition is empty, the consumer passes a timeout to its poll and blocks briefly instead of spinning.
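A minimal sketch of pull-with-timeout using Python's queue module; the timeout plays the role of the consumer's poll timeout:

```python
import queue

buffer = queue.Queue()  # stands in for a partition's available messages

def pull(timeout=0.1):
    """Pull-style consumption: block for at most `timeout` seconds,
    then return an empty batch instead of spinning."""
    try:
        return [buffer.get(timeout=timeout)]
    except queue.Empty:
        return []

buffer.put("msg-1")
print(pull())  # ['msg-1']
print(pull())  # []  (timed out -- no data available)
```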

3.4.2 Partition Assignment Strategies

Two strategies:

RoundRobin – treats the partitions of all subscribed topics as a single pool and interleaves them across consumers; it is safe only when every consumer in the group subscribes to the same set of topics.

Range (default) – divides each topic's partitions among consumers independently, which can leave the first consumers with more partitions when the group subscribes to multiple topics.
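The imbalance can be demonstrated with simplified versions of the two assignors. The real implementations are Kafka's RangeAssignor and RoundRobinAssignor; this is only a sketch of their distribution rules:

```python
def range_assign(consumers, partitions_by_topic):
    """Range: divide each topic's partitions among consumers independently.
    With two topics of 3 partitions each and two consumers, the first
    consumer gets the extra partition of every topic."""
    assignment = {c: [] for c in consumers}
    for topic, n in sorted(partitions_by_topic.items()):
        per, extra = divmod(n, len(consumers))
        start = 0
        for i, c in enumerate(sorted(consumers)):
            count = per + (1 if i < extra else 0)
            assignment[c] += [(topic, p) for p in range(start, start + count)]
            start += count
    return assignment

def roundrobin_assign(consumers, partitions_by_topic):
    """RoundRobin: interleave ALL partitions across all consumers."""
    assignment = {c: [] for c in consumers}
    ordered = sorted(consumers)
    all_parts = [(t, p) for t, n in sorted(partitions_by_topic.items())
                 for p in range(n)]
    for i, tp in enumerate(all_parts):
        assignment[ordered[i % len(ordered)]].append(tp)
    return assignment

topics = {"t1": 3, "t2": 3}
print(range_assign(["c1", "c2"], topics))       # c1 gets 4 partitions, c2 gets 2
print(roundrobin_assign(["c1", "c2"], topics))  # 3 partitions each
```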

3.4.3 Offset Management

Offsets are stored per consumer group, topic, and partition. Offsets can be persisted in ZooKeeper (legacy) or in an internal Kafka topic.
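A minimal sketch of offset bookkeeping keyed by (group, topic, partition), mirroring the key of the committed-offsets store; the -1 sentinel and the group/topic names are this sketch's conventions, not Kafka's:

```python
class OffsetStore:
    """Committed offsets, keyed by (consumer group, topic, partition)."""
    def __init__(self):
        self._offsets = {}

    def commit(self, group, topic, partition, offset):
        self._offsets[(group, topic, partition)] = offset

    def fetch(self, group, topic, partition):
        # A missing entry means "no committed offset yet".
        return self._offsets.get((group, topic, partition), -1)

store = OffsetStore()
store.commit("analytics", "clicks", 0, 42)
print(store.fetch("analytics", "clicks", 0))  # 42
print(store.fetch("billing", "clicks", 0))    # -1: groups track offsets independently
```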

3.4.4 Consumer Group Example

Example: give several consumers the same group.id, start them against a multi-partition topic, and observe that each consumer processes only its own partitions.

4. High‑Performance Read/Write Mechanisms

4.1 Distributed Deployment

Multiple nodes operate in parallel.

4.2 Sequential Disk Writes

Kafka appends data to the end of its log files. Sequential writes to spinning disks can reach on the order of 600 MB/s, versus roughly 100 KB/s for random writes, which is why the append-only design achieves such high throughput.

4.3 Zero‑Copy Transfer

Using the operating system's zero-copy mechanism (sendfile), Kafka moves data directly from the page cache to the network socket without copying it into user space, greatly improving transfer performance.

5. Role of ZooKeeper in Kafka

ZooKeeper elects a controller broker that manages broker registration, topic partition assignment, and leader election.


Tags: distributed systems, big data, Kafka, message queue, consumer, installation, producer
Written by Top Architect

Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
