
Kafka Basics: Topics, Partitions, Producers, Consumers, and Cluster Architecture

This article provides a comprehensive introduction to Kafka, covering its role as a message system, core concepts such as topics, partitions, producers, consumers, messages, the cluster architecture with replicas and controllers, performance optimizations, log segmentation, and network design, all illustrated with diagrams and code examples.

Top Architect

Preface

At the request of many readers, here is a light‑hearted introduction to Kafka before diving into Yarn.

1. Kafka Basics

Message System Role

Think of a message system as a warehouse that sits between producers and consumers: it temporarily stores data and decouples the two sides, much as a storage container holds oil between delivery and use.

It behaves like a cache, but the data is persisted on disk rather than held in memory.

1. Topic

A Kafka topic is analogous to a relational database table; it is a logical grouping of messages.

To retrieve data from a specific source, simply subscribe to the corresponding topic (e.g., TopicA).

2. Partition

Each topic is split into partitions, which are distributed across multiple brokers. On disk, each partition is a directory (named <topic>-<partition>, e.g., TopicA-0) that stores its data in .log files. As with database partitioning, this improves performance through parallelism.

Multiple partitions enable multiple threads to process data concurrently, which is far faster than single‑threaded processing.

Note: each partition can have replicas to avoid single‑point failures, and partition numbering starts from 0.
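To make the idea of spreading messages over partitions concrete, here is a minimal sketch of key-based partition selection. It assumes a simple stand-in hash; Kafka's default partitioner hashes the serialized key with murmur2 instead, and the class and method names below are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: maps a message key to a partition number in [0, numPartitions).
final class PartitionSketch {
    static int partitionFor(String key, int numPartitions) {
        // Stand-in hash (assumption); Kafka's default partitioner uses murmur2 on the key bytes.
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        // Clear the sign bit so the modulo result is never negative.
        return (hash & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"user-1", "user-2", "user-3"}) {
            System.out.println(key + " -> partition " + partitionFor(key, 3));
        }
    }
}
```

Because the mapping depends only on the key and the partition count, messages with the same key always land in the same partition, and partition numbers start from 0.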

3. Producer

Producers send data into the message system.

4. Consumer

Consumers read data from Kafka.

5. Message

The unit of data processed by Kafka is called a message.
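The relationship between producer, consumer, and message can be sketched with a toy in-memory partition. This is not the Kafka client API; the class, its methods, and the key/value strings are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory model of one partition; not the Kafka client API.
final class InMemoryPartitionSketch {
    // A message: key, value, and the offset assigned when it was appended.
    static final class Message {
        final long offset;
        final String key;
        final String value;

        Message(long offset, String key, String value) {
            this.offset = offset;
            this.key = key;
            this.value = value;
        }
    }

    private final List<Message> log = new ArrayList<>();

    // Producer side: append to the end of the log and return the new message's offset.
    long append(String key, String value) {
        long offset = log.size();
        log.add(new Message(offset, key, value));
        return offset;
    }

    // Consumer side: read up to maxMessages starting at fromOffset.
    List<Message> read(long fromOffset, int maxMessages) {
        int start = (int) Math.max(0, Math.min(fromOffset, log.size()));
        int end = Math.min(start + maxMessages, log.size());
        return new ArrayList<>(log.subList(start, end));
    }

    public static void main(String[] args) {
        InMemoryPartitionSketch partition = new InMemoryPartitionSketch();
        partition.append("user-1", "login");
        partition.append("user-2", "click");
        for (Message m : partition.read(0, 10)) {
            System.out.println(m.offset + ": " + m.key + " = " + m.value);
        }
    }
}
```

Reading does not remove anything from the log, which is why several consumers can read the same data at their own pace.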

2. Kafka Cluster Architecture

Creating a topic with three partitions distributes each partition across different brokers. The topic itself remains a logical concept.

Older Kafka versions (< 0.8) lack replication, which can cause data loss on broker failure.

Replica

Each partition can have multiple replicas for fault tolerance. Two replicas are often sufficient, although a replication factor of 3 is a common choice in production.

One replica acts as the leader and the others are followers. Producers write to the leader, and followers replicate from it. By default, consumers also read from the leader.
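One way to picture how replicas end up on different brokers is a simplified round-robin placement, sketched below. This is an assumption-laden illustration, not Kafka's actual assignment algorithm, and the names are hypothetical; the first broker in each list stands for the partition's leader.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified replica placement: partition p starts at broker p and wraps around.
final class ReplicaAssignmentSketch {
    // Returns, per partition, the broker ids holding its replicas; index 0 plays the leader.
    static List<List<Integer>> assign(int numPartitions, int numBrokers, int replicationFactor) {
        if (replicationFactor > numBrokers) {
            throw new IllegalArgumentException("replication factor exceeds broker count");
        }
        List<List<Integer>> assignment = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replicationFactor; r++) {
                // Consecutive brokers, so replicas of one partition never share a broker.
                replicas.add((p + r) % numBrokers);
            }
            assignment.add(replicas);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Three partitions, three brokers, two replicas each.
        System.out.println(assign(3, 3, 2)); // prints [[0, 1], [1, 2], [2, 0]]
    }
}
```

With this layout, losing any single broker still leaves one replica of every partition available.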

Consumer Group

Every consumer belongs to a consumer group, identified by group.id. Set it explicitly: recent Java clients do not supply a usable default, and subscribing to a topic without a group.id fails.

    conf.setProperty("group.id", "tellYourDream");

Within a group, a given partition is read by only one consumer, which prevents duplicate consumption. Different groups consume the same topic independently, as in this layout:

consumerA:
    group.id = a
consumerB:
    group.id = a

consumerC:
    group.id = b
consumerD:
    group.id = b

Thus, a consumer group enables parallel consumption without overlap.

Within a group, each partition is consumed by only one consumer, but one consumer can handle several partitions when the group has fewer consumers than partitions.
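The assignment rule can be sketched as a round-robin over the group's members. This is a simplification for illustration; real Kafka clients ship several assignment strategies, and the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Round-robin assignment of partitions to the consumers of one group.
final class GroupAssignmentSketch {
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        if (consumers.isEmpty()) {
            throw new IllegalArgumentException("a group needs at least one consumer");
        }
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            // Each partition gets exactly one owner within the group.
            String owner = consumers.get(p % consumers.size());
            assignment.get(owner).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Two consumers in group "a" sharing a three-partition topic.
        System.out.println(assign(Arrays.asList("consumerA", "consumerB"), 3));
        // prints {consumerA=[0, 2], consumerB=[1]}
    }
}
```

If the group has more consumers than partitions, the extra consumers receive an empty list and sit idle.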

Controller

In ZooKeeper-based deployments, Kafka follows a master-slave architecture: one broker is elected controller and acts as the master node, coordinating cluster metadata through ZooKeeper.

Kafka and ZooKeeper Coordination

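On startup, every broker registers itself in ZooKeeper, and the brokers compete to create the ephemeral /controller node; the first one to succeed becomes the controller. The controller watches ZooKeeper directories (e.g., /brokers/) to discover brokers and distributes metadata to them.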
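In ZooKeeper-based Kafka, the election amounts to brokers racing to create an ephemeral /controller node, with the first creator winning. The sketch below models that race with an in-memory map standing in for ZooKeeper; it does not use any ZooKeeper API, and all names are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory stand-in for the "/controller" election; no real ZooKeeper involved.
final class ControllerElectionSketch {
    private final ConcurrentMap<String, Integer> znodes = new ConcurrentHashMap<>();

    // A broker tries to create "/controller"; returns the id of whoever holds it afterwards.
    int tryBecomeController(int brokerId) {
        Integer existing = znodes.putIfAbsent("/controller", brokerId);
        return existing == null ? brokerId : existing;
    }

    // Models the ephemeral node disappearing when the controller's session ends.
    void brokerFailed(int brokerId) {
        znodes.remove("/controller", brokerId);
    }

    public static void main(String[] args) {
        ControllerElectionSketch zk = new ControllerElectionSketch();
        System.out.println(zk.tryBecomeController(1)); // prints 1
        System.out.println(zk.tryBecomeController(2)); // prints 1
        zk.brokerFailed(1);
        System.out.println(zk.tryBecomeController(2)); // prints 2
    }
}
```

The atomic create-if-absent step is what guarantees that only one broker considers itself controller at a time.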

When a new topic is created, the controller creates corresponding directories under ZooKeeper, propagates the partition plan to all brokers, and each broker creates its local partition directories.

Additional Topics

1. Why Kafka Performs Well

① Sequential Writes

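Kafka only appends to the end of its log files. Sequential disk writes avoid costly seek operations, so their throughput is far higher than that of random disk access and, in favorable cases, can approach memory speed.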
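As a rough illustration of append-only writing, the sketch below opens a file in append mode and writes length-prefixed records one after another. The record layout and the names are assumptions for illustration, not Kafka's on-disk format.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Append-only writer: every record goes to the current end of the file.
final class AppendOnlyLogSketch {
    // Appends one length-prefixed record and returns the byte position where it starts.
    static long append(FileChannel channel, String record) throws IOException {
        byte[] payload = record.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buffer = ByteBuffer.allocate(4 + payload.length);
        buffer.putInt(payload.length).put(payload).flip();
        long start = channel.size();
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
        return start;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("segment", ".log");
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            System.out.println(append(channel, "first"));  // prints 0
            System.out.println(append(channel, "second")); // prints 9
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

No write ever seeks backwards, which is the access pattern that disks and the operating system's page cache handle best.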

② Zero‑Copy

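Kafka uses the Linux sendfile system call (exposed in Java NIO as FileChannel.transferTo) to move data from the file, via the OS page cache, directly to the network socket, eliminating extra copies between kernel and user space and the associated context switches.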
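The Java entry point for this technique is FileChannel.transferTo. The sketch below shows the call pattern; to stay runnable without a network it sends to an in-memory sink, so treat it as a demonstration of the API only, since the kernel-level zero-copy path typically applies when the target is a socket channel. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Call pattern for FileChannel.transferTo, the API behind zero-copy transfers.
final class ZeroCopySketch {
    // Sends the whole file to the target channel and returns the number of bytes sent.
    static long sendFile(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            long position = 0;
            long size = source.size();
            // transferTo may move fewer bytes than requested, so loop until done.
            while (position < size) {
                position += source.transferTo(position, size - position, target);
            }
            return position;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("segment", ".log");
        try {
            Files.write(file, "hello kafka".getBytes(StandardCharsets.UTF_8));
            // In-memory sink (assumption) in place of a real SocketChannel.
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            long sent = sendFile(file, Channels.newChannel(sink));
            System.out.println(sent + " bytes sent"); // prints 11 bytes sent
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The application never copies the file contents into its own buffer; it only tells the channel which byte range to transfer.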

2. Log Segment Storage

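Each partition's log is split into segments, and each segment's .log file is limited to 1 GB by default (the log.segment.bytes setting), which keeps individual files small enough to index and manage efficiently. When the active segment reaches the size limit, Kafka rolls over to a new active segment. Each segment's files are named after the offset of the first message they contain, zero-padded to 20 digits: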

00000000000000000000.index
00000000000000000000.log
00000000000000000000.timeindex

00000000000005367851.index
00000000000005367851.log
00000000000005367851.timeindex

00000000000009936472.index
00000000000009936472.log
00000000000009936472.timeindex
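Because each file name in the listing is the segment's base offset, finding the segment that holds a given offset reduces to a floor lookup over the base offsets. The sketch below illustrates that idea with a sorted set; the class and method names are hypothetical and this is not Kafka's internal code.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.TreeSet;

// Locates the segment file that contains a given message offset.
final class SegmentLookupSketch {
    // Segment files are named after their base offset, zero-padded to 20 digits.
    static String fileName(long baseOffset) {
        return String.format(Locale.ROOT, "%020d.log", baseOffset);
    }

    // Picks the segment with the greatest base offset that is <= the requested offset.
    static String segmentFor(TreeSet<Long> baseOffsets, long offset) {
        Long base = baseOffsets.floor(offset);
        if (base == null) {
            throw new IllegalArgumentException("offset precedes the first segment: " + offset);
        }
        return fileName(base);
    }

    public static void main(String[] args) {
        // Base offsets taken from the listing above.
        TreeSet<Long> baseOffsets = new TreeSet<>(Arrays.asList(0L, 5367851L, 9936472L));
        System.out.println(segmentFor(baseOffsets, 6000000L)); // prints 00000000000005367851.log
    }
}
```

Within the chosen segment, the matching .index file then narrows the search to a position inside the .log file.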

3. Kafka Network Design

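Clients first connect to an Acceptor thread, which hands each new connection to one of a pool of Processor threads (3 by default, set by num.network.threads) in round-robin fashion. Processors read requests and place them on a request queue; a pool of request handler threads (8 by default, set by num.io.threads) takes requests from the queue, performs the reads and writes to disk, and returns responses through the Processors. Together these form a three-layer reactor model.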

Increasing the number of processors and thread‑pool workers can improve throughput.
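The three layers can be approximated in a few lines with a standard thread pool. The sketch below is a heavily simplified model, assuming string requests and no real sockets; the constants mirror the defaults quoted above, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Simplified acceptor -> processor -> handler pipeline; no networking involved.
final class ReactorSketch {
    static final int NUM_PROCESSORS = 3; // processor threads, as quoted above
    static final int NUM_HANDLERS = 8;   // handler pool size, as quoted above

    // Acceptor step: choose a processor for the n-th accepted connection, round-robin.
    static int processorFor(int connectionIndex) {
        return connectionIndex % NUM_PROCESSORS;
    }

    // Runs each request on the handler pool and returns responses in request order.
    static List<String> handleAll(List<String> requests)
            throws InterruptedException, ExecutionException {
        ExecutorService handlers = Executors.newFixedThreadPool(NUM_HANDLERS);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (int i = 0; i < requests.size(); i++) {
                final int processor = processorFor(i);
                final String request = requests.get(i);
                pending.add(handlers.submit(() -> "processor-" + processor + " handled " + request));
            }
            List<String> responses = new ArrayList<>();
            for (Future<String> future : pending) {
                responses.add(future.get());
            }
            return responses;
        } finally {
            handlers.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        for (String response : handleAll(Arrays.asList("r0", "r1", "r2", "r3"))) {
            System.out.println(response);
        }
    }
}
```

Separating connection handling from disk work means a slow request occupies one handler thread without blocking new connections.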

Conclusion

This overview explains Kafka's core concepts, cluster management, performance tricks, and network architecture, providing a solid foundation for further exploration.
