Kafka Basics: Topics, Partitions, Producers, Consumers, and Cluster Architecture
This article provides a comprehensive introduction to Kafka, covering its role as a message system, core concepts such as topics, partitions, producers, consumers, messages, the cluster architecture with replicas and controllers, performance optimizations, log segmentation, and network design, all illustrated with diagrams and code examples.
Preface
At the request of many readers, here is a light‑hearted introduction to Kafka before diving into Yarn.
1. Kafka Basics
Message System Role
Think of a message system as a warehouse: it temporarily stores data between producers and consumers and decouples the two, much as a storage container holds oil until someone needs it.
It behaves like a cache, but the data is persisted on disk rather than held only in memory.
1. Topic
A Kafka topic is analogous to a relational database table; it is a logical grouping of messages.
To retrieve data from a specific source, simply subscribe to the corresponding topic (e.g., TopicA).
2. Partition
A topic is split into partitions, which are distributed across multiple brokers. On disk, each partition is a directory named <topic>-<partition> (for example, TopicA-0) that stores its data in .log files. As with database partitioning, this improves performance through parallelism.
Multiple partitions enable multiple threads to process data concurrently, which is far faster than single‑threaded processing.
Note: each partition can have replicas to avoid single‑point failures, and partition numbering starts from 0.
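Which partition a message lands in is decided on the producer side. The following is a minimal sketch, assuming the common scheme of hashing the message key modulo the partition count. It is a simplification: Kafka's default partitioner hashes the serialized key with murmur2, whereas this sketch uses String.hashCode, and the names PartitionSketch and partitionFor are hypothetical.

```java
final class PartitionSketch {
    // Map a message key to a partition index in [0, numPartitions).
    // Assumption: String.hashCode stands in for Kafka's murmur2 hash.
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"user-1", "user-2", "user-3"}) {
            System.out.println(key + " -> partition " + partitionFor(key, 3));
        }
    }
}
```

Because the mapping depends only on the key and the partition count, messages with the same key always go to the same partition, which preserves their relative order.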
3. Producer
Producers send data into the message system.
4. Consumer
Consumers read data from Kafka.
5. Message
The unit of data processed by Kafka is called a message.
2. Kafka Cluster Architecture
Creating a topic with three partitions distributes each partition across different brokers. The topic itself remains a logical concept.
Older Kafka versions (< 0.8) lack replication, which can cause data loss on broker failure.
Replica
Each partition can have multiple replicas for fault tolerance. Typically two replicas are sufficient.
One replica acts as the leader: producers write to the leader, and the followers replicate from it. By default, consumers also read from the leader.
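The leader and follower roles can be pictured with a small in-memory model. This is only a sketch under simplifying assumptions (no network, no acknowledgements, no failure handling), and ReplicaSketch with its methods is a hypothetical name, not part of Kafka.

```java
import java.util.ArrayList;
import java.util.List;

final class ReplicaSketch {
    // The replica's local copy of the partition log.
    private final List<String> log = new ArrayList<>();

    // Producers call this on the leader only; returns the new message's offset.
    long append(String message) {
        log.add(message);
        return log.size() - 1;
    }

    // A follower pulls every message it is missing, starting at its own log end.
    void fetchFrom(ReplicaSketch leader) {
        for (int offset = log.size(); offset < leader.log.size(); offset++) {
            log.add(leader.log.get(offset));
        }
    }

    // Snapshot of the replica's contents, in offset order.
    List<String> contents() {
        return new ArrayList<>(log);
    }
}
```

After a follower calls fetchFrom on the leader, both replicas hold the same messages at the same offsets, which is what allows a follower to take over if the leader fails.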
Consumer Group
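Consumers belong to a group identified by group.id. Set it explicitly: recent Java clients require it in order to subscribe to topics, although some tools, such as the console consumer, generate one when it is missing.
conf.setProperty("group.id", "tellYourDream")
Within a group, only one consumer can read a given partition, which prevents duplicate consumption. Different groups can consume the same topic independently. In the example below, consumerA and consumerB share group a, while consumerC and consumerD share group b: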
consumerA:
group.id = a
consumerB:
group.id = a
consumerC:
group.id = b
consumerD:
group.id = b
Thus, a consumer group enables parallel consumption without overlap.
Within a group, each partition is consumed by exactly one consumer, but a single consumer can handle several partitions when the group has fewer consumers than the topic has partitions.
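The one-consumer-per-partition rule can be illustrated with a simple round-robin assignment. This is a sketch for illustration, not Kafka's actual assignor logic (Kafka ships several assignment strategies); GroupAssignmentSketch and assign are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupAssignmentSketch {
    // Spread partitions 0..numPartitions-1 over the consumers of ONE group.
    // Each partition gets exactly one owner; a consumer may own several.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        if (consumers.isEmpty()) {
            return assignment;
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Group a: two consumers share three partitions.
        System.out.println(assign(Arrays.asList("consumerA", "consumerB"), 3));
        // prints {consumerA=[0, 2], consumerB=[1]}
        // Group b is assigned independently and covers all partitions again.
        System.out.println(assign(Arrays.asList("consumerC", "consumerD"), 3));
        // prints {consumerC=[0, 2], consumerD=[1]}
    }
}
```

With three partitions and two consumers, one consumer ends up owning two partitions, and each group receives its own complete copy of the topic's data.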
Controller
Kafka follows a master‑slave architecture; the controller is the master node that coordinates with ZooKeeper.
Kafka and ZooKeeper Coordination
All brokers register with ZooKeeper on startup, and one of them becomes the controller: the first broker to create the ephemeral /controller node wins the election. The controller watches ZooKeeper paths (e.g., /brokers/) to discover brokers and distributes metadata to them.
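One way to picture the election is as a race to create a single node: in ZooKeeper-based Kafka this is the ephemeral /controller znode, and whichever broker creates it first becomes the controller. The sketch below uses an in-memory map as a stand-in for ZooKeeper; ControllerElectionSketch and its methods are hypothetical names, and no real ZooKeeper API is involved.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class ControllerElectionSketch {
    // Stand-in for ZooKeeper: path -> id of the broker that created the node.
    private final ConcurrentMap<String, Integer> nodes = new ConcurrentHashMap<>();

    // Every broker tries to create /controller; only the first attempt succeeds.
    boolean tryBecomeController(int brokerId) {
        return nodes.putIfAbsent("/controller", brokerId) == null;
    }

    // Models the ephemeral node disappearing when the controller broker dies.
    // The node is removed only if it belongs to the broker that died.
    void brokerDied(int brokerId) {
        nodes.remove("/controller", brokerId);
    }

    // Id of the current controller, or null if there is none.
    Integer currentController() {
        return nodes.get("/controller");
    }
}
```

When the controller's node disappears, the remaining brokers race again and one of them takes over.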
When a new topic is created, the controller creates corresponding directories under ZooKeeper, propagates the partition plan to all brokers, and each broker creates its local partition directories.
Additional Topics
1. Why Kafka Performs Well
① Sequential Writes
Kafka only appends to the end of its log files. Sequential disk writes avoid costly seek operations, so sequential disk throughput can approach the speed of random memory access.
② Zero‑Copy
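Kafka uses the Linux sendfile system call, exposed in Java NIO as FileChannel.transferTo, to move data from the log file to the network socket without copying it through the application's user‑space buffers. This eliminates extra copies and context switches.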
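The Java side of this technique can be sketched with FileChannel.transferTo, which is a standard NIO call. This is a minimal illustration, not Kafka's code: ZeroCopySketch, sendFile, and roundTrip are hypothetical names, and the demo targets an in-memory channel, so it exercises the API but not the kernel-level zero-copy path that a socket channel can use on Linux.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ZeroCopySketch {
    // Send a whole file to a blocking target channel; returns bytes transferred.
    static long sendFile(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            long position = 0;
            long size = source.size();
            // transferTo may move fewer bytes than requested, so loop until done.
            while (position < size) {
                position += source.transferTo(position, size - position, target);
            }
            return position;
        }
    }

    // Demo: write text to a temp file, push it through sendFile, return what arrived.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("segment", ".log");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                ByteArrayOutputStream received = new ByteArrayOutputStream();
                sendFile(file, Channels.newChannel(received));
                return new String(received.toByteArray(), StandardCharsets.UTF_8);
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a broker-like setting the target would be a SocketChannel, in which case the data never needs to pass through a byte array in the application.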
2. Log Segment Storage
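A partition's log is split into segments. Each segment's .log file is limited to 1 GB by default (log.segment.bytes), which keeps individual files easy to load and manage. When the active segment reaches the size limit, Kafka rolls over to a new active segment. Each segment has a .log file plus matching .index and .timeindex files, named after the offset of the first message the segment contains.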
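Segment file names are the base offset of the segment (the offset of its first message), zero-padded to 20 digits. Finding the segment for a given offset therefore means finding the largest base offset that does not exceed it. The sketch below shows that lookup with a sorted map; SegmentLookupSketch and its methods are hypothetical names, not Kafka's internal classes.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class SegmentLookupSketch {
    // base offset -> segment file name, kept sorted by base offset
    private final TreeMap<Long, String> segments = new TreeMap<>();

    // Zero-pad the base offset to 20 digits, as in the segment file names.
    static String fileName(long baseOffset) {
        return String.format(Locale.ROOT, "%020d.log", baseOffset);
    }

    void addSegment(long baseOffset) {
        segments.put(baseOffset, fileName(baseOffset));
    }

    // The segment holding an offset has the largest base offset <= that offset.
    String segmentFor(long offset) {
        Map.Entry<Long, String> entry = segments.floorEntry(offset);
        if (entry == null) {
            throw new IllegalArgumentException("no segment contains offset " + offset);
        }
        return entry.getValue();
    }
}
```

The listing that follows shows a partition with three segments. With those base offsets (0, 5367851, and 9936472), segmentFor(6000000) would return 00000000000005367851.log.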
00000000000000000000.index
00000000000000000000.log
00000000000000000000.timeindex
00000000000005367851.index
00000000000005367851.log
00000000000005367851.timeindex
00000000000009936472.index
00000000000009936472.log
00000000000009936472.timeindex
3. Kafka Network Design
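Clients first connect to an Acceptor thread, which hands each new connection to one of a pool of Processor threads (3 by default) in round‑robin fashion. The Processors place incoming requests on a queue served by a pool of request handler threads (8 by default), which perform the reads and writes against disk and pass the responses back to the Processors to send to the clients. Together these form a three‑layer reactor model.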
Increasing the number of Processors (num.network.threads) and handler threads (num.io.threads) can improve throughput.
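The layering can be sketched in a few lines. This is a toy model under stated assumptions, not Kafka's networking code: there are no sockets, the Acceptor is reduced to a round-robin counter, and the handler pool is a plain fixed-size executor. ReactorSketch, accept, and handle are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ReactorSketch {
    private final int numProcessors;
    private int nextProcessor = 0;

    ReactorSketch(int numProcessors) {
        this.numProcessors = numProcessors;
    }

    // Acceptor layer: pick the processor for a new connection, round-robin.
    int accept() {
        int chosen = nextProcessor;
        nextProcessor = (nextProcessor + 1) % numProcessors;
        return chosen;
    }

    // Handler layer: run requests on a pool, return responses in request order.
    static List<String> handle(List<String> requests, int numHandlers) {
        ExecutorService handlers = Executors.newFixedThreadPool(numHandlers);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String request : requests) {
                pending.add(handlers.submit(() -> "response to " + request));
            }
            List<String> responses = new ArrayList<>();
            for (Future<String> future : pending) {
                responses.add(future.get());
            }
            return responses;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            handlers.shutdown();
        }
    }
}
```

With new ReactorSketch(3), successive calls to accept() return 0, 1, 2, 0, and so on, mirroring the default of three Processors. Calling handle(requests, 8) mirrors the default pool of eight handler threads.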
Conclusion
This overview explains Kafka's core concepts, cluster management, performance tricks, and network architecture, providing a solid foundation for further exploration.
Top Architect
