Big Data 13 min read

Understanding Kafka: Core Concepts, Architecture, and Performance Secrets

This article introduces Kafka's role as a message system, explains its fundamental components such as topics, partitions, producers, consumers, and replicas, and dives into cluster architecture, consumer groups, Zookeeper coordination, and performance optimizations like sequential writes, zero‑copy, log segmentation, and network design.

ITFLY8 Architecture Home
ITFLY8 Architecture Home
ITFLY8 Architecture Home
Understanding Kafka: Core Concepts, Architecture, and Performance Secrets

Preface

Based on many readers' requests, here is a light‑hearted introduction to Kafka before diving into Yarn.

1. Kafka Basics

Message System Role

Think of a message system as a warehouse that buffers data and decouples producers from consumers.

For example, if China Mobile, China Unicom, and China Telecom outsource log processing for user profiling, your system would receive those logs.

The message system acts as a simulated cache: it provides caching functionality, but the data is stored on disk rather than in memory.

1. Topic

Kafka adopts a database‑like design and introduces topics, which are analogous to tables in a relational database.

To retrieve data from China Mobile, you would simply subscribe to TopicA .

2. Partition

A partition is a directory under a topic, distributed across multiple brokers; each partition stores data in .log files, similar to database partitions, improving performance through parallelism.

Multiple partitions enable multiple threads to process data concurrently.

Conceptually, a topic is a logical entity, while a partition is the physical storage unit, akin to HBase tables and regions or HDFS blocks.

Note: Partitions can suffer single‑point failures, so replicas are configured. Partition numbering starts from 0.

3. Producer

Producers send data to the message system.

4. Consumer

Consumers read data from Kafka.

5. Message

Data stored in Kafka is called a message.

2. Kafka Cluster Architecture

Creating TopicA with three partitions distributed across different brokers illustrates that a topic is a logical concept and cannot be directly drawn as a physical unit.

Important: Versions prior to 0.8 lack replication, risking data loss on broker failure.

Replica

Each partition can have multiple replicas for fault tolerance; typically two replicas are sufficient.

One replica acts as the leader, while the others are followers. Producers write to the leader, and followers synchronize from it; consumers also read from the leader.

Consumer Group

Consumers belong to a group identified by group.id. If not set, Kafka assigns a default.

<ol><li><p>conf.setProperty("group.id","tellYourDream")</p></li></ol>

Only one consumer in a group can consume a particular partition's data; different groups can consume the same partition independently.

<ol><li><p>consumerA:</p></li><li><p>group.id = a</p></li><li><p>consumerB:</p></li><li><p>group.id = a</p></li><li><p>consumerC:</p></li><li><p>group.id = b</p></li><li><p>consumerD:</p></li><li><p>group.id = b</p></li></ol>

Thus, a consumer group enables parallel consumption without duplicate processing, while a single consumer can handle multiple partitions when under‑utilized.

Controller

Kafka follows a master‑slave architecture; the master node is called the controller, which works with Zookeeper to manage the cluster.

Kafka and Zookeeper Coordination

All brokers register themselves in Zookeeper at startup, triggering a simple election to select the controller.

The controller watches Zookeeper directories (e.g., /brokers/), reads broker metadata, and distributes it to other brokers.

When a new topic is created (e.g., /topics/TopicA), the controller detects the change, propagates the partition metadata, and instructs brokers to create the corresponding directories for replica storage.

Supplementary Topics

1. Why Kafka Performs Well

① Sequential Writes

Kafka writes data sequentially to disk, which is almost as fast as memory writes because it avoids costly seek operations inherent in random writes.

② Zero‑Copy

Kafka uses Linux's sendFile (NIO) to transfer data directly from disk to the network socket, eliminating extra memory copies and context switches.

2. Log Segment Storage

Each partition's .log file is limited to 1 GB to facilitate loading into memory. When a segment reaches this size, Kafka rolls over to a new segment (log rolling). The active segment is the one currently being written.

<ol><li><p>00000000000000000000.index</p></li><li><p>00000000000000000000.log</p></li><li><p>00000000000000000000.timeindex</p></li><li><p>00000000000005367851.index</p></li><li><p>00000000000005367851.log</p></li><li><p>00000000000005367851.timeindex</p></li><li><p>00000000000009936472.index</p></li><li><p>00000000000009936472.log</p></li><li><p>00000000000009936472.timeindex</p></li></ol>

3. Kafka Network Design

Clients first connect to an Acceptor, which forwards requests to a pool of processor threads (default three). Processors place requests into a queue; a thread pool (default eight threads) handles the actual I/O, writing to disk for produce requests and returning data for fetch requests. This three‑layer network model can be tuned by increasing processor count and thread pool size.

Finally

The cluster setup will be covered later. This article provides a simple overview of Kafka's roles and design; future updates will delve deeper.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

performanceKafkaMessage Queue
ITFLY8 Architecture Home
Written by

ITFLY8 Architecture Home

ITFLY8 Architecture Home - focused on architecture knowledge sharing and exchange, covering project management and product design. Includes large-scale distributed website architecture (high performance, high availability, caching, message queues...), design patterns, architecture patterns, big data, project management (SCRUM, PMP, Prince2), product design, and more.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.