
Mastering Kafka: Core Concepts, Architecture, and High‑Performance Deployment

This comprehensive guide explains Kafka's role as a message system, detailing topics, partitions, producers, consumers, replication, controller, ZooKeeper coordination, performance optimizations like sequential writes and zero‑copy, and practical recommendations for hardware, configuration, and cluster deployment.

Open Source Linux

Kafka Basics

Kafka is a messaging system that acts like a warehouse between producers and consumers: it buffers data in transit and decouples the two sides from each other.

Message System Role

Kafka stores data on disk rather than in memory, yet it serves as a cache for intermediate processing between systems.

Topic and Partition

A topic in Kafka is analogous to a table in a relational database, while partitions are similar to HBase regions, distributing data across multiple servers for scalability and performance.

Partitions are stored as directories on servers, with data kept in .log files; multiple partitions enable parallel processing.
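As a minimal sketch of how keyed records can be spread across partitions, the snippet below hashes each key and takes the result modulo the partition count. The CRC32 hash is a stand-in chosen for illustration and is not the hash Kafka's own clients use; `choose_partition` and `distribute` are hypothetical helpers, not Kafka APIs.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index (illustrative hash, not Kafka's)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key) % num_partitions


def distribute(keys, num_partitions):
    """Group keys by the partition they would land in."""
    layout = {p: [] for p in range(num_partitions)}
    for key in keys:
        layout[choose_partition(key, num_partitions)].append(key)
    return layout
```

Because the same key always maps to the same partition, records for one key stay in order within that partition while different keys can be processed in parallel.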

Producer and Consumer

Producers send data to Kafka, and consumers read data from Kafka.

Message

The unit of data processed in Kafka is called a message.

Kafka Cluster Architecture

Each topic can have multiple partitions, each replicated across brokers for fault tolerance.

Replica

Partitions have replicas; one replica is elected as the leader, while others are followers that synchronize data from the leader.
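The leader/follower relationship can be modeled roughly as follows. This is a simplified in-memory sketch, not Kafka's replication protocol: the class names are hypothetical, the first replica is assumed to start as leader, and a follower is promoted only if it has fully caught up.

```python
class Replica:
    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.log = []


class Partition:
    def __init__(self, broker_ids):
        self.replicas = [Replica(b) for b in broker_ids]
        self.leader = self.replicas[0]  # assumption: first replica starts as leader

    def followers(self):
        return [r for r in self.replicas if r is not self.leader]

    def append(self, message):
        """Writes go to the leader only."""
        self.leader.log.append(message)

    def sync_followers(self):
        """Each follower copies whatever it is missing from the leader."""
        for follower in self.followers():
            follower.log.extend(self.leader.log[len(follower.log):])

    def fail_leader(self):
        """Drop the leader and promote a follower that has the full log."""
        dead = self.leader
        self.replicas.remove(dead)
        caught_up = [r for r in self.replicas if len(r.log) == len(dead.log)]
        if not caught_up:
            raise RuntimeError("no fully synchronized follower available")
        self.leader = caught_up[0]
```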

Consumer Group

Consumers belong to a consumer group identified by group.id; only one consumer in a group reads a given partition, enabling parallel consumption without duplicate processing.

conf.setProperty("group.id", "tellYourDream")
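To illustrate the "one consumer per partition within a group" rule, here is a small round-robin assignment sketch. Round-robin is only one possible strategy and is assumed here for simplicity; `assign_partitions` is a hypothetical helper, not part of any Kafka client.

```python
def assign_partitions(partitions, consumers):
    """Give every partition exactly one owner within the group (round-robin)."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment
```

With more consumers than partitions, the extra consumers receive an empty list, which is why adding consumers beyond the partition count does not increase parallelism.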

Controller

The controller node, elected via ZooKeeper, manages the cluster, monitors broker registrations, and distributes metadata.

Kafka and ZooKeeper Coordination

All brokers register with ZooKeeper, which stores metadata such as topics and partitions. The controller watches ZooKeeper directories to synchronize cluster state.
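The election idea can be sketched with an in-memory stand-in for ZooKeeper: every broker tries to create the same ephemeral node, and only the first attempt succeeds. `FakeZooKeeper` and `elect_controller` are hypothetical names for illustration; a real deployment would go through a ZooKeeper client rather than a dictionary.

```python
class FakeZooKeeper:
    """In-memory stand-in for ZooKeeper; not a real client."""

    def __init__(self):
        self.nodes = {}

    def create_ephemeral(self, path, owner):
        """Succeed only if the path does not exist yet (mimics an atomic create)."""
        if path in self.nodes:
            return False
        self.nodes[path] = owner
        return True

    def session_expired(self, owner):
        """Ephemeral nodes disappear when their owner's session ends."""
        self.nodes = {p: o for p, o in self.nodes.items() if o != owner}


def elect_controller(zk, broker_ids):
    """Every broker races to create /controller; the first one wins."""
    for broker_id in broker_ids:
        zk.create_ephemeral("/controller", broker_id)
    return zk.nodes["/controller"]
```

When the controller's session ends, its node vanishes and the remaining brokers can race again, which is how a replacement controller would be chosen in this model.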

Performance Highlights

Sequential Write

Kafka appends data to disk sequentially. Because sequential writes avoid most disk seeks, their throughput can approach that of memory access.

Zero‑Copy

Kafka uses the Linux sendfile system call to transfer data from files (via the OS page cache) directly to network sockets, eliminating the extra copies through user-space memory.
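The same operating-system facility is reachable from Python, which makes for a compact illustration. The sketch below sends a temporary file across a local socket pair with `socket.sendfile`, which uses `os.sendfile` where the platform supports it and falls back to ordinary `send()` otherwise. It demonstrates the technique only; the file contents and helper names are assumptions, and this is not Kafka's code.

```python
import os
import socket
import tempfile


def send_file(path, sock):
    """Send a whole file over a socket; returns the number of bytes sent."""
    with open(path, "rb") as f:
        # No read() into a Python buffer: the kernel moves the bytes itself
        # when os.sendfile is available on this platform.
        return sock.sendfile(f)


def demo(payload=b"kafka-record\n" * 100):
    """Write payload to a temp file, ship it across a socket pair, return what arrived."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    sender, receiver = socket.socketpair()
    try:
        sent = send_file(path, sender)
        sender.close()  # signals end-of-stream to the receiver
        chunks = []
        while True:
            chunk = receiver.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
        return sent, b"".join(chunks)
    finally:
        sender.close()
        receiver.close()
        os.remove(path)
```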

Log Segment Storage

Each partition's log is split into segment files, each limited to 1 GB by default so that a single file stays easy to load and manage; when the active segment is full, a new segment is created (log rolling).
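Log rolling can be sketched as a list of segments, each remembering the offset of its first record. The class below is a hypothetical in-memory model that tracks sizes only and writes no files; the file-name pattern (base offset zero-padded to 20 digits plus `.log`) follows the naming Kafka uses for segment files.

```python
class SegmentedLog:
    def __init__(self, max_segment_bytes):
        self.max_segment_bytes = max_segment_bytes
        self.next_offset = 0
        self.segments = []  # each entry: [base_offset, size_in_bytes]
        self._roll()

    def _roll(self):
        """Start a new active segment beginning at the next offset."""
        self.segments.append([self.next_offset, 0])

    def append(self, record: bytes) -> int:
        """Append a record, rolling first if it would overflow the active segment."""
        active = self.segments[-1]
        if active[1] > 0 and active[1] + len(record) > self.max_segment_bytes:
            self._roll()
            active = self.segments[-1]
        active[1] += len(record)
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def file_names(self):
        """Segment files are named after the offset of their first record."""
        return ["%020d.log" % base for base, _ in self.segments]
```

Naming each file by its base offset means the segment holding a given offset can be located by comparing file names, without opening any file.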

Network Design

Clients connect to an acceptor thread, which hands connections to processor (network) threads (default 3). The processors pass requests to a separate pool of I/O threads (default 8) that does the actual work, enabling high concurrency.
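A rough simulation of this threading layout is shown below, with plain queues in place of sockets. The calling thread plays the acceptor, `num_processors` threads forward requests to a shared queue, and `num_io_threads` workers handle them. The function name, the tuple-based "connections", and the upper-casing "work" are all assumptions made for illustration.

```python
import queue
import threading


def run_broker(connections, num_processors=3, num_io_threads=8):
    """Push each (conn_id, payload) through acceptor -> processor -> I/O pool."""
    processor_queues = [queue.Queue() for _ in range(num_processors)]
    request_queue = queue.Queue()
    responses = {}
    lock = threading.Lock()

    def processor(q):
        # Forward requests from assigned connections to the shared request queue.
        while True:
            item = q.get()
            if item is None:
                break
            request_queue.put(item)

    def io_handler():
        # Do the work for one request at a time and record the response.
        while True:
            item = request_queue.get()
            if item is None:
                break
            conn_id, payload = item
            with lock:
                responses[conn_id] = payload.upper()

    processors = [threading.Thread(target=processor, args=(q,)) for q in processor_queues]
    handlers = [threading.Thread(target=io_handler) for _ in range(num_io_threads)]
    for t in processors + handlers:
        t.start()

    # Acceptor role: hand connections to processors in round-robin order.
    for i, conn in enumerate(connections):
        processor_queues[i % num_processors].put(conn)

    # Shut down: stop processors first, then the I/O pool.
    for q in processor_queues:
        q.put(None)
    for t in processors:
        t.join()
    for _ in handlers:
        request_queue.put(None)
    for t in handlers:
        t.join()
    return responses
```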

Production Cluster Deployment

For a workload of 1 billion records per day with peak 60 k records/s, the design recommends:

5 physical machines, each with ~56 TB of storage (3‑day retention needs ~276 TB in total, about 55 TB per machine).

Disks: mechanical SAS disks are sufficient because Kafka writes sequentially; SSDs mainly benefit random‑access workloads and are optional.

Memory: ~64 GB per node, allocating ~10 GB to the JVM and the rest to OS cache.

CPU: 16‑32 cores per node to handle hundreds of broker threads.

Network: 1 Gbps is sufficient but 10 Gbps is preferable for high‑throughput replication.
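The sizing figures above can be cross-checked with a few lines of arithmetic. The helper names are hypothetical, and the inputs are only the numbers quoted in this section (1 billion records per day, a 60 k records/s peak, ~276 TB over 3 days, 5 machines).

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(records_per_day):
    """Average records per second if traffic were perfectly even."""
    return records_per_day / SECONDS_PER_DAY


def per_node_storage_tb(total_tb, nodes):
    """Storage each node must hold if data is spread evenly."""
    return total_tb / nodes


avg = average_rate(1_000_000_000)        # about 11,574 records/s
peak_ratio = 60_000 / avg                # peak is about 5.2x the average
per_day_tb = 276 / 3                     # 92 TB written per retained day
per_node = per_node_storage_tb(276, 5)   # 55.2 TB, rounded up to ~56 TB
```

The gap between the average and the peak rate is the reason the cluster is sized for the peak rather than for the daily mean.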

Key Configuration Parameters

broker.id: Unique ID for each broker.

log.dirs: Directories for storing log files; can span multiple disks.

zookeeper.connect: ZooKeeper connection string.

listeners: Address and port on which the broker accepts client connections (default port 9092).

num.network.threads and num.io.threads: Thread counts for network and I/O processing.

unclean.leader.election.enable: Whether a replica that is not in sync may be elected leader; enabling it favors availability at the risk of data loss.

log.retention.hours: Retention period for log data.

min.insync.replicas: Minimum number of in‑sync replicas that must acknowledge a write when the producer uses acks=all (acks=-1).
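How min.insync.replicas interacts with the producer's acks setting can be expressed as a small decision function. This is a simplified model with a hypothetical function name, assuming the rule that the minimum is enforced only for acks=all (acks=-1), while other settings need just a live leader.

```python
def write_accepted(acks, in_sync_replicas, min_insync_replicas):
    """Return True if a produce request would be accepted in this simplified model."""
    if acks in ("all", -1):
        # All in-sync replicas must acknowledge, and there must be enough of them.
        return in_sync_replicas >= min_insync_replicas
    # acks=0 or acks=1: only the leader (one in-sync replica) is required.
    return in_sync_replicas >= 1
```

For example, with a replication factor of 3 and min.insync.replicas=2, writes with acks=all continue after one replica falls behind but are rejected once only the leader remains in sync.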

Basic Commands

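# Create a topic with 2 partitions and replication factor 2 (Kafka 3.0+ dropped --zookeeper; use --bootstrap-server localhost:9092 there)
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 2 --topic tellYourDream
# List all topics
bin/kafka-topics.sh --list --zookeeper localhost:2181
# Start a console producer (newer releases use --bootstrap-server instead of --broker-list)
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
# Consume the topic from the earliest available offset
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning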

Performance Testing

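# Producer test: 500,000 records of 200 bytes each, unthrottled (--throughput -1), acks=-1 (same as acks=all)
bin/kafka-producer-perf-test.sh --topic test-topic --num-records 500000 --record-size 200 --throughput -1 --producer-props bootstrap.servers=hadoop03:9092,hadoop04:9092,hadoop05:9092 acks=-1
# Consumer test: read 500,000 messages from the same topic
bin/kafka-consumer-perf-test.sh --broker-list hadoop03:9092,hadoop04:9092,hadoop05:9092 --fetch-size 2000 --messages 500000 --topic test-topic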

Management Tools

KafkaManager

A Scala‑based web UI for managing multiple Kafka clusters, monitoring topics, brokers, partitions, and performing administrative actions such as creating topics, adding partitions, and reassigning replicas.

KafkaOffsetMonitor

A Java tool for monitoring consumer lag and offset information.

Original article: https://juejin.cn/post/6844904001989771278
