Mastering Kafka: Core Concepts, Architecture, and High‑Performance Deployment
This comprehensive guide explains Kafka's role as a message system, detailing topics, partitions, producers, consumers, replication, controller, ZooKeeper coordination, performance optimizations like sequential writes and zero‑copy, and practical recommendations for hardware, configuration, and cluster deployment.
Kafka Basics
Kafka serves as a message system that acts like a warehouse, providing caching and decoupling between producers and consumers.
Message System Role
Kafka persists messages to disk rather than holding them only in memory, yet it still acts as a buffer for data passing between processing stages.
Topic and Partition
A topic in Kafka is analogous to a table in a relational database, while partitions are similar to HBase regions, distributing data across multiple servers for scalability and performance.
Partitions are stored as directories on servers, with data kept in .log files; multiple partitions enable parallel processing.
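As a rough illustration of how records spread across partitions, the sketch below maps a key to a partition index and builds the partition directory name in the `<topic>-<partition>` form Kafka uses on disk. The class and method names are hypothetical, and `Arrays.hashCode` is only a stand-in for the hash a real Kafka client applies to keys.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PartitionSketch {
    // Maps a record key to a partition index in [0, numPartitions).
    // Assumption: Arrays.hashCode stands in for the hash used by a real client.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Math.floorMod(Arrays.hashCode(keyBytes), numPartitions);
    }

    // Partition directories are named "<topic>-<partition>".
    static String partitionDir(String topic, int partition) {
        return topic + "-" + partition;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"user-1", "user-2", "user-3"}) {
            int partition = partitionFor(key, 2);
            System.out.println(key + " -> " + partitionDir("tellYourDream", partition));
        }
    }
}
```

The same key always lands in the same partition, which is what preserves per-key ordering while different keys are processed in parallel.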
Producer and Consumer
Producers send data to Kafka, and consumers read data from Kafka.
Message
The unit of data processed in Kafka is called a message.
Kafka Cluster Architecture
Each topic can have multiple partitions, each replicated across brokers for fault tolerance.
Replica
Partitions have replicas; one replica is elected as the leader, while others are followers that synchronize data from the leader.
Consumer Group
Consumers belong to a consumer group identified by group.id; only one consumer in a group reads a given partition, enabling parallel consumption without duplicate processing.
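The one-consumer-per-partition rule can be pictured with a small simulation. This is a simplified round-robin assignment written for illustration only; it is not one of Kafka's actual assignor implementations, and the class name and consumer names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupAssignmentSketch {
    // Spreads partitions 0..numPartitions-1 over the consumers of one group.
    // Each partition gets exactly one owner within the group.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Two consumers share four partitions.
        System.out.println(assign(List.of("consumer-a", "consumer-b"), 4));
        // Three consumers, two partitions: one consumer receives nothing.
        System.out.println(assign(List.of("consumer-a", "consumer-b", "consumer-c"), 2));
    }
}
```

The second call shows why adding more consumers than partitions to a group does not increase parallelism: the surplus consumer stays idle.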
conf.setProperty("group.id", "tellYourDream")
Controller
The controller node, elected via ZooKeeper, manages the cluster, monitors broker registrations, and distributes metadata.
Kafka and ZooKeeper Coordination
All brokers register with ZooKeeper, which stores metadata such as topics and partitions. The controller watches ZooKeeper directories to synchronize cluster state.
Performance Highlights
Sequential Write
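Kafka appends data to the end of its log files instead of writing at random positions. Sequential appends avoid most disk seek time, so write throughput can approach memory speeds under favorable conditions.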
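A minimal sketch of the append-only idea, using only the Java standard library. It is not Kafka's storage code; the class name, the newline-delimited record format, and the temp-file demo are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

class AppendOnlyLogSketch {
    // Appends each record to the end of the file and returns the final file size in bytes.
    static long appendAll(Path logFile, List<String> records) {
        try (FileChannel channel = FileChannel.open(logFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            for (String record : records) {
                ByteBuffer buffer = ByteBuffer.wrap((record + "\n").getBytes(StandardCharsets.UTF_8));
                while (buffer.hasRemaining()) {
                    channel.write(buffer); // always lands at the current end of file
                }
            }
            return channel.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo: appends three 2-character records (plus newline each) to a temp file.
    static long demo() {
        try {
            Path logFile = Files.createTempFile("sketch-", ".log");
            long size = appendAll(logFile, List.of("m1", "m2", "m3"));
            Files.delete(logFile);
            return size;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes on disk: " + demo());
    }
}
```

Because every write goes to the tail of the file, the disk head (on mechanical drives) rarely has to move, which is the property the section above relies on.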
Zero‑Copy
Kafka uses the Linux sendfile system call to move data from the OS page cache straight to the network socket, which avoids copying the data through user-space buffers.
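In Java, this kind of transfer is exposed through `FileChannel.transferTo`. The sketch below is a hedged illustration rather than Kafka's own code: the class and method names are hypothetical, and the demo sends to an in-memory channel, in which case the JDK falls back to ordinary copying. The kernel-level shortcut applies when the target is a channel such as a socket on a platform that supports it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ZeroCopySketch {
    // Sends the whole file to the target channel and returns the number of bytes sent.
    static long transferWhole(Path file, WritableByteChannel target) {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            long position = 0;
            long size = source.size();
            while (position < size) {
                // transferTo may move fewer bytes than requested, so loop until done.
                position += source.transferTo(position, size - position, target);
            }
            return position;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo: writes a small temp file and "sends" it to an in-memory channel.
    static String demo() {
        try {
            Path file = Files.createTempFile("segment-", ".log");
            Files.write(file, "hello kafka".getBytes(StandardCharsets.UTF_8));
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            transferWhole(file, Channels.newChannel(sink));
            Files.delete(file);
            return new String(sink.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In a real server the `target` would be a `SocketChannel` for the consumer's connection, so the application never reads the file contents into its own buffers.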
Log Segment Storage
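Each partition's log is split into segment files, each limited to 1 GB by default (log.segment.bytes), so that a single file stays small enough to load into memory easily. When the active segment is full, Kafka creates a new one (log rolling).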
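Log rolling can be sketched as a counter that starts a new segment once the size limit would be exceeded. The class below is a toy model, not Kafka's implementation; the only detail taken from Kafka is the file naming, where each segment is named after the offset of its first message, zero-padded to 20 digits. The tiny 100-byte limit in the demo is an assumption chosen to make rolling visible.

```java
import java.util.ArrayList;
import java.util.List;

class SegmentRollSketch {
    private final long maxSegmentBytes;
    private final List<String> segmentFiles = new ArrayList<>();
    private long activeSegmentBytes = 0;
    private long nextOffset = 0;

    SegmentRollSketch(long maxSegmentBytes) {
        this.maxSegmentBytes = maxSegmentBytes;
        roll(); // the first segment starts at offset 0
    }

    // Starts a new segment named after the offset of the next message.
    private void roll() {
        segmentFiles.add(String.format("%020d.log", nextOffset));
        activeSegmentBytes = 0;
    }

    // Appends one message, rolling first if it would not fit in the active segment.
    void append(int messageBytes) {
        if (activeSegmentBytes > 0 && activeSegmentBytes + messageBytes > maxSegmentBytes) {
            roll();
        }
        activeSegmentBytes += messageBytes;
        nextOffset++;
    }

    List<String> segmentFiles() {
        return segmentFiles;
    }

    // Demo: five 40-byte messages with a 100-byte segment limit.
    static List<String> demo() {
        SegmentRollSketch log = new SegmentRollSketch(100);
        for (int i = 0; i < 5; i++) {
            log.append(40);
        }
        return log.segmentFiles();
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

With two messages fitting per segment, the demo ends up with segments whose base offsets are 0, 2, and 4.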
Network Design
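Clients connect to an acceptor thread, which hands each connection to one of the processor threads (num.network.threads, default 3). The processors place incoming requests on a shared queue, and a pool of request handler threads (num.io.threads, default 8) picks them up and performs the actual I/O, which allows high concurrency.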
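The hand-off between stages can be approximated with two thread pools and a shared queue. This is a simplified model with no real sockets, not Kafka's network layer; the class name, the string "requests", and the `ack:` responses are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

class RequestPipelineSketch {
    // Pushes every request through processors -> request queue -> handlers
    // and returns the responses in sorted order.
    static List<String> serve(List<String> requests, int processorThreads, int ioThreads) {
        BlockingQueue<String> requestQueue = new LinkedBlockingQueue<>();
        Queue<String> responses = new ConcurrentLinkedQueue<>();
        CountDownLatch allHandled = new CountDownLatch(requests.size());
        ExecutorService processors = Executors.newFixedThreadPool(processorThreads);
        ExecutorService handlers = Executors.newFixedThreadPool(ioThreads);
        try {
            // Handler threads: take requests off the shared queue and "process" them.
            for (int i = 0; i < ioThreads; i++) {
                handlers.execute(() -> {
                    try {
                        while (true) {
                            String request = requestQueue.take();
                            responses.add("ack:" + request);
                            allHandled.countDown();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt(); // pool is shutting down
                    }
                });
            }
            // Acceptor role: hand each incoming request to a processor thread,
            // which enqueues it for the handlers.
            for (String request : requests) {
                processors.execute(() -> requestQueue.add(request));
            }
            allHandled.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            processors.shutdown();
            handlers.shutdownNow();
        }
        List<String> sorted = new ArrayList<>(responses);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(serve(List.of("fetch-1", "produce-1", "fetch-2"), 3, 8));
    }
}
```

Separating the threads that read from the network from the threads that do disk work means a slow request does not block new connections from being accepted.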
Production Cluster Deployment
For a workload of 1 billion records per day with peak 60 k records/s, the design recommends:
5 physical machines, each with ~56 TB storage (total ~276 TB for 3‑day retention).
Use SAS disks (mechanical) as sequential writes perform well; SSDs are optional for random‑access workloads.
Memory: ~64 GB per node, allocating ~10 GB to the JVM and the rest to OS cache.
CPU: 16‑32 cores per node to handle hundreds of broker threads.
Network: 1 Gbps is sufficient but 10 Gbps is preferable for high‑throughput replication.
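The storage figures can be sanity-checked with a back-of-the-envelope calculation. The text does not state the record size or the replication factor, so the 50 KB per record and replication factor of 2 used below are assumptions chosen purely for illustration; the class and method names are hypothetical.

```java
class CapacitySketch {
    // Total storage in TB for the retention window, across all replicas.
    static double totalStorageTb(long recordsPerDay, double recordKb,
                                 int replicationFactor, int retentionDays) {
        double tbPerDay = recordsPerDay * recordKb / (1024.0 * 1024.0 * 1024.0); // KB -> TB
        return tbPerDay * replicationFactor * retentionDays;
    }

    public static void main(String[] args) {
        // Assumed inputs: 50 KB per record, replication factor 2 (not from the text).
        double total = totalStorageTb(1_000_000_000L, 50.0, 2, 3);
        System.out.printf("total: %.0f TB, per node (5 nodes): %.0f TB%n", total, total / 5);
        // Average rate over 24 hours, for comparison with the 60 k records/s peak.
        System.out.printf("average rate: %d records/s%n", 1_000_000_000L / 86_400);
    }
}
```

Under these assumptions the total comes to roughly 280 TB, or about 56 TB per node across 5 machines, which is in line with the figures quoted above. The average rate is also far below the stated peak, so the cluster should be sized for the peak rather than the daily mean.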
Key Configuration Parameters
broker.id: Unique ID for each broker.
log.dirs: Directories for storing log files; can span multiple disks.
zookeeper.connect: ZooKeeper connection string.
listeners: Port for client connections (default 9092).
num.network.threads and num.io.threads: Thread counts for network and I/O processing.
unclean.leader.election.enable: Controls leader election safety.
log.retention.hours: Retention period for log data.
min.insync.replicas: Minimum number of replicas that must acknowledge writes.
Basic Commands
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 2 --topic tellYourDream
bin/kafka-topics.sh --list --zookeeper localhost:2181
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
Performance Testing
bin/kafka-producer-perf-test.sh --topic test-topic --num-records 500000 --record-size 200 --throughput -1 --producer-props bootstrap.servers=hadoop03:9092,hadoop04:9092,hadoop05:9092 acks=-1
bin/kafka-consumer-perf-test.sh --broker-list hadoop03:9092,hadoop04:9092,hadoop05:9092 --fetch-size 2000 --messages 500000 --topic test-topic
Management Tools
KafkaManager
A Scala‑based web UI for managing multiple Kafka clusters, monitoring topics, brokers, partitions, and performing administrative actions such as creating topics, adding partitions, and reassigning replicas.
KafkaOffsetMonitor
A Java tool for monitoring consumer lag and offset information.
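Consumer lag is usually defined per partition as the log end offset minus the group's committed offset. The sketch below computes it from two plain maps; it is not how KafkaOffsetMonitor itself is implemented, and the class name and sample offsets are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

class ConsumerLagSketch {
    // Lag per partition = log end offset - committed offset (never below zero).
    // A partition with no committed offset is treated as committed at 0.
    static Map<Integer, Long> lag(Map<Integer, Long> logEndOffsets,
                                  Map<Integer, Long> committedOffsets) {
        Map<Integer, Long> result = new TreeMap<>();
        for (Map.Entry<Integer, Long> entry : logEndOffsets.entrySet()) {
            long committed = committedOffsets.getOrDefault(entry.getKey(), 0L);
            result.put(entry.getKey(), Math.max(0L, entry.getValue() - committed));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Long> logEnd = Map.of(0, 100L, 1, 50L);
        Map<Integer, Long> committed = Map.of(0, 90L);
        System.out.println(lag(logEnd, committed));
    }
}
```

A lag that keeps growing over time indicates that the consumer group cannot keep up with the producers for that partition.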
Original article: https://juejin.cn/post/6844904001989771278
