How to Build a High‑Performance Kafka Production Cluster: Sizing, Config, and Best Practices
This guide explains how to design and deploy a Kafka production cluster—including capacity planning for 1 billion daily messages, hardware sizing, key configuration parameters, command‑line operations, and useful management tools—to achieve reliable high‑throughput streaming.
1. Kafka Production Cluster Deployment
Background: Assume the cluster must handle 1 billion messages per day. About 80% of the data (800 million) arrives within 16 hours, and 80% of that (640 million) arrives in the peak 3‑hour window. That is 640 million / (3 × 3 600 s) ≈ 59 000 messages per second, so plan for a peak of roughly 60 000 QPS.
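The peak-rate estimate can be reproduced with plain shell integer arithmetic. This is only a sketch of the reasoning in the paragraph above; the variable names are illustrative and not part of any Kafka tooling.

```shell
# Peak-rate estimate from the figures in the text (integer arithmetic only).
daily_messages=1000000000                      # 1 billion messages per day
busy_messages=$((daily_messages / 100 * 80))   # 80% arrive within 16 hours
peak_messages=$((busy_messages / 100 * 80))    # 80% of those arrive in the 3-hour peak
peak_seconds=$((3 * 3600))                     # length of the peak window in seconds
peak_qps=$((peak_messages / peak_seconds))     # integer division, fraction dropped

echo "busy-window messages: $busy_messages"
echo "peak-window messages: $peak_messages"
echo "peak rate: $peak_qps messages/s"
```

The result lands just under 60 000 messages per second, which is why the text rounds up to 60 000 QPS for planning.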
Disk space: Each message is ~50 KB, so daily raw data is 1 billion × 50 KB ≈ 46 TB (in binary units). With two replicas this becomes 92 TB, and keeping three days of data requires about 276 TB of storage.
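The storage figures follow the same pattern. The sketch below assumes binary units (1 TB = 1024³ KB) and a shell with 64-bit arithmetic; the variable names are illustrative.

```shell
# Storage estimate from the figures in the text.
daily_messages=1000000000     # 1 billion messages per day
message_kb=50                 # ~50 KB per message
replicas=2                    # two copies of every message
retention_days=3              # keep three days of data

daily_kb=$((daily_messages * message_kb))
daily_tb=$((daily_kb / 1024 / 1024 / 1024))    # KB -> MB -> GB -> TB, fractions dropped
replicated_tb=$((daily_tb * replicas))
retained_tb=$((replicated_tb * retention_days))

echo "raw data per day: ~$daily_tb TB"
echo "with $replicas replicas: ~$replicated_tb TB"
echo "retained for $retention_days days: ~$retained_tb TB"
```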
Hardware sizing:
QPS: A single physical machine can sustain 40‑50 k QPS; therefore 5‑7 machines (total 200‑300 k QPS) provide roughly 3‑5× headroom over the 60 k QPS peak.
Disk: Five servers storing 276 TB means roughly 55 TB per server; choose the number and size of disks accordingly.
SSD vs. SAS: Kafka writes sequentially, so high‑capacity SAS drives are acceptable for cost‑sensitive deployments, while SSDs are recommended for workloads with random I/O (e.g., MySQL).
Memory: Allocate ~10 GB JVM heap for Kafka and use the rest for OS cache. For 100 topics with 5 partitions each (500 partitions) and 2 replicas, about 50 GB of cache is sufficient; a 64 GB server is adequate.
CPU: 16‑core CPUs are typical; they can handle 100‑200 threads per broker. More cores (32) give extra headroom.
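Network: Peak inbound traffic is ~488 MB/s, which corresponds to about 10 000 messages per second at 50 KB each, roughly one broker's share of the peak. With replication the required bandwidth doubles to ~976 MB/s. 10 GbE (about 1 250 MB/s) covers this, whereas 1 GbE (about 125 MB/s) would be a bottleneck.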
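The broker-count, disk, and network checks above can be sketched in shell as well. One value here is an assumption rather than something the text states: the per-broker rate of 10 000 messages per second, chosen because it reproduces the ~488 MB/s figure. The helper link_verdict is a hypothetical function written for this example.

```shell
# Broker count, per-broker disk, and network checks using the figures in the text.
peak_qps=60000
per_machine_qps=40000                       # low end of the 40-50 k range
brokers=5
cluster_qps=$((brokers * per_machine_qps))
headroom=$((cluster_qps / peak_qps))        # integer part of about 3.3

retained_tb=276
per_broker_tb=$((retained_tb / brokers))    # integer part of 55.2

# ASSUMPTION: each broker receives about 10,000 messages/s at peak.
per_broker_msgs=10000
message_kb=50
inbound_mb=$((per_broker_msgs * message_kb / 1024))
replicated_mb=$((inbound_mb * 2))           # two replicas double the traffic

# Hypothetical helper: $1 = link speed in Gbit/s, $2 = required MB/s.
link_verdict() {
  link_mb=$(( $1 * 1000 / 8 ))              # Gbit/s -> MB/s, ignoring protocol overhead
  if [ "$2" -le "$link_mb" ]; then
    echo "sufficient"
  else
    echo "bottleneck"
  fi
}

echo "cluster capacity: $cluster_qps QPS (${headroom}x peak)"
echo "disk per broker: ~$per_broker_tb TB"
echo "network per broker: $inbound_mb MB/s in, $replicated_mb MB/s with replication"
echo "1 GbE: $(link_verdict 1 "$replicated_mb")"
echo "10 GbE: $(link_verdict 10 "$replicated_mb")"
```

Changing brokers or per_machine_qps shows how quickly the headroom shrinks if a machine sustains less than the assumed rate.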
2. Kafka Cluster Configuration
Edit server.properties (only this file is needed). Important settings include:
broker.id: unique non‑negative integer ID for each broker.
log.dirs: directories for log storage; multiple comma‑separated paths spread partitions across disks.
zookeeper.connect: ZooKeeper ensemble address (comma‑separated host:port list).
listeners: address and port the broker listens on for client connections (default port 9092).
num.network.threads (default 3) and num.io.threads (default 8): threads that handle network requests and threads that process requests, including disk I/O.
unclean.leader.election.enable: false (only replicas in the ISR can become leader).
delete.topic.enable: true (allow topic deletion).
log.retention.hours: retention period in hours (default 168, i.e. 7 days).
min.insync.replicas: minimum number of in‑sync replicas that must acknowledge a write when the producer uses acks=all (acks=-1).
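As an illustration, the snippet below writes a minimal server.properties containing the settings just listed into a scratch directory. Every value is a hypothetical placeholder: the ZooKeeper hosts (zk1, zk2, zk3), the /data1 and /data2 paths, and the 72-hour retention (chosen to match the three-day retention in the sizing example) are assumptions, not recommendations from the original article.

```shell
# Write an illustrative server.properties into a scratch directory.
conf_dir=$(mktemp -d)
conf_file="$conf_dir/server.properties"

cat > "$conf_file" <<'EOF'
# Hypothetical values for one broker; adjust every line for your environment.
broker.id=0
log.dirs=/data1/kafka-logs,/data2/kafka-logs
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
listeners=PLAINTEXT://:9092
num.network.threads=3
num.io.threads=8
unclean.leader.election.enable=false
delete.topic.enable=true
log.retention.hours=72
min.insync.replicas=1
EOF

# Quick sanity check: print the keys that were set, skipping comment lines.
cut -d= -f1 "$conf_file" | grep -v '^#'
```

Each broker needs its own copy of the file with a different broker.id; the other lines can usually be shared across brokers.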
3. Basic Kafka Cluster Operations
Typical commands (adjust hosts and topic names for your environment):
Create a topic:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 2 --topic tellYourDream
List topics:
bin/kafka-topics.sh --list --zookeeper localhost:2181
Produce messages:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
Consume messages:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
Performance test (produce 500 k records, 200 bytes each):
bin/kafka-producer-perf-test.sh --topic test-topic --num-records 500000 --record-size 200 --throughput -1 --producer-props bootstrap.servers=hadoop03:9092,hadoop04:9092,hadoop05:9092 acks=-1
Consume test:
bin/kafka-consumer-perf-test.sh --broker-list hadoop03:9092,hadoop04:9092,hadoop05:9092 --fetch-size 2000 --messages 500000 --topic test-topic
4. Management Tools
KafkaManager (Scala‑based) helps monitor brokers, topics, partitions, and perform administrative actions such as creating, deleting, or re‑partitioning topics.
KafkaOffsetMonitor provides consumer lag visibility; it is a simple JAR that can be started with a Java command.
Additional tools like MirrorMaker can be used for cross‑datacenter replication when needed.
Further articles will cover producer/consumer internals and common interview questions.
