How to Size a Kafka Cluster for Over 1 Billion Daily Requests
This article walks through a scenario‑driven capacity assessment for a production‑grade Kafka cluster, covering QPS calculations, storage needs, physical machine count, disk choices, memory, CPU, network bandwidth, deployment steps, and a final resource summary.
Scenario Analysis
The article uses a typical e‑commerce platform as a case study where the Kafka cluster must handle more than 1 billion request messages per day. By applying the 80/20 rule, 80% of the data (≈800 M messages) arrives in the remaining 16 hours, and 80% of that (≈640 M) arrives within 20% of the time (≈3 hours).
1. Daily request load
Peak‑hour QPS is calculated as 640 000 000 ÷ (3 × 60 × 60) ≈ 60 000 QPS . Assuming an average record size of 20 KB, the daily data volume is 1 000 000 000 × 20 KB = 18 TB . With a replication factor of 3, the stored data becomes 54 TB , and keeping three days of retention results in 162 TB of required storage.
Scenario Summary
To sustain >1 billion requests, the cluster must support ~60 k QPS and provide roughly 162 TB of storage capacity.
Physical Machine Count
For high‑performance services like Kafka, MySQL, and Hadoop, physical servers are preferred over virtual machines due to better stability and performance.
Machine count calculation
Based on the earlier analysis, a single physical machine can safely handle about 40 k QPS. To cover the required 60 k QPS, about 5 machines are needed; accounting for consumer load, the recommendation rises to 7 physical machines .
Disk Capacity
Kafka writes data sequentially, so ordinary mechanical HDDs are sufficient for the log storage, while SSDs are better for random‑access workloads like MySQL.
SSD: very fast random read/write, high cost, not ideal for large‑capacity storage.
HDD: low cost, suitable for large capacity, slower random access.
Each of the 7 machines should be equipped with 11 disks of ~2 TB usable capacity (3 TB raw per disk, reserving space for OS and performance headroom), giving a total of ≈23 TB per machine and meeting the 162 TB overall requirement.
Memory Requirements
Kafka relies heavily on the OS page cache; therefore, ample memory should be allocated to the cache. The JVM running Kafka typically needs 6–10 GB . To keep about 25% of the log data in memory, the calculation yields ≈375 GB total, which translates to ≈54 GB per machine . Adding JVM memory leads to a recommendation of 128 GB RAM per server .
CPU and Threading
Kafka’s performance depends on the number of threads, which map to CPU cores. The analysis estimates over 100 active threads per broker. To avoid CPU saturation, the guide suggests at least 16 CPU cores per machine (32 cores for better performance).
Network Card
Bandwidth calculations show a peak inbound traffic of ≈552 MB/s (including replication). A 1 GbE NIC can typically sustain ~700 Mbps of usable throughput, which is sufficient, while a 10 GbE NIC offers additional headroom.
Cluster Deployment
The recommended topology uses five Kafka brokers plus ZooKeeper, all running on CentOS 7.7. Kafka version 2.4.1 (Scala 2.11) is paired with ZooKeeper 3.5.7 or newer.
Download and Install
tar -zxvf kafka_2.11-2.4.1.tgz -C /usr/local
mv /usr/local/kafka_2.11-2.4.1 /usr/local/kafka
useradd kafka
chown -R kafka:kafka /usr/local/kafkaConfiguration (example)
broker.id=1
listeners=PLAINTEXT://172.16.213.31:9092
log.dirs=/usr/local/kafka/logs
num.partitions=6
log.retention.hours=72
log.segment.bytes=1073741824
zookeeper.connect=172.16.213.31:2181,172.16.213.32:2181,172.16.213.33:2181
auto.create.topics.enable=true
delete.topic.enable=true
num.network.threads=9
num.io.threads=32
message.max.bytes=10485760
log.flush.interval.message=10000
log.flush.interval.ms=1000
replica.lag.time.max.ms=10Start the Cluster
cd /usr/local/kafka
nohup bin/kafka-server-start.sh config/server.properties &
jpsThe jps command should list a Kafka process, confirming a successful start.
Overall Summary
To handle >1 billion daily requests, the evaluated Kafka deployment requires:
≈60 k QPS peak capacity
7 physical servers
Each server: 128 GB RAM, 16 CPU cores (32 cores optional), 11 × 2 TB HDDs
1 GbE NIC sufficient; 10 GbE recommended for future growth
These resources ensure the cluster can sustain the workload with safety margins for CPU, memory, disk, and network.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
