How to Size a Kafka Cluster for Over 1 Billion Daily Requests

This article walks through a scenario‑driven capacity assessment for a production‑grade Kafka cluster, covering QPS calculations, storage needs, physical machine count, disk choices, memory, CPU, network bandwidth, deployment steps, and a final resource summary.

dbaplus Community
dbaplus Community
dbaplus Community
How to Size a Kafka Cluster for Over 1 Billion Daily Requests

Scenario Analysis

The article uses a typical e‑commerce platform as a case study where the Kafka cluster must handle more than 1 billion request messages per day. By applying the 80/20 rule, 80% of the data (≈800 M messages) arrives in the remaining 16 hours, and 80% of that (≈640 M) arrives within 20% of the time (≈3 hours).

1. Daily request load

Peak‑hour QPS is calculated as 640 000 000 ÷ (3 × 60 × 60) ≈ 60 000 QPS . Assuming an average record size of 20 KB, the daily data volume is 1 000 000 000 × 20 KB = 18 TB . With a replication factor of 3, the stored data becomes 54 TB , and keeping three days of retention results in 162 TB of required storage.

Scenario Summary

To sustain >1 billion requests, the cluster must support ~60 k QPS and provide roughly 162 TB of storage capacity.

Physical Machine Count

For high‑performance services like Kafka, MySQL, and Hadoop, physical servers are preferred over virtual machines due to better stability and performance.

Machine count calculation

Based on the earlier analysis, a single physical machine can safely handle about 40 k QPS. To cover the required 60 k QPS, about 5 machines are needed; accounting for consumer load, the recommendation rises to 7 physical machines .

Disk Capacity

Kafka writes data sequentially, so ordinary mechanical HDDs are sufficient for the log storage, while SSDs are better for random‑access workloads like MySQL.

SSD: very fast random read/write, high cost, not ideal for large‑capacity storage.

HDD: low cost, suitable for large capacity, slower random access.

Each of the 7 machines should be equipped with 11 disks of ~2 TB usable capacity (3 TB raw per disk, reserving space for OS and performance headroom), giving a total of ≈23 TB per machine and meeting the 162 TB overall requirement.

Memory Requirements

Kafka relies heavily on the OS page cache; therefore, ample memory should be allocated to the cache. The JVM running Kafka typically needs 6–10 GB . To keep about 25% of the log data in memory, the calculation yields ≈375 GB total, which translates to ≈54 GB per machine . Adding JVM memory leads to a recommendation of 128 GB RAM per server .

CPU and Threading

Kafka’s performance depends on the number of threads, which map to CPU cores. The analysis estimates over 100 active threads per broker. To avoid CPU saturation, the guide suggests at least 16 CPU cores per machine (32 cores for better performance).

Network Card

Bandwidth calculations show a peak inbound traffic of ≈552 MB/s (including replication). A 1 GbE NIC can typically sustain ~700 Mbps of usable throughput, which is sufficient, while a 10 GbE NIC offers additional headroom.

Cluster Deployment

The recommended topology uses five Kafka brokers plus ZooKeeper, all running on CentOS 7.7. Kafka version 2.4.1 (Scala 2.11) is paired with ZooKeeper 3.5.7 or newer.

Download and Install

tar -zxvf kafka_2.11-2.4.1.tgz -C /usr/local
mv /usr/local/kafka_2.11-2.4.1 /usr/local/kafka
useradd kafka
chown -R kafka:kafka /usr/local/kafka

Configuration (example)

broker.id=1
listeners=PLAINTEXT://172.16.213.31:9092
log.dirs=/usr/local/kafka/logs
num.partitions=6
log.retention.hours=72
log.segment.bytes=1073741824
zookeeper.connect=172.16.213.31:2181,172.16.213.32:2181,172.16.213.33:2181
auto.create.topics.enable=true
delete.topic.enable=true
num.network.threads=9
num.io.threads=32
message.max.bytes=10485760
log.flush.interval.message=10000
log.flush.interval.ms=1000
replica.lag.time.max.ms=10

Start the Cluster

cd /usr/local/kafka
nohup bin/kafka-server-start.sh config/server.properties &
jps

The jps command should list a Kafka process, confirming a successful start.

Overall Summary

To handle >1 billion daily requests, the evaluated Kafka deployment requires:

≈60 k QPS peak capacity

7 physical servers

Each server: 128 GB RAM, 16 CPU cores (32 cores optional), 11 × 2 TB HDDs

1 GbE NIC sufficient; 10 GbE recommended for future growth

These resources ensure the cluster can sustain the workload with safety margins for CPU, memory, disk, and network.

Kafka write flow diagram
Kafka write flow diagram
Kafka high‑concurrency network architecture
Kafka high‑concurrency network architecture
Final resource table
Final resource table
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

performanceBackend DevelopmentKafkacapacity planningCluster Sizing
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.