How Kafka Achieves Billion‑Message Throughput with Sequential I/O, Zero‑Copy, and Partitioning
This article explains how Kafka leverages sequential disk I/O, zero‑copy data transfer, OS page cache, and partition‑based parallelism to deliver massive throughput, detailing the underlying mechanisms, performance numbers, and a practical formula for estimating total system capacity.
Sequential I/O (Sequential Disk Reads and Writes)
Kafka writes messages to its log in an append‑only fashion, allowing both writes and reads to be performed as sequential scans. Sequential I/O fully utilizes disk bandwidth, minimizing seek time and random‑access overhead, which can raise throughput to over 200 MB/s on HDDs. A single partition can sustain more than 200 k TPS, and multiple partitions can aggregate to millions of TPS on a single machine.
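The append-only pattern can be sketched as follows. This is a minimal illustration, not Kafka's actual on-disk segment format: each record here is just a 4-byte length prefix plus the payload, and the `AppendOnlyLog` class and file name are hypothetical.

```python
import os
import struct
import tempfile

class AppendOnlyLog:
    """Toy append-only log: writes always go to the tail of the file,
    so the device sees a purely sequential access pattern."""

    def __init__(self, path):
        self.f = open(path, "ab+")

    def append(self, payload: bytes) -> int:
        offset = self.f.seek(0, os.SEEK_END)        # always append at the end
        self.f.write(struct.pack(">I", len(payload)))  # 4-byte length prefix
        self.f.write(payload)
        return offset                                # byte offset identifies the record

    def scan(self):
        """Sequentially read every record from the start of the log."""
        self.f.seek(0)
        while True:
            header = self.f.read(4)
            if len(header) < 4:
                break
            (size,) = struct.unpack(">I", header)
            yield self.f.read(size)

path = os.path.join(tempfile.mkdtemp(), "00000000.log")
log = AppendOnlyLog(path)
for msg in [b"m1", b"m2", b"m3"]:
    log.append(msg)
print(list(log.scan()))  # [b'm1', b'm2', b'm3']
```

Because neither writes nor scans ever seek into the middle of the file, both map to sequential disk access, which is the pattern Kafka's log segments exploit.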
Zero‑Copy
Kafka reduces the cost of moving data between kernel and user space by using zero‑copy system calls such as sendfile(). The traditional path (disk → kernel read buffer → user buffer → kernel socket buffer → NIC) requires two extra copies through user space plus the accompanying context switches. sendfile() streams data directly from the page cache to the network socket, cutting out the user‑space copies entirely and significantly improving network throughput for large data transfers.
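A minimal sketch of the syscall on Linux, using Python's `os.sendfile()` (Kafka's broker reaches the same syscall via Java's `FileChannel.transferTo`). To stay self-contained, this example transfers between two regular files, which Linux sendfile also supports; the file names are placeholders.

```python
import os
import tempfile

# Prepare a source file standing in for a log segment.
src_path = tempfile.mktemp()
dst_path = tempfile.mktemp()
with open(src_path, "wb") as f:
    f.write(b"kafka log segment bytes" * 1000)

in_fd = os.open(src_path, os.O_RDONLY)
out_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT, 0o644)

offset = 0
remaining = os.fstat(in_fd).st_size
while remaining > 0:
    # Data flows from the page cache to the destination inside the
    # kernel; no intermediate user-space buffer is allocated or copied.
    sent = os.sendfile(out_fd, in_fd, offset, remaining)
    if sent == 0:
        break
    offset += sent
    remaining -= sent

os.close(in_fd)
os.close(out_fd)
print(os.path.getsize(dst_path) == os.path.getsize(src_path))  # True
```

In the broker the out_fd is a socket, so segment bytes go straight from page cache to the NIC.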
Page Cache
Kafka relies on the operating system's page cache rather than managing its own cache. Write operations land in the page cache and are flushed to disk asynchronously, while reads often hit the page cache directly, achieving hit rates above 95 % and latency close to memory speed. This high cache hit ratio reduces disk I/O pressure and improves overall service stability.
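The asymmetry between buffered writes and forced flushes can be demonstrated directly. This sketch (paths and sizes are arbitrary) shows that `write()` returns as soon as the data lands in the page cache, while `fsync()` is what actually pays the disk cost; Kafka leaves flushing to the OS by default rather than fsyncing every message.

```python
import os
import tempfile
import time

path = tempfile.mktemp()
payload = b"x" * 4096
N = 2000

fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)

t0 = time.perf_counter()
for _ in range(N):
    os.write(fd, payload)          # copies into the page cache and returns
buffered = time.perf_counter() - t0

t0 = time.perf_counter()
os.fsync(fd)                       # force the dirty pages out to disk
flushed = time.perf_counter() - t0
os.close(fd)

print(f"buffered writes: {buffered*1e3:.1f} ms, fsync: {flushed*1e3:.1f} ms")
print(os.path.getsize(path) == N * len(payload))  # True
```

On a typical machine the buffered loop runs at memory speed; the single fsync dominates, which is why deferring flushes keeps Kafka's write path fast.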
Partitioning (Shard‑Level Parallelism)
Kafka achieves horizontal scalability through partitioning. Each topic is split into multiple partitions, allowing producers and consumers to operate on different partitions concurrently. Brokers can process requests in parallel across multiple CPU cores and disks, distributing load evenly.
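Key-based routing is what lets producers spread load across partitions while keeping per-key ordering. A simplified sketch (Kafka's default partitioner hashes the key bytes with murmur2; here `zlib.crc32` is a self-contained stand-in, and the partition count is arbitrary):

```python
import zlib

NUM_PARTITIONS = 6

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # The same key always maps to the same partition, so all messages
    # for one key stay ordered while different keys fan out in parallel.
    return zlib.crc32(key) % num_partitions

keys = [b"user-1", b"user-2", b"user-1", b"user-3"]
assignments = [partition_for(k) for k in keys]
print(all(0 <= p < NUM_PARTITIONS for p in assignments))  # True
print(assignments[0] == assignments[2])  # True: same key, same partition
```

Each partition is an independent append-only log on some broker, so consumers in a group can drain different partitions concurrently across cores, disks, and machines.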
Total Throughput = BrokerCount × PerBrokerThroughput × ReplicationFactor (for reads, if replicas share read load)

Example: 20 brokers, 500,000 msgs/s per broker, replication factor = 3
Read QPS = 20 × 500,000 × 3 = 30,000,000 QPS (replicas share reads)
Write QPS = 20 × 500,000 = 10,000,000 QPS
Daily volume = 10,000,000 msgs/s × 86,400 s = 864 billion messages/day

By splitting topics into many partitions, producers and consumers can read and write concurrently, brokers can handle requests in parallel, and the partitioning scheme also simplifies replication and fault isolation, ensuring high availability and data consistency.
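The capacity model above is easy to encode as a back-of-the-envelope calculator. This sketch makes the same simplifying assumptions as the worked example: load is perfectly even across brokers, and follower replicas can serve reads.

```python
def cluster_capacity(brokers: int, per_broker_msgs_s: int, replication: int):
    """Rough Kafka cluster capacity estimate (idealized, even load)."""
    write_qps = brokers * per_broker_msgs_s          # leaders take all writes
    read_qps = write_qps * replication               # if replicas share reads
    daily_msgs = write_qps * 86_400                  # seconds per day
    return write_qps, read_qps, daily_msgs

w, r, d = cluster_capacity(brokers=20, per_broker_msgs_s=500_000, replication=3)
print(w)  # 10000000
print(r)  # 30000000
print(d)  # 864000000000  (~864 billion messages/day)
```

Real clusters fall short of the ideal because of partition skew, replication traffic competing with client traffic, and retention-driven disk I/O, so treat the result as an upper bound.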
Architect Chen
Sharing over a decade of architecture experience from Baidu, Alibaba, and Tencent.