Why Is Kafka So Fast? Unveiling the Secrets Behind Its High Throughput
This article explains how Kafka achieves remarkable speed and massive throughput by using sequential disk I/O, OS page cache, zero‑copy transfers, partitioned log segments with indexes, batch processing, and efficient compression, making it a cornerstone of modern big‑data pipelines.
Kafka is a ubiquitous messaging middleware in the big data field, widely used for real‑time data pipelines and stream processing.
Although Kafka stores data on disk, it achieves high performance, high throughput and low latency, often handling tens of thousands to millions of messages per second.
1. Sequential Read/Write
Kafka appends messages to the end of log files, so its disk writes are sequential. Sequential I/O can be orders of magnitude faster than random I/O, especially on spinning disks where every seek is expensive, and this design is a major reason for Kafka's high write throughput.
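Each partition is an append-only log stored on disk as a directory of segment files. Consuming a message does not delete it; each consumer tracks its own position as an offset. Older clients stored offsets in ZooKeeper, while newer versions commit them to an internal Kafka topic (__consumer_offsets).
Because consumption never removes data, Kafka provides two retention policies, time-based and size-based, to eventually discard old data.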
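The append-and-resume idea can be sketched in a few lines of Python. This is a toy illustration, not Kafka's on-disk format: the class name AppendOnlyLog, the 4-byte length prefix, and the use of byte positions as offsets are all assumptions made for the example (real Kafka offsets are logical record numbers).

```python
import os
import struct
import tempfile


class AppendOnlyLog:
    """Toy append-only log: each record is a 4-byte length prefix plus payload."""

    def __init__(self, path):
        self.path = path
        # "ab" opens for appending, so every write lands at the end of the file.
        self._file = open(path, "ab")
        self._end = os.path.getsize(path)

    def append(self, payload: bytes) -> int:
        """Append one record and return the byte position where it starts."""
        position = self._end
        self._file.write(struct.pack(">I", len(payload)) + payload)
        self._file.flush()
        self._end += 4 + len(payload)
        return position

    def read_from(self, position: int):
        """Yield payloads sequentially, starting at a record boundary."""
        with open(self.path, "rb") as f:
            f.seek(position)
            while True:
                header = f.read(4)
                if len(header) < 4:
                    break
                (length,) = struct.unpack(">I", header)
                yield f.read(length)

    def close(self):
        self._file.close()


def demo_append_only_log():
    with tempfile.TemporaryDirectory() as tmp:
        log = AppendOnlyLog(os.path.join(tmp, "partition-0.log"))
        positions = [log.append(m) for m in (b"a", b"bb", b"ccc")]
        # A consumer that remembers only a position can resume from there;
        # reading does not remove anything from the log.
        resumed = list(log.read_from(positions[1]))
        log.close()
        return positions, resumed
```

Note that the reader never modifies the file: the only state a consumer needs is the position it has reached.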
2. Page Cache
Kafka leverages the operating system’s page cache instead of JVM heap memory, avoiding object overhead and garbage‑collection pauses, and benefiting from OS‑level optimizations such as write‑behind, read‑ahead, and flush.
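The difference between handing data to the page cache and forcing it to disk can be illustrated with a small Python sketch. The helper name append_record and its force_to_disk flag are hypothetical; Kafka's own flush behavior is configurable and is not modeled here.

```python
import os
import tempfile


def append_record(path, payload, force_to_disk=False):
    """Append payload to path and return the resulting file size."""
    with open(path, "ab") as f:
        f.write(payload)
        # flush() hands the bytes from Python's buffer to the OS. They now
        # sit in the page cache and are already visible to other readers.
        f.flush()
        if force_to_disk:
            # fsync() blocks until the OS has pushed the data to the device.
            os.fsync(f.fileno())
    return os.path.getsize(path)


def demo_page_cache():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "partition.log")
        append_record(path, b"fast path\n")
        size = append_record(path, b"durable path\n", force_to_disk=True)
        with open(path, "rb") as f:
            return size, f.read()
```

The first call returns as soon as the OS has accepted the bytes; the second waits for the storage device. Leaving the write-back to the OS is what keeps the common path cheap.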
3. Zero‑Copy
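Linux's sendfile system call moves data directly from the kernel page cache to the network socket, eliminating the copies between kernel space and user space and reducing CPU overhead and context switches.
Without zero-copy, data is copied four times on its way from disk to the network: from disk into the kernel page cache, from the page cache into a user-space buffer, from that buffer into the kernel socket buffer, and from the socket buffer to the network card. With zero-copy, Kafka skips the two copies that pass through user space.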
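As a rough illustration, Python exposes the same mechanism through socket.sendfile(), which uses os.sendfile where the platform supports it and otherwise falls back to ordinary sends. Kafka itself is written in Java and relies on FileChannel.transferTo; the helper names, file name, and payload below are made up for the sketch.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path, sock):
    """Send a whole file over a connected socket; returns the bytes sent."""
    with open(path, "rb") as f:
        # The file contents are never read into a Python bytes object here;
        # where os.sendfile is available the kernel moves them directly.
        return sock.sendfile(f)


def demo_zero_copy():
    payload = b"kafka-record-" * 100
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "segment.log")
        with open(path, "wb") as f:
            f.write(payload)
        sender, receiver = socket.socketpair()
        try:
            sent = send_file_zero_copy(path, sender)
            sender.close()  # signal end-of-stream to the receiver
            chunks = []
            while True:
                chunk = receiver.recv(4096)
                if not chunk:
                    break
                chunks.append(chunk)
        finally:
            sender.close()
            receiver.close()
    return sent, b"".join(chunks), payload
```

The sending side never touches the record bytes in application code, which is the property that makes the broker's consumer path so cheap.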
4. Partitioning, Segmentation & Indexing
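Messages are organized by topic; each topic is split into partitions, and each partition's log is further divided into segments. Every segment's data file has an accompanying .index file that maps offsets to byte positions, so a read can jump close to the requested offset instead of scanning the whole log. Partitions also let producers and consumers work in parallel.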
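The two-step lookup (pick the segment, then consult its index) can be sketched with binary search. The function names and sample numbers are invented for illustration, and the index here stores absolute offsets for simplicity, whereas Kafka's index files store offsets relative to the segment's base offset.

```python
import bisect


def find_segment(base_offsets, target_offset):
    """Return the base offset of the segment that contains target_offset.

    base_offsets is a sorted list of the first offset in each segment.
    """
    i = bisect.bisect_right(base_offsets, target_offset) - 1
    if i < 0:
        raise KeyError(f"offset {target_offset} is before the first segment")
    return base_offsets[i]


def find_scan_start(sparse_index, target_offset):
    """Return the byte position to start scanning from.

    sparse_index is a sorted list of (offset, byte_position) pairs that
    covers only some of the records in the segment.
    """
    offsets = [offset for offset, _ in sparse_index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    # No entry at or before the target: scan from the start of the segment.
    return sparse_index[i][1] if i >= 0 else 0


# Hypothetical partition with three segments and one sparse index.
segments = [0, 1000, 2000]
index_for_1000 = [(1000, 0), (1200, 8192), (1400, 16384)]

segment = find_segment(segments, 1250)               # 1000
start = find_scan_start(index_for_1000, 1250)        # 8192
```

Because the index is sparse, the reader still scans a short stretch of the segment after the jump, but that stretch is bounded by the spacing between index entries.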
5. Batch I/O
Both reads and writes are performed in batches; producers can enable batch writes to reduce network round‑trips, and consumers read batches of records.
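A minimal sketch of producer-side batching follows. The BatchingSender class and its max_batch_bytes parameter are hypothetical; they loosely mirror the idea behind the Kafka producer's batch.size setting, and time-based flushing (the role linger.ms plays) is left out to keep the example short.

```python
class BatchingSender:
    """Buffer records and hand them to `send` one batch at a time."""

    def __init__(self, send, max_batch_bytes=16384):
        self._send = send
        self._max_batch_bytes = max_batch_bytes
        self._buffer = []
        self._buffered_bytes = 0

    def add(self, record: bytes):
        # Flush first if this record would push the batch over the limit.
        too_big = self._buffered_bytes + len(record) > self._max_batch_bytes
        if self._buffer and too_big:
            self.flush()
        self._buffer.append(record)
        self._buffered_bytes += len(record)

    def flush(self):
        if self._buffer:
            self._send(list(self._buffer))  # one call for many records
            self._buffer.clear()
            self._buffered_bytes = 0


def demo_batching():
    batches = []
    sender = BatchingSender(batches.append, max_batch_bytes=10)
    for record in (b"aaaa", b"bbbb", b"cccc", b"dd"):
        sender.add(record)
    sender.flush()
    return batches
```

Four records result in only two calls to send, which stands in for two network requests instead of four.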
6. Batch Compression
Kafka compresses whole batches of messages (e.g., with Gzip or Snappy) rather than individual messages. Compressing a batch exploits the redundancy across similar messages, which reduces network I/O; the consumer decompresses the batch on its side.
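The benefit of compressing at the batch level is easy to check with Python's gzip module. The sample messages and the newline framing are assumptions for the example and do not reflect Kafka's actual record batch format.

```python
import gzip


def compare_compression(messages):
    """Return (bytes when compressed one by one, bytes when compressed as a batch)."""
    individual = sum(len(gzip.compress(m)) for m in messages)
    batch = len(gzip.compress(b"\n".join(messages)))
    return individual, batch


def round_trip(messages):
    """Compress a batch as the producer would, then decompress it as a consumer would."""
    compressed = gzip.compress(b"\n".join(messages))
    return gzip.decompress(compressed).split(b"\n")


# Hypothetical, highly repetitive event messages.
messages = [
    f'{{"user_id": {i}, "event": "page_view", "region": "eu-west"}}'.encode()
    for i in range(100)
]

individual_size, batch_size = compare_compression(messages)
```

Small messages compressed one at a time barely shrink, because each one carries its own header and has little internal repetition. Compressed together, the shared field names and values are encoded once.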
Overall, Kafka’s speed stems from sequential disk writes, OS page cache, zero‑copy transfers, partitioned log segments with indexes, and batch processing with compression.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
