Why Kafka Handles Millions of Messages Per Second: Batch, Partition, Zero‑Copy, and Compression Explained
This article breaks down the core techniques that give Kafka its high‑throughput capability, including producer batch settings (batch.size, linger.ms), broker append‑only writes, consumer poll configuration, partition distribution, zero‑copy data transfer, dual‑thread processing, and configurable compression algorithms.
High‑Throughput Design
Kafka is "fast" primarily in the sense of high throughput – the ability to handle millions of records per second. This is realized through a set of low‑overhead mechanisms spanning the producer, broker, and consumer.
Batch Processing
Both producers and consumers work with batches.
Producer side: batch.size (default 16384 bytes) defines the maximum batch size in bytes. linger.ms (default 0 ms) defines how long the producer will wait for the batch to fill before sending. A batch is sent when either limit is reached.
Broker side: Incoming batches are appended to the partition log, which reduces disk I/O by writing sequentially.
Consumer side: poll() returns a batch of records. The max.poll.records setting (default 500) caps the number of records per poll.
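The producer‑side batch settings above can be captured in a plain java.util.Properties sketch. The broker address and the chosen values are illustrative, not recommendations:

```java
import java.util.Properties;

public class BatchConfig {
    // Producer properties tuned for batching; "localhost:9092" is a placeholder.
    static Properties batchProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", "32768"); // 32 KB batches (default 16384)
        props.put("linger.ms", "10");     // wait up to 10 ms for a batch to fill
        return props;
    }
}
```

Raising linger.ms above 0 trades a few milliseconds of latency for fuller batches and fewer network round trips.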
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> r : records) {
        System.out.printf("Offset=%d, Key=%s, Value=%s%n",
                r.offset(), r.key(), r.value());
    }
}
Partition Mechanism
Messages are distributed across the partitions of a topic. If the record key is null, older clients spread records round‑robin, while clients since Kafka 2.4 default to a sticky partitioner that fills one batch before moving to the next partition. If a key is present, the partitioner hashes the key and maps the hash to a partition, providing load balancing and per‑key ordering guarantees.
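The keyed case can be sketched as follows. This is a simplified stand‑in: Kafka's default partitioner hashes key bytes with murmur2, while here Arrays.hashCode plays that role to keep the example self‑contained:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class KeyPartitioner {
    // Sketch of keyed partition selection: hash the key bytes, clear the
    // sign bit so the result is non-negative, then take it modulo the
    // partition count. The same key always maps to the same partition.
    static int partitionFor(String key, int numPartitions) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = Arrays.hashCode(bytes) & 0x7fffffff;
        return hash % numPartitions;
    }
}
```

Because the mapping is deterministic, all records sharing a key land on one partition, which is what gives Kafka its per‑key ordering guarantee.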
Zero‑Copy Transfer
Traditional I/O involves four memory copies and four context switches (disk → read buffer → application buffer → socket buffer → NIC). Kafka’s zero‑copy reduces this to two DMA copies plus a single pointer copy:
Broker reads data from disk directly into the OS page cache (read buffer) via DMA.
The read buffer is memory‑mapped to the socket buffer (pointer copy, no data copy).
The socket buffer is DMA‑transferred to the NIC buffer.
This cuts CPU overhead and improves throughput.
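On the JVM this path is exposed through FileChannel.transferTo, which hands the copy to the kernel instead of staging bytes in user‑space buffers. A minimal sketch, copying between two files rather than to a real socket:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopyDemo {
    // transferTo lets the kernel move bytes from the page cache straight to
    // the destination channel; transferTo may move fewer bytes than asked,
    // so loop until the whole file has been shipped.
    static long transfer(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long transferred = 0;
            long size = in.size();
            while (transferred < size) {
                transferred += in.transferTo(transferred, size - transferred, out);
            }
            return transferred;
        }
    }
}
```

With a SocketChannel as the destination instead of a file, this is essentially the call the broker makes when serving fetch requests.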
Dual‑Thread Producer
The producer uses two threads:
Main thread creates ProducerRecord objects, runs them through interceptors, serializers and the partitioner, and stores them in the message accumulator.
Sender thread pulls full batches from the accumulator and performs the actual network I/O.
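The hand‑off between the two threads can be sketched with a queue standing in for the accumulator. The names here are illustrative (Kafka's internal class is RecordAccumulator, which batches per partition rather than globally):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AccumulatorSketch {
    private final BlockingQueue<String> accumulator = new LinkedBlockingQueue<>();

    // Main thread: after interceptors, serialization, and partitioning, the
    // record is only appended here, never sent directly over the network.
    void append(String record) {
        accumulator.add(record);
    }

    // Sender thread: drain up to batchSize records and ship them in one I/O call.
    List<String> drainBatch(int batchSize) {
        List<String> batch = new ArrayList<>();
        accumulator.drainTo(batch, batchSize);
        return batch;
    }
}
```

Decoupling the two threads means application code never blocks on network I/O, and the sender can amortize one round trip over many records.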
ProducerRecord<String, String> rec =
        new ProducerRecord<>("Topic1", "12345", "order_event");
producer.send(rec);
Compression
The producer can compress whole batches with the compression.type property (default none). Supported algorithms are gzip, snappy, lz4 and zstd. zstd gives the highest compression ratio, while lz4 provides the lowest CPU cost and fastest (de)compression. The broker stores and forwards the compressed bytes without decompressing; consumers decompress locally.
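Selecting a codec is a one‑line producer setting. A minimal sketch (broker address is a placeholder):

```java
import java.util.Properties;

public class CompressionConfig {
    // Pick a codec per workload: "lz4" for the lowest CPU cost, "zstd" for
    // the best ratio. compression.type is a standard producer setting.
    static Properties withCompression(String codec) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("compression.type", codec); // none | gzip | snappy | lz4 | zstd
        return props;
    }
}
```

Since compression applies to whole batches, it pairs well with the batch.size and linger.ms tuning above: fuller batches compress better.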
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Senior Tony
Former senior tech manager at Meituan, ex‑tech director at New Oriental, with experience at JD.com and Qunar; specializes in Java interview coaching and regularly shares hardcore technical content. Runs a video channel of the same name.
