Why Kafka Handles Millions of Writes per Second: I/O Secrets Revealed

This article explains how Kafka achieves ultra‑high throughput by using sequential disk writes, memory‑mapped files, zero‑copy sendfile reads, and batch compression, while also describing its data retention policies and the trade‑offs of synchronous versus asynchronous producer modes.

ITPUB

Jul 1, 2019

1. Writing Data

Kafka persists every received message to disk rather than holding it only in broker memory. It keeps those writes fast with two techniques: sequential I/O and memory-mapped files (MMFile).

Sequential Write

Disk performance depends heavily on the access pattern. Sequential reads and writes avoid seek time and, helped by OS read-ahead and write-behind, can approach memory speed, whereas random I/O pays for a seek on almost every operation. Kafka therefore only appends messages to the end of each partition's log file and never writes into the middle of it.

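This design has several benefits:

Sequential disk I/O can be faster than random memory access.

JVM garbage-collection overhead is reduced because messages live in files and the OS page cache rather than as objects on the JVM heap.

The OS page cache stays warm across a broker restart, whereas an in-process cache would have to be rebuilt from scratch.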
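The append-only pattern can be sketched in a few lines of Python. This is an illustration only, not Kafka's on-disk format: the AppendOnlyLog class, the 4-byte length prefix, and the read_all helper are assumptions made for the example.

```python
import os
import struct


class AppendOnlyLog:
    """Toy partition log: every record is appended to the end of one file."""

    def __init__(self, path):
        # "ab" opens the file in append mode, so writes always land at the end.
        self._file = open(path, "ab")
        self._position = os.path.getsize(path)

    def append(self, payload: bytes) -> int:
        """Append one length-prefixed record and return its starting byte position."""
        start = self._position
        record = struct.pack(">I", len(payload)) + payload
        self._file.write(record)
        self._position += len(record)
        return start

    def close(self):
        self._file.close()


def read_all(path):
    """Scan the file front to back, which is a purely sequential read."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            (length,) = struct.unpack(">I", header)
            records.append(f.read(length))
    return records
```

Because a record is never rewritten in place, the position returned by append stays valid for the life of the file, which is what makes a simple numeric offset enough to track a reader's progress.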

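Kafka does not delete a message when it is consumed. The log is retained, and each consumer tracks its own progress with an offset (stored in ZooKeeper by older clients, and in the internal __consumer_offsets topic since Kafka 0.9).

Old data is removed only by the retention policy, which comes in two forms: time-based (for example log.retention.hours) and size-based (log.retention.bytes).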
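The two retention policies can be sketched as a pure function over a list of log segments. This is a simplified model, not the broker's actual algorithm: the Segment type and segments_to_delete function are hypothetical, and the rule of always keeping the newest segment is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    size_bytes: int
    last_modified: float  # seconds since the epoch


def segments_to_delete(segments, now, retention_seconds=None, retention_bytes=None):
    """Return names of segments to drop; `segments` is ordered oldest to newest."""
    doomed = []
    remaining = list(segments)

    # Time-based policy: drop segments older than the retention window.
    if retention_seconds is not None:
        while len(remaining) > 1 and now - remaining[0].last_modified > retention_seconds:
            doomed.append(remaining.pop(0).name)

    # Size-based policy: drop the oldest segments until the total fits the budget.
    if retention_bytes is not None:
        total = sum(s.size_bytes for s in remaining)
        while len(remaining) > 1 and total > retention_bytes:
            oldest = remaining.pop(0)
            total -= oldest.size_bytes
            doomed.append(oldest.name)

    return doomed
```

Deleting whole segments from the old end of the log keeps retention cheap: no file is ever rewritten, it is only unlinked.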

Memory‑Mapped Files (mmap)

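Kafka maps files into the broker's address space. Writes go to the mapped memory and the OS flushes the dirty pages to disk later, which avoids a user-space to kernel-space copy on every write and yields large I/O gains. (In current Kafka releases the memory-mapped files are the offset and time indexes; log segments are appended through a file channel and rely on the same OS page cache.) The trade-off is that data is not durable until a flush happens. How often the broker forces a flush is configurable (log.flush.interval.messages, log.flush.interval.ms), and the legacy producer's producer.type setting chooses whether sends are synchronous or asynchronous, that is, batched in the background.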
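The mechanism, and the durability trade-off that comes with it, can be illustrated with Python's mmap module. This is a sketch under assumptions: write_via_mmap and its sync flag are hypothetical names for the example and do not correspond to a Kafka API.

```python
import mmap


def write_via_mmap(path, payload: bytes, sync: bool) -> None:
    """Write `payload` to `path` through a memory mapping."""
    if not payload:
        raise ValueError("cannot map a zero-length file")

    # Pre-size the file, because a mapping cannot extend past the end of the file.
    with open(path, "wb") as f:
        f.truncate(len(payload))

    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), len(payload)) as mapped:
            # A plain memory store: no write() system call is issued here.
            mapped[:] = payload
            if sync:
                # Ask the OS to push the dirty pages to disk before returning.
                mapped.flush()
            # With sync=False the OS writes the pages back whenever it chooses.
```

With sync=True the call returns only after the flush request completes, which is slower but narrows the window in which a crash can lose data. With sync=False the write is as fast as a memory copy.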

2. Reading Data

Kafka optimizes reads with Zero‑Copy sendfile and batch compression.

Zero‑Copy sendfile

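A traditional read-then-write path copies the data four times: disk → kernel buffer → user buffer → socket buffer → protocol engine (the network interface). The sendfile system call moves data from the kernel file cache to the socket buffer without passing through user space, cutting four copies to three and four context switches to two. On Linux kernels and network cards that support scatter-gather DMA, even the copy into the socket buffer is skipped, because only a descriptor pointing at the cached pages is handed to the card.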

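Kafka stores each partition as a set of segment files. When a consumer requests data, the broker hands the relevant byte range of the file to the socket with sendfile (exposed in Java as FileChannel.transferTo), so the message payload is never copied into the broker's heap.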
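The same idea can be tried from Python. The sketch below is not broker code: serve_file_zero_copy is a hypothetical helper. It relies on socket.sendfile, which uses os.sendfile where the platform provides it and falls back to ordinary send calls otherwise.

```python
import socket


def serve_file_zero_copy(path, conn: socket.socket, offset: int = 0, count=None) -> int:
    """Send `count` bytes of the file at `path`, starting at `offset`, over `conn`.

    Returns the number of bytes sent. The file contents are not read into a
    Python bytes object when the platform supports os.sendfile.
    """
    with open(path, "rb") as f:
        return conn.sendfile(f, offset=offset, count=count)
```

The offset and count parameters mirror how a consumer fetch works: the request names a starting position in the log, and the broker ships a contiguous byte range from that point.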

Batch Compression

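Network I/O is often the bottleneck, so Kafka compresses a whole batch of messages together rather than each message individually, which gives the codec more repeated content to work with. Supported codecs include Gzip and Snappy, and newer releases add LZ4 and Zstandard. The batch travels over the network in compressed form, and the consumer decompresses it and reads the messages in their original order.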
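The gain from batching can be checked with a small experiment. This sketch uses Python's gzip module with a length-prefixed framing chosen for the example; it does not reproduce Kafka's record batch format, and compress_batch and decompress_batch are hypothetical names.

```python
import gzip
import struct


def compress_batch(messages):
    """Frame each message with a 4-byte length, then compress the whole batch once."""
    framed = b"".join(struct.pack(">I", len(m)) + m for m in messages)
    return gzip.compress(framed)


def decompress_batch(blob):
    """Inverse of compress_batch; messages come back in their original order."""
    framed = gzip.decompress(blob)
    messages = []
    pos = 0
    while pos < len(framed):
        (length,) = struct.unpack_from(">I", framed, pos)
        pos += 4
        messages.append(framed[pos:pos + length])
        pos += length
    return messages


def individually_compressed_size(messages):
    """Total bytes if every message were compressed on its own."""
    return sum(len(gzip.compress(m)) for m in messages)
```

For small, similar messages (JSON events sharing the same field names, for instance), the single compressed batch is typically far smaller than the sum of individually compressed messages, because each standalone gzip stream carries its own header and trailer and cannot reuse patterns seen in its neighbours.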

3. Summary

Kafka achieves high throughput by appending messages to sequential log files, using memory‑mapped files for fast I/O, retaining data with configurable time‑ or size‑based policies, and reading data via Zero‑Copy sendfile together with batch compression, thus minimizing disk and network overhead.
