Understanding Kafka Zero‑Copy and Parallel FileTransferTo Performance

This article explains Kafka's underlying storage architecture, describes the zero‑copy transferTo technique in Java, compares traditional four‑copy I/O with zero‑copy, and presents parallel FileTransferTo performance tests on a multi‑core Linux system.

Top Architect

1. Introduction

Previously we investigated high‑throughput parallel storage for large‑scale log streams and examined Kafka's low‑level storage mechanism. We found that Kafka's zero‑copy implementation relies on Java's FileChannel.transferTo() method, and we later implemented a parallel TransferTo approach and integrated it into Apache Kafka.

2. Message Storage Mechanism

Kafka is a distributed publish‑subscribe messaging system. Each topic is logical and consists of one or more partitions, which may reside on different brokers. Physically, a partition maps to a directory containing multiple segments; each segment holds a data file and a corresponding index file. Conceptually, a partition can be viewed as a very long array that can be accessed by its offset.
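The "long array indexed by offset" view can be sketched in a few lines. This is a toy model, not Kafka's code: it assumes each segment is identified by the offset of its first message, and the class name `PartitionSketch`, its methods, and the zero‑padded file names are hypothetical choices made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a partition: segments keyed by the offset of their first message. */
class PartitionSketch {
    // Base offset -> data file name (hypothetical naming, for illustration only).
    private final TreeMap<Long, String> segments = new TreeMap<>();

    void addSegment(long baseOffset) {
        // Name the data file after its zero-padded base offset.
        segments.put(baseOffset, String.format("%020d.log", baseOffset));
    }

    /** Returns the data file whose segment covers the offset, or null if none does. */
    String segmentFor(long offset) {
        // floorEntry finds the segment with the greatest base offset <= offset.
        Map.Entry<Long, String> entry = segments.floorEntry(offset);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        PartitionSketch partition = new PartitionSketch();
        partition.addSegment(0L);
        partition.addSegment(1000L);
        partition.addSegment(2500L);
        // Offset 1234 falls in the segment that starts at offset 1000.
        System.out.println(partition.segmentFor(1234L));
    }
}
```

A sorted map keeps the lookup logarithmic in the number of segments; the index file of the chosen segment would then narrow the search to a position inside the data file.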

3. Kafka's Zero‑Copy Technique

In Kafka, messages are stored on the underlying file system. When a consumer fetches messages from a topic, the data must be read from disk and written to a socket. The naïve approach copies the data from kernel space to user space and back again, causing four copies and four context switches, which wastes CPU cycles and memory bandwidth.

Zero‑copy eliminates these extra copies by allowing the kernel to transfer data directly from the file descriptor to the socket descriptor, bypassing user space. Java provides this capability through java.nio.channels.FileChannel.transferTo(), which internally invokes the OS sendfile() system call on Linux/UNIX.

3.1 Traditional Four Copies and Four Context Switches

Consider reading data from a file and sending it over the network:

read(fileDesc, buf, len);    /* file data ends up in the user-space buffer buf */
send(socket, buf, len, 0);   /* buf is copied into the kernel socket buffer */

The process involves:

read() triggers a user‑to‑kernel context switch; the kernel reads data into a kernel buffer via DMA.

The data is copied from the kernel buffer to a user buffer and read() returns, causing a kernel‑to‑user switch.

send() triggers another user‑to‑kernel switch; the data is copied into a kernel socket buffer.

send() returns, causing a final kernel‑to‑user switch, after which DMA moves the data to the NIC.

Although an intermediate kernel buffer seems inefficient, it can act as a read‑ahead cache and enables asynchronous writes.
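In Java, the traditional path described above looks roughly like the following. It is a minimal sketch: the class name `TraditionalCopy`, the 8 KB buffer size, and the command‑line arguments are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.file.Files;
import java.nio.file.Paths;

class TraditionalCopy {
    /** Copies all bytes from in to out through a user-space buffer; returns the byte count. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];           // user-space buffer
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {     // kernel buffer -> user buffer
            out.write(buf, 0, n);              // user buffer -> kernel socket buffer
            total += n;
        }
        out.flush();
        return total;
    }

    // Usage (hypothetical): java TraditionalCopy <file> <host> <port>
    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]));
             Socket socket = new Socket(args[1], Integer.parseInt(args[2]));
             OutputStream out = socket.getOutputStream()) {
            System.out.println(copy(in, out) + " bytes sent");
        }
    }
}
```

Every loop iteration crosses the user/kernel boundary twice and moves the same bytes through `buf`, which is the overhead zero‑copy is meant to avoid.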

3.2 Applying Zero‑Copy in Kafka

By using transferTo(), the two copies through user space (the second and third copies above) are avoided. The method transfers data from the file channel to the socket channel without passing it through an application buffer.

transferTo() method signature (declared in java.nio.channels.FileChannel):

public abstract long transferTo(long position, long count, WritableByteChannel target) throws IOException;

Internally, on Linux it maps to the sendfile() system call:

#include <sys/sendfile.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

The steps are:

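DMA copies the file data into a kernel read buffer, and the kernel then copies it into the socket's kernel buffer. No data crosses into user space.

A second DMA copy moves the data from the kernel socket buffer to the NIC.

Compared with the traditional path, this cuts context switches from four to two and data copies from four to three. When the network interface supports scatter‑gather operations, the kernel can reduce copies further: only descriptors holding the location and length of the data are appended to the socket buffer, and the DMA engine reads the payload directly from the kernel read buffer. That leaves two DMA copies and no CPU copy.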
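A minimal sketch of calling transferTo() from application code follows. The class name `ZeroCopySend` and the command‑line arguments are hypothetical, the target is assumed to be a blocking channel, and whether the JVM actually uses sendfile() depends on the platform and the target channel type.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class ZeroCopySend {
    /** Sends the whole file to target with transferTo(); returns the bytes transferred. */
    static long sendFile(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = source.size();
            long position = 0;
            while (position < size) {
                // transferTo() may move fewer bytes than requested, so loop until done.
                long sent = source.transferTo(position, size - position, target);
                // Stop if the file shrank underneath us and nothing is left to send.
                if (sent == 0 && position >= source.size()) {
                    break;
                }
                position += sent;
            }
            return position;
        }
    }

    // Usage (hypothetical): java ZeroCopySend <file> <host> <port>
    public static void main(String[] args) throws IOException {
        InetSocketAddress address = new InetSocketAddress(args[1], Integer.parseInt(args[2]));
        try (SocketChannel socket = SocketChannel.open(address)) {
            System.out.println(sendFile(Paths.get(args[0]), socket) + " bytes sent");
        }
    }
}
```

The loop matters because the return value is the number of bytes actually transferred, which can be smaller than `count` for large files or busy sockets.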

4. Parallel FileTransferTo Performance Test

We evaluated the parallel version of FileTransferTo using multiple threads to see if parallel I/O improves throughput.

Test environment:

OS: CentOS 5.10
CPU: Intel Xeon E7420 @ 2.13 GHz, 16 logical CPUs
RAM: 16 GB
Test file size: 1.2 GB

Result: the parallel implementation performed worse than the serial version, which suggests that the underlying storage or kernel configuration limited scalability. The source code is available at https://github.com/Tjcug/kafkaParallelIO.
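For readers who want to try a comparable experiment, one way to parallelize transferTo() is to split the file into contiguous regions and give each region to its own thread. The sketch below is not the code from the repository above: it assumes a file‑to‑file copy, and the class name `ParallelTransferSketch`, its methods, and the chunking scheme are hypothetical.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelTransferSketch {
    /** Copies length bytes starting at start from src to the same position in dst. */
    static long copyRegion(Path src, Path dst, long start, long length) throws IOException {
        // Each worker opens its own channels so that channel positions are not shared.
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
            out.position(start);
            long done = 0;
            while (done < length) {
                long n = in.transferTo(start + done, length - done, out);
                if (n <= 0) {
                    break; // source shorter than expected
                }
                done += n;
            }
            return done;
        }
    }

    /** Splits src into one contiguous region per thread and copies the regions in parallel. */
    static long parallelCopy(Path src, Path dst, int threads)
            throws IOException, InterruptedException, ExecutionException {
        long size;
        // Create (or truncate) the destination before the workers open it for writing.
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            size = in.size();
        }
        long chunk = (size + threads - 1) / threads; // ceiling division
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (long start = 0; start < size; start += chunk) {
                final long regionStart = start;
                final long regionLength = Math.min(chunk, size - start);
                Callable<Long> task = () -> copyRegion(src, dst, regionStart, regionLength);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

Timing `parallelCopy(src, dst, 1)` against higher thread counts on the same file gives a rough serial versus parallel comparison; the outcome depends heavily on the storage device and page cache state.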

Overall, the article demonstrates how Kafka leverages zero‑copy to reduce CPU overhead and context switches, and it provides a practical performance comparison of parallel versus serial file transfer on a typical server configuration.
