How Kafka Achieves High Throughput: Architecture, GC Tweaks, and Memory Buffer Pools
This article explains how Kafka’s architecture, use of OS page cache, Sendfile optimization, and a custom memory buffer pool work together to minimize JVM garbage collection overhead and deliver the massive throughput required by big‑data messaging workloads.
Kafka is widely known as a high‑throughput message queue and the first choice for big‑data scenarios where massive volumes of messages must be handled per unit of time. This article explains how Kafka’s architecture and GC optimizations enable that scale.
Kafka Architecture
Key terminology:
Topic : logical grouping of messages; a topic can be distributed across multiple brokers.
Partition : the unit of horizontal scaling and parallelism; each topic has at least one partition.
Offset : the sequential number of a message within a partition.
Consumer : reads/consumes messages from brokers.
Producer : writes/produces messages to brokers.
Replication : Kafka replicates messages per partition; each partition can be configured with one or more replicas.
Leader : the unique replica that handles all read/write requests for a partition; other replicas sync from the leader, similar to MySQL binlog replication.
Broker : receives producer and consumer requests and persists messages to local disk. One broker in the cluster acts as the controller, handling leader elections and partition migrations.
ISR (In‑Sync Replica) : the subset of replicas that are alive and caught up with the leader. If a replica falls behind a configurable threshold, it is removed from the ISR.
These concepts form the core of Kafka’s design, which remains surprisingly simple despite its powerful throughput.
Broker Design
Unlike in‑memory queues such as Redis, Kafka writes every message to high‑capacity, low‑speed disks, yet it does so without sacrificing performance by sticking to sequential I/O. Kafka’s own benchmarks on a RAID‑5, 7200 rpm setup show:
Sequential I/O: 600 MB/s
Random I/O: 100 KB/s
By restricting I/O to sequential access, Kafka avoids the performance penalty of random disk reads/writes.
Kafka heavily relies on the operating system’s page cache. Write operations are first placed into the page cache (marked dirty). Read operations first check the page cache, falling back to disk only on a miss. This approach avoids maintaining a duplicate cache in user space and reduces GC pressure because the cached data stays out of the JVM heap.
Advantages of using page cache instead of heap buffers:
JVM GC threads do not need to scan large heap regions, avoiding costly full GCs.
Objects in the heap carry object overhead, reducing effective memory utilization.
Only one copy of the data exists (in the OS page cache), effectively doubling usable cache space.
If a Kafka broker restarts, the OS page cache remains valid while in‑process caches are lost.
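To make the write path concrete, here is a minimal sketch of a broker‑style append in Java (the segment path is a placeholder): a plain sequential FileChannel write lands in the page cache as dirty pages, and the kernel flushes them to disk later.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class PageCacheAppendDemo {
    public static void main(String[] args) throws IOException {
        try (FileChannel log = FileChannel.open(
                Paths.get("/tmp/00000000000000000000.log"), // placeholder segment file
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ByteBuffer msg = ByteBuffer.wrap("message payload".getBytes());
            // write() returns as soon as the bytes sit in the OS page cache (dirty pages);
            // the kernel decides when they reach disk. An explicit log.force(true) would
            // fsync immediately -- the per-message flush Kafka deliberately avoids.
            log.write(msg);
        }
    }
}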
Beyond the page cache, Kafka employs the Sendfile system call to eliminate unnecessary user‑space copies. The traditional network I/O path involves four copies (disk → kernel page cache → user buffer → socket buffer → NIC buffer). Sendfile removes the two middle copies, reducing context switches and system calls.
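In the JVM this optimization is reachable through FileChannel.transferTo, which delegates to sendfile(2) on Linux and is what Kafka’s log layer uses to hand segment data to the socket. A minimal sketch (host, port, and file path are placeholders):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class SendfileDemo {
    public static void main(String[] args) throws IOException {
        try (FileChannel segment = FileChannel.open(
                 Paths.get("/tmp/00000000000000000000.log"), StandardOpenOption.READ); // placeholder
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9092))) {
            long position = 0;
            long remaining = segment.size();
            // transferTo moves bytes page cache -> socket entirely in kernel space;
            // no copy into a user-space buffer, no pass through the JVM heap.
            while (remaining > 0) {
                long sent = segment.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}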
Performance measurements from a 20‑broker cluster (75 partitions per broker, 110 k msg/s) show:
Write‑only phase: ~10 MB/s send traffic generated by inter‑broker replication, using asynchronous batch writes.
When read requests arrive, data can be served entirely from memory, achieving ~60 MB/s send traffic while disk reads stay below 50 KB/s.
When consumers fall behind and request older data that has already been evicted from the page cache, disk read traffic can rise above 40 MB/s, yet throughput remains stable because the OS read‑ahead pre‑fills the page cache for sequential reads.
Tips for Brokers
Avoid forcing log flushes via log.flush.interval.messages or log.flush.interval.ms; rely on replication for durability.
Tune /proc/sys/vm/dirty_background_ratio and /proc/sys/vm/dirty_ratio to control dirty‑page flushing.
When the dirty‑page ratio exceeds dirty_background_ratio, the kernel’s background flusher threads (pdflush on older kernels) start writing pages back asynchronously.
When it exceeds dirty_ratio, writing processes are blocked until enough pages have been flushed.
Adjust the ratios based on workload requirements.
Partition Mechanics
Partitions enable horizontal scaling, high concurrency, and replication. Kafka can move partitions across brokers to balance load, and custom partitioning algorithms can route messages with the same key to the same partition.
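As an illustration of key‑based routing, here is a sketch against the Java client’s org.apache.kafka.clients.producer.Partitioner interface (the hash choice is arbitrary, and unkeyed records are not handled):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // Same key -> same hash -> same partition, preserving per-key ordering
        // (assumes keyBytes is non-null, i.e. every record carries a key).
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}

The producer picks such a class up via the partitioner.class property.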
Only one consumer in a consumer group can read a given partition at a time, which simplifies offset management and maximizes page‑cache hit rates, allowing near‑line‑rate read performance.
However, having too many partitions can increase leader election latency during broker failures. Each partition election takes ~10 ms; with 500 partitions, a full election could stall reads/writes for ~5 seconds.
If the controller broker fails, it must fetch metadata for every partition from ZooKeeper, which can take 30–50 seconds for 10 000 partitions.
Tips for Partitions
Pre‑allocate the expected number of partitions; adding partitions later may break key‑to‑partition mapping.
Keep replica count reasonable and spread replicas across different racks when possible.
Perform clean broker shutdowns to avoid prolonged recovery or data corruption.
Producer Optimizations
As of version 0.8 the producer has been rewritten in Java, offering significant performance gains. Kafka groups multiple messages into a MessageSet , reducing per‑request RTT and allowing linear, sequential writes.
End‑to‑End compression (GZIP or Snappy) compresses data on the producer side, sends the compressed payload, and defers decompression until the consumer reads the message, minimizing network bandwidth usage.
However, MessageSet introduces a reliability trade‑off: the producer may consider a send successful before the batch is actually transmitted, risking data loss if the producer crashes.
Acknowledgement settings control this risk:
acks=0 : fire‑and‑forget; highest throughput, possible loss.
acks=1 : the leader must acknowledge; replicas sync asynchronously.
acks=-1 (or all) : all in‑sync replicas must acknowledge, providing the strongest durability at the cost of higher latency.
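Putting the producer‑side settings together, a minimal configuration sketch (the broker address and topic name are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerConfigDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");                // wait for all in-sync replicas
        props.put("compression.type", "snappy"); // compress batches end-to-end
        props.put("batch.size", 16384);          // bytes per partition batch
        props.put("linger.ms", 5);               // wait up to 5 ms to fill a batch
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
        }
    }
}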
Tips for Producers
Limit the number of producer threads, especially during mirroring or migration, to avoid message reordering.
Note that the default request.required.acks in 0.8 is 0.
Consumer Design
Consumers follow a conventional design. Using a consumer group, Kafka supports both publish‑subscribe and queue‑like consumption models.
The high‑level API depends on ZooKeeper and is easier to use but less performant. The low‑level API removes the ZooKeeper dependency, offering better performance and flexibility, but requires manual handling of errors, leader changes, and offset management.
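For comparison, the unified Java KafkaConsumer that later replaced both of these APIs hides the low‑level mechanics (leader discovery, offset commits) behind a simple poll loop. A minimal sketch, assuming a local broker and a topic named demo-topic:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("group.id", "demo-group");              // group used for partition assignment
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                // poll() fetches only from the partitions assigned to this group member.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records)
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
            }
        }
    }
}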
Tips for Consumers
Prefer the low‑level API when you need custom error handling, especially for corrupted data caused by unclean broker shutdowns.
How Kafka Handles Massive Message Volumes
Kafka achieves high throughput by batching and compressing messages, reducing the number of GC‑inducing heap allocations.
Kafka Memory Buffer Pool
The Kafka producer maintains a single BufferPool instance. The total pool size equals the sum of used and available memory (configured via buffer.memory, default 32 MB). The pool contains a deque of reusable ByteBuffer objects, each sized by batch.size (default 16 KB).
When a record batch is needed, the producer checks whether a free buffer of sufficient size exists. If not, it may block until enough memory becomes available. Allocation code:
int size = Math.max(this.batchSize, Records.LOG_OVERHEAD + Record.recordSize(key, value));
log.trace("Allocating a new {} byte message buffer for topic {} partition {}", size, tp.topic(), tp.partition());
ByteBuffer buffer = free.allocate(size, maxTimeToBlock);

Deallocation returns the buffer to the free list only if its size matches poolableSize; otherwise only the memory accounting is added back to the available pool, while the buffer object itself is left for the JVM GC to reclaim:
public void deallocate(ByteBuffer buffer, int size) {
    lock.lock();
    try {
        // Standard-sized buffers are cleared and returned to the free list for reuse.
        if (size == this.poolableSize && size == buffer.capacity()) {
            buffer.clear();
            this.free.add(buffer);
        } else {
            // Oversized buffers are dropped: only the byte count is credited back,
            // and the buffer itself becomes garbage for the JVM GC.
            this.availableMemory += size;
        }
        // Wake the longest-waiting allocator, if any, now that memory is available.
        Condition moreMem = this.waiters.peekFirst();
        if (moreMem != null)
            moreMem.signal();
    } finally {
        lock.unlock();
    }
}

If most messages fit within batch.size and free buffers are available, the pool recycles buffers, dramatically reducing GC pressure. Larger messages force fresh heap allocations, which in turn can trigger GC.
To minimize GC impact, adjust batch.size so that typical messages stay below the configured threshold.
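To make the recycle‑versus‑GC split concrete, here is a deliberately simplified pool in the same spirit — a hypothetical illustration, not Kafka’s actual BufferPool (blocking, memory accounting, and fairness are omitted):

import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

public class SimpleBufferPool {
    private final int poolableSize;
    private final Deque<ByteBuffer> free = new ArrayDeque<>();

    public SimpleBufferPool(int poolableSize) {
        this.poolableSize = poolableSize;
    }

    public synchronized ByteBuffer allocate(int size) {
        // Standard-sized requests are served from the free list when possible.
        if (size == poolableSize && !free.isEmpty())
            return free.pollFirst();
        // Oversized (or cache-miss) requests get a fresh allocation: future GC work.
        return ByteBuffer.allocate(size);
    }

    public synchronized void deallocate(ByteBuffer buffer) {
        if (buffer.capacity() == poolableSize) {
            buffer.clear();
            free.addFirst(buffer); // recycled, never becomes garbage
        }
        // Oversized buffers are simply dropped for the GC to collect.
    }
}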
Summary
Kafka’s use of an OS page cache, Sendfile, and a custom memory buffer pool allows the system to keep most data in memory, avoid costly heap allocations, and thus reduce JVM garbage‑collection pauses, resulting in higher throughput and lower latency for big‑data messaging workloads.