Avoid Kafka Pitfalls: Deep Production Lessons on Throughput, Reliability, and Scaling

This article shares hands‑on production experience with Kafka: how to tune producer throughput, prevent message loss, control duplicate consumption, handle backlog, maintain ordering, and understand rebalance, and why Kafka delivers such high performance.

Shepherd Advanced Notes

Apr 28, 2025

Background

Integrating Kafka has become standard for high‑concurrency systems, but moving from a basic "it works" setup to a stable, high‑efficiency deployment involves many pitfalls. The author draws on multiple production projects to discuss message‑loss prevention, duplicate‑consumption control, performance‑bottleneck optimization, cluster‑operation strategies, and design considerations for topics, partitions, and replicas.

1. Boosting Producer Throughput

The producer sends messages to the broker using two threads: the main thread and a Sender thread. The main thread appends records to the RecordAccumulator, which keeps a double‑ended queue of batches for each partition, and the Sender thread continuously drains ready batches from the accumulator and sends them to the broker.

Four key producer parameters can be tuned:

batch.size – maximum size in bytes of a single batch of records for one partition (default 16 KB). Increasing it raises throughput but may add latency and memory use.

linger.ms – time the Sender waits for a batch to fill before sending it (default 0 ms in clients before Kafka 4.0). In production, a value between 5 ms and 100 ms is recommended.

buffer.memory – total memory available to the RecordAccumulator (default 32 MB). Raising it expands buffering capacity; when the buffer is full, send() blocks for up to max.block.ms.

compression.type – compression algorithm for outgoing batches (none, gzip, snappy, lz4, zstd). Compression reduces network and storage load at the cost of some CPU.

The author likens these settings to a shuttle service: waiting for enough passengers before departing (batch.size & linger.ms) and using larger vehicles (buffer.memory) or compact seating (compression) to move more people efficiently.
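The four settings above can be collected into a single configuration object. The following is a minimal sketch using only java.util.Properties; the class name ProducerThroughputConfig and the chosen values (32 KB, 20 ms, 64 MB, lz4) are illustrative assumptions, not recommendations from the author.

```java
import java.util.Properties;

// Hypothetical helper: gathers the throughput-related producer settings in one place.
final class ProducerThroughputConfig {

    // Returns producer settings tuned for throughput. All values are example assumptions.
    static Properties build() {
        Properties props = new Properties();
        // Per-partition batch size in bytes (the default is 16 KB).
        props.setProperty("batch.size", String.valueOf(32 * 1024));
        // Wait up to 20 ms for a batch to fill before sending it.
        props.setProperty("linger.ms", "20");
        // Total accumulator memory in bytes (the default is 32 MB).
        props.setProperty("buffer.memory", String.valueOf(64L * 1024 * 1024));
        // Compress whole batches to cut network and disk usage.
        props.setProperty("compression.type", "lz4");
        return props;
    }

    public static void main(String[] args) {
        build().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In a real application these properties would be passed to the producer constructor together with bootstrap.servers and the key and value serializers, which are omitted here.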

2. Preventing Message Loss

2.1 Producer‑side loss

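Because the producer sends asynchronously, a call to producer.send(msg) returns immediately without guaranteeing delivery, and a failed send goes unnoticed unless the returned Future is checked. The solution is to always use the callback‑enabled API producer.send(msg, callback) and handle a non‑null exception in the callback, for example by logging, alerting, or resending.

Setting retries to a value greater than 0 lets the producer automatically retry transient failures such as brief network errors. Clients since Kafka 2.1 already default retries to Integer.MAX_VALUE and bound the total retry time with delivery.timeout.ms.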

2.2 Broker‑side loss

Set acks=all on the producer so that the leader responds only after every replica in the ISR (in‑sync replica set) has written the message.

Configure the broker with unclean.leader.election.enable=false to prevent an out‑of‑sync replica from becoming leader and discarding messages it never received.

Use replication.factor>=3 and min.insync.replicas>1 so that every acknowledged message exists on more than one broker. The recommended relationship is replication.factor = min.insync.replicas + 1: if the two values were equal, the loss of a single replica would make the partition reject acks=all writes.
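The relationship between replication.factor and min.insync.replicas can be checked with simple arithmetic. The sketch below is an illustration under assumptions: the class DurabilityCheck and its method names are hypothetical, and the retries value is an example only.

```java
import java.util.Properties;

// Hypothetical helper illustrating the replication arithmetic for acks=all writes.
final class DurabilityCheck {

    // Number of replicas that can be unavailable while acks=all writes are still accepted.
    static int tolerableReplicaFailures(int replicationFactor, int minInsyncReplicas) {
        if (minInsyncReplicas < 1 || minInsyncReplicas > replicationFactor) {
            throw new IllegalArgumentException(
                "expected 1 <= min.insync.replicas <= replication.factor");
        }
        return replicationFactor - minInsyncReplicas;
    }

    // Producer-side reliability settings named in this section (retries value is an example).
    static Properties producerProps() {
        Properties props = new Properties();
        props.setProperty("acks", "all");
        props.setProperty("retries", "3");
        return props;
    }

    public static void main(String[] args) {
        // replication.factor = 3, min.insync.replicas = 2: one replica may fail.
        System.out.println(tolerableReplicaFailures(3, 2));
        // replication.factor = 3, min.insync.replicas = 3: no replica may fail.
        System.out.println(tolerableReplicaFailures(3, 3));
    }
}
```

With replication.factor = 3 and min.insync.replicas = 2, one replica can be down and writes still succeed, which is the balance the recommended relationship aims for.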

2.3 Consumer‑side loss

The consumer tracks its position with an offset. If an offset is committed before the corresponding messages have been processed and processing then fails, the consumer resumes after the committed offset and those messages are never processed. The author advises disabling automatic offset commits (enable.auto.commit=false) and committing manually after processing.

Manual commits can be asynchronous (commitAsync()) during normal processing to avoid blocking, and synchronous (commitSync()) at final shutdown. Example code:

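try {
    while (true) {
        // kafkaConsumer is a KafkaConsumer<String, String> configured with enable.auto.commit=false
        ConsumerRecords<String, String> records = kafkaConsumer.poll(Duration.ofSeconds(1));
        process(records); // handle messages before committing their offsets
        kafkaConsumer.commitAsync(); // non-blocking commit; a failed attempt is not retried
    }
} catch (Exception e) {
    handle(e); // must deal with records from the last poll() that were not processed
} finally {
    try {
        kafkaConsumer.commitSync(); // final blocking commit of the last poll() position
    } finally {
        kafkaConsumer.close();
    }
}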

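commitSync() throws a CommitFailedException if the group rebalances between poll and commit and the partitions have been reassigned to another consumer, typically because processing took longer than max.poll.interval.ms.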

3. Avoiding Duplicate Consumption

Enable producer idempotence with enable.idempotence=true to prevent duplicate sends caused by retries.

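On the consumer side, commit an offset only after the corresponding messages have been processed. Because a failure between processing and commit still causes redelivery, make the handling idempotent, for example by deduplicating on a unique message key or guarding the operation with a distributed lock.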
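Deduplicating on a unique key can be sketched as follows. This is a simplified illustration, not the author's implementation: the class IdempotentHandler is hypothetical, and the in-memory set stands in for a durable store such as a database unique constraint, which a real deployment would need so that the record of processed keys survives restarts.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical idempotent handler: a message with an already-seen ID is skipped.
final class IdempotentHandler {

    // Assumption: in production this would be a durable store, not an in-memory set.
    private final Set<String> processedIds = new HashSet<>();
    private int sideEffects = 0;

    // Returns true if the message was processed, false if it was skipped as a duplicate.
    boolean handle(String messageId, String payload) {
        if (!processedIds.add(messageId)) {
            return false; // duplicate delivery, business logic is not run again
        }
        sideEffects++; // stand-in for the real business logic applied to payload
        return true;
    }

    int sideEffects() {
        return sideEffects;
    }

    public static void main(String[] args) {
        IdempotentHandler handler = new IdempotentHandler();
        System.out.println(handler.handle("order-1001", "paid")); // first delivery
        System.out.println(handler.handle("order-1001", "paid")); // redelivery of the same message
        System.out.println(handler.sideEffects());
    }
}
```

The key design point is that the uniqueness check and the business effect should be atomic in the real store, otherwise a crash between the two reintroduces the duplicate.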

4. Handling Message Backlog

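If consumer throughput is insufficient, increase the number of partitions for the topic and match the number of consumers in the group to the partition count; consumers beyond the partition count stay idle.

If downstream processing cannot keep up because each poll returns too few records relative to the time spent per batch, raise max.poll.records, the maximum number of records returned per poll (e.g., from the default 500 to 1000), to keep consumption speed ahead of production.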
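Whether a backlog shrinks at all depends on the gap between consumption and production rates. The sketch below works through that arithmetic; the class BacklogEstimate, its method names, and the example rates are hypothetical.

```java
// Hypothetical helper for rough backlog planning.
final class BacklogEstimate {

    // Seconds needed to drain the backlog, rounded up.
    // Returns -1 when consumption is not faster than production, so the backlog never shrinks.
    static long secondsToDrain(long backlogMessages, long consumePerSec, long producePerSec) {
        long netPerSec = consumePerSec - producePerSec;
        if (netPerSec <= 0) {
            return -1;
        }
        return (backlogMessages + netPerSec - 1) / netPerSec;
    }

    // Minimum number of consumers needed to match the production rate, rounded up.
    static long consumersNeeded(long producePerSec, long perConsumerPerSec) {
        return (producePerSec + perConsumerPerSec - 1) / perConsumerPerSec;
    }

    public static void main(String[] args) {
        // 1,000,000 queued messages, consuming 5,000/s while 3,000/s keep arriving.
        System.out.println(secondsToDrain(1_000_000L, 5_000L, 3_000L));
        // 3,000 messages/s produced, each consumer handles 800/s.
        System.out.println(consumersNeeded(3_000L, 800L));
    }
}
```

The consumer count from this estimate is only useful up to the partition count of the topic, which is why adding partitions comes first.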

5. Ensuring Message Order

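Producer: avoid acks=0, disable retries, use synchronous sends, and wait for each send to succeed before sending the next. Retries matter because a retried batch can land behind a later batch that is already in flight; if retries must stay on, setting max.in.flight.requests.per.connection=1 or enable.idempotence=true avoids that reordering.

Consumer: within a consumer group, each partition is consumed by a single consumer instance, which preserves order within that partition but sacrifices some throughput.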
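Because ordering holds only within a partition, messages that must stay in order need to land in the same partition, which is usually achieved by giving them the same key. The sketch below illustrates the idea with a simplified partition function: Kafka's default partitioner hashes the serialized key with murmur2, whereas this example uses String.hashCode() as a stand-in. The class KeyedOrdering and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of key-based routing and per-partition ordering.
final class KeyedOrdering {

    // Simplified stand-in for the default partitioner: same key, same partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Appends each {key, value} event to the log of its partition, in send order.
    static Map<Integer, List<String>> route(List<String[]> events, int numPartitions) {
        Map<Integer, List<String>> partitions = new TreeMap<>();
        for (String[] event : events) {
            int partition = partitionFor(event[0], numPartitions);
            partitions.computeIfAbsent(partition, ignored -> new ArrayList<>())
                      .add(event[0] + ":" + event[1]);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String[]> events = List.of(
            new String[] {"order-42", "created"},
            new String[] {"order-7", "created"},
            new String[] {"order-42", "paid"},
            new String[] {"order-42", "shipped"});
        route(events, 6).forEach((partition, log) ->
            System.out.println("partition " + partition + " -> " + log));
    }
}
```

All events for order-42 end up in one partition in the order they were sent, while events for other keys may be spread across other partitions and interleave freely.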

6. Understanding Rebalance

Rebalance is the process by which all consumers in a group agree on partition assignment. Under the classic eager protocol, no consumer in the group can read messages while a rebalance is in progress, which reduces TPS.

Triggers include:

Increase in the number of partitions for a subscribed topic.

Change in the set of topics a group subscribes to.

Change in the number of consumer instances (scale‑out or unexpected failure).

The coordinator deems a consumer dead if it does not send heartbeats within session.timeout.ms (default 45 s since Kafka 3.0, 10 s in earlier versions). The consumer also has max.poll.interval.ms (default 5 min) that limits the time between successive poll() calls; exceeding it forces the consumer to leave the group and triggers a rebalance.
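A quick way to judge whether a consumer risks exceeding max.poll.interval.ms is to multiply the batch size by the per-record processing time. The sketch below is a rough estimate under assumptions: the class PollBudget is hypothetical, and it assumes records are processed one at a time with a roughly constant cost.

```java
import java.util.Properties;

// Hypothetical helper for checking a poll loop against max.poll.interval.ms.
final class PollBudget {

    // True when a full batch can be processed before the poll interval expires.
    static boolean fitsPollInterval(int maxPollRecords, long perRecordMillis, long maxPollIntervalMs) {
        return (long) maxPollRecords * perRecordMillis < maxPollIntervalMs;
    }

    // Consumer settings involved in the check, shown with their default values.
    static Properties consumerProps() {
        Properties props = new Properties();
        props.setProperty("max.poll.interval.ms", "300000"); // 5 minutes
        props.setProperty("max.poll.records", "500");
        return props;
    }

    public static void main(String[] args) {
        // 500 records at 700 ms each take 350 s, longer than the 300 s interval.
        System.out.println(fitsPollInterval(500, 700L, 300_000L));
        // 500 records at 200 ms each take 100 s, well within the interval.
        System.out.println(fitsPollInterval(500, 200L, 300_000L));
    }
}
```

When the check fails, the usual options are to lower max.poll.records, raise max.poll.interval.ms, or speed up per-record processing.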

7. Why Kafka Is So Fast

Message partitioning – distributes load across many brokers.

Sequential disk I/O – leverages append‑only logs for fast reads/writes.

Page cache – keeps recent data in memory, turning disk access into memory access.

Zero‑copy – reduces context switches and data copying.

Message compression – lowers disk and network I/O.

Batch sending – aggregates messages to reduce network overhead.
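The zero‑copy item in the list above maps to a JDK call: FileChannel.transferTo, which lets the operating system move bytes between channels without passing them through an application buffer where the platform supports it. The sketch below is a self-contained illustration that copies one temporary file to another; a broker would transfer from a log segment to a network socket instead. The class ZeroCopyDemo and the file names are hypothetical.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo of FileChannel.transferTo, the JDK entry point for zero-copy transfer.
final class ZeroCopyDemo {

    // Transfers the whole source file into the target and returns the number of bytes moved.
    static long transfer(Path source, Path target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target,
                 StandardOpenOption.CREATE,
                 StandardOpenOption.WRITE,
                 StandardOpenOption.TRUNCATE_EXISTING)) {
            long size = in.size();
            long position = 0;
            // transferTo may move fewer bytes than requested, so loop until done.
            while (position < size) {
                position += in.transferTo(position, size - position, out);
            }
            return position;
        }
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("segment", ".log");
        Path target = Files.createTempFile("copy", ".log");
        try {
            Files.write(source, "hello kafka".getBytes(StandardCharsets.UTF_8));
            System.out.println(transfer(source, target) + " bytes transferred");
        } finally {
            Files.deleteIfExists(source);
            Files.deleteIfExists(target);
        }
    }
}
```

Whether the transfer actually avoids user-space copies depends on the operating system and the channel types involved, so this shows the API shape rather than a guaranteed performance result.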
