Avoid Kafka Pitfalls: Deep Production Lessons on Throughput, Reliability, and Scaling
This article shares hands-on production experience with Kafka: tuning producer throughput, preventing message loss, controlling duplicate consumption, handling backlog, maintaining ordering, and understanding rebalance. It closes with the reasons Kafka delivers such high performance.
Background
Integrating Kafka has become standard for high‑concurrency systems, but moving from a basic "it works" setup to a stable, high‑efficiency deployment involves many pitfalls. The author draws on multiple production projects to discuss message‑loss prevention, duplicate‑consumption control, performance‑bottleneck optimization, cluster‑operation strategies, and design considerations for topics, partitions, and replicas.
1. Boosting Producer Throughput
The producer sends messages to the broker using two threads: the main thread and a Sender thread. The main thread appends records to the RecordAccumulator, which keeps a double-ended queue of batches for each partition, and the Sender thread continuously drains ready batches from the accumulator and forwards them to the broker.
Four key producer parameters can be tuned:
batch.size – maximum size in bytes of a single batch of records destined for one partition (default 16 KB). Increasing it raises throughput but may add latency.
linger.ms – time the sender waits for a batch to fill before sending it (default 0 ms; Kafka 4.0 raised the default to 5 ms). In production, a value between 5 ms and 100 ms is recommended.
buffer.memory – total size of the RecordAccumulator buffer (default 32 MB). Raising it expands buffering capacity.
compression.type – compression algorithm for outgoing data (none, gzip, snappy, lz4, zstd). Using compression reduces network and storage load.
The author likens these settings to a shuttle service: waiting for enough passengers before departing (batch.size & linger.ms) and using larger vehicles (buffer.memory) or compact seating (compression) to move more people efficiently.
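The four parameters above could be set together as in the following minimal sketch. It uses only java.util.Properties so it runs without the Kafka client on the classpath; the broker address and every value are illustrative assumptions, not the author's settings, and should be tuned against real traffic.

```java
import java.util.Properties;

/** Throughput-oriented producer settings; all values are illustrative assumptions. */
final class ProducerThroughputConfig {

    static Properties build() {
        Properties props = new Properties();
        // Assumed broker address, for illustration only.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Larger per-partition batches: 64 KB instead of the 16 KB default.
        props.put("batch.size", String.valueOf(64 * 1024));
        // Wait up to 20 ms for a batch to fill before sending it.
        props.put("linger.ms", "20");
        // 64 MB accumulator instead of the 32 MB default.
        props.put("buffer.memory", String.valueOf(64L * 1024 * 1024));
        // Compress batches to cut network and disk usage.
        props.put("compression.type", "lz4");
        return props;
    }

    public static void main(String[] args) {
        build().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

With the Kafka client available, the returned Properties can be passed to the KafkaProducer constructor.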
2. Preventing Message Loss
2.1 Producer‑side loss
Because the producer sends asynchronously, a call to producer.send(msg) returns immediately without guaranteeing delivery. The solution is to always use the callback‑enabled API producer.send(msg, callback) to confirm success.
Enabling retries (set to a value > 0) allows the producer to automatically retry transient network failures.
2.2 Broker‑side loss
Set acks=all so that the leader and all ISR replicas must acknowledge the write before the producer receives a response.
Configure the broker with unclean.leader.election.enable=false to prevent a lagging broker from becoming leader and causing data loss.
Use replication.factor>=3 and min.insync.replicas>1 to ensure multiple copies of each message are stored; the recommended relationship is replication.factor = min.insync.replicas + 1.
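As a rough sketch of how these durability settings fit together, the class below collects the producer-side keys and works out how many replica failures a partition can absorb while acks=all writes still succeed. The class and method names are hypothetical; min.insync.replicas and unclean.leader.election.enable are topic or broker settings and appear here only as function arguments and comments.

```java
import java.util.Properties;

/** Durability-related settings; helper names are hypothetical. */
final class DurabilitySettings {

    /** Producer-side keys discussed in sections 2.1 and 2.2. */
    static Properties producerProps() {
        Properties props = new Properties();
        // The leader waits for every in-sync replica before acknowledging.
        props.put("acks", "all");
        // Retry transient failures instead of dropping the record.
        props.put("retries", String.valueOf(Integer.MAX_VALUE));
        return props;
    }

    /**
     * Replica failures a partition can absorb while acks=all writes still succeed.
     * Below min.insync.replicas the broker rejects acks=all writes.
     */
    static int toleratedFailures(int replicationFactor, int minInsyncReplicas) {
        if (minInsyncReplicas < 1 || minInsyncReplicas > replicationFactor) {
            throw new IllegalArgumentException("need 1 <= min.insync.replicas <= replication.factor");
        }
        return replicationFactor - minInsyncReplicas;
    }

    public static void main(String[] args) {
        // replication.factor = 3, min.insync.replicas = 2: one broker may be down.
        System.out.println(toleratedFailures(3, 2));
    }
}
```

Setting min.insync.replicas equal to replication.factor leaves no headroom: a single unavailable replica blocks all acks=all writes, which is why the recommended relationship keeps one replica to spare.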
2.3 Consumer‑side loss
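The consumer tracks its position with an offset. If the offset is committed before processing finishes and processing then fails, the consumer resumes from the committed position after a restart and the unprocessed messages are skipped. The author advises disabling automatic offset commits (enable.auto.commit=false) and committing manually once processing has succeeded.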
Manual commits can be asynchronous (commitAsync()) during normal processing to avoid blocking, with a final synchronous commit (commitSync()) at shutdown. Example code:

    try {
        while (true) {
            ConsumerRecords<String, String> records = kafkaConsumer.poll(Duration.ofSeconds(1));
            process(records);            // handle messages before committing
            kafkaConsumer.commitAsync(); // non-blocking commit; a failed commit is not retried
        }
    } catch (Exception e) {
        handle(e);
    } finally {
        try {
            kafkaConsumer.commitSync();  // final blocking commit before leaving the group
        } finally {
            kafkaConsumer.close();
        }
    }

A CommitFailedException may occur if a rebalance happens between poll and commit, because the partitions may already have been reassigned to another consumer.
3. Avoiding Duplicate Consumption
Enable producer idempotence with enable.idempotence=true to prevent duplicate sends caused by retries.
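On the consumer side, commit offsets only after processing succeeds so that the committed offset reflects what has actually been processed. A crash or rebalance between processing and commit can still cause redelivery, so make the handler idempotent, for example with unique keys or distributed locks.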
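Idempotent handling with unique keys might look like the sketch below. It assumes each message carries a unique business ID and keeps processed IDs in an in-memory set purely for illustration; a real deployment would need a durable store (for instance a database unique constraint) that survives restarts. IdempotentHandler and its methods are hypothetical names.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Skips messages whose unique ID has already been handled (in-memory sketch). */
final class IdempotentHandler {

    // Assumption: an in-memory set stands in for a durable deduplication store.
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();
    private final Consumer<String> sideEffect;

    IdempotentHandler(Consumer<String> sideEffect) {
        this.sideEffect = sideEffect;
    }

    /** Returns true if the payload was processed, false if it was a duplicate. */
    boolean handle(String messageId, String payload) {
        // add() returns false when the ID is already present.
        if (!processedIds.add(messageId)) {
            return false;
        }
        try {
            sideEffect.accept(payload);
            return true;
        } catch (RuntimeException e) {
            // Forget the ID so a redelivery can retry the failed work.
            processedIds.remove(messageId);
            throw e;
        }
    }
}
```

A redelivered message with the same ID is then skipped rather than applied twice.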
4. Handling Message Backlog
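If consumer throughput is insufficient, increase the number of partitions for the topic and match the consumer count to the partition count; consumers beyond the partition count sit idle.
If downstream processing falls behind because each poll returns too few records (records per poll divided by processing time per batch is lower than the production rate), raise max.poll.records (default 500, e.g., to 1000) so that consumption stays ahead of production. This helps most when the handler benefits from larger batches; keep the time spent on one batch below max.poll.interval.ms.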
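The backlog reasoning comes down to comparing rates. The sketch below is a back-of-the-envelope estimate under simplifying assumptions (steady rates, evenly loaded partitions, one consumer per partition); the class name and the numbers are hypothetical.

```java
/** Rough backlog arithmetic; assumes steady rates and evenly loaded partitions. */
final class BacklogEstimate {

    /** Records per second one consumer handles: batch size divided by time per batch. */
    static double consumeRate(int recordsPerPoll, double secondsPerBatch) {
        return recordsPerPoll / secondsPerBatch;
    }

    /** Seconds needed to drain the backlog, or infinity if consumers cannot keep up. */
    static double drainSeconds(long backlog, double produceRate,
                               double consumeRatePerConsumer, int consumers) {
        double net = consumeRatePerConsumer * consumers - produceRate;
        return net <= 0 ? Double.POSITIVE_INFINITY : backlog / net;
    }

    public static void main(String[] args) {
        double perConsumer = consumeRate(500, 0.5); // 500 records every 0.5 s
        System.out.println(drainSeconds(1_000_000L, 3000, perConsumer, 4));
    }
}
```

If the net rate is zero or negative, the backlog never shrinks, and the only remedies are more partitions and consumers or faster processing per batch.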
5. Ensuring Message Order
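Producer: avoid acks=0 and make sure a failed send cannot be overtaken by a later one. The author's approach is to disable retries, use synchronous sends, and wait for each send to succeed before sending the next. An alternative that keeps retries is enable.idempotence=true, which preserves per-partition order as long as max.in.flight.requests.per.connection is 5 or less.
Consumer: within a consumer group, each partition is consumed by a single consumer instance, preserving order within that partition, though this sacrifices some throughput.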
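Because order holds only within a partition, messages that must stay in sequence need to land on the same partition, which is typically achieved by giving them the same key. The sketch below illustrates the idea only: Kafka's default partitioner hashes the serialized key bytes with murmur2, whereas this hypothetical helper uses String.hashCode() as a stand-in.

```java
/** Illustrates key-based partition selection; not Kafka's actual hash function. */
final class KeyPartitioning {

    /** Maps a key to a partition in the range [0, numPartitions). */
    static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        // Every event for the same (assumed) order ID goes to the same partition.
        System.out.println(partitionFor("order-1001", 6) == partitionFor("order-1001", 6));
    }
}
```

Note that increasing the partition count changes the mapping, so keys may move to a different partition afterwards.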
6. Understanding Rebalance
Rebalance is the process by which all consumers in a group agree on partition assignment. With eager rebalancing (the classic behavior), every consumer stops reading until the new assignment is complete, which impacts TPS.
Triggers include:
Increase in the number of partitions for a subscribed topic.
Change in the set of topics a group subscribes to.
Change in the number of consumer instances (scale‑out or unexpected failure).
The coordinator deems a consumer dead if it does not send heartbeats within session.timeout.ms (default 45 s since Kafka 3.0, 10 s in earlier versions). The consumer also has max.poll.interval.ms (default 5 min), which limits the time between successive poll() calls; exceeding it forces the consumer to leave the group and triggers a rebalance.
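A consumer configuration aimed at avoiding unnecessary rebalances might be sketched as follows. The values are illustrative assumptions. heartbeat.interval.ms is not discussed above; it is included because Kafka's documentation suggests keeping it at no more than one third of session.timeout.ms, and the hypothetical heartbeatIsSane helper checks that rule.

```java
import java.util.Properties;

/** Consumer liveness settings; values are illustrative assumptions. */
final class ConsumerLivenessConfig {

    static Properties build() {
        Properties props = new Properties();
        // Consumer is declared dead after 45 s without heartbeats.
        props.put("session.timeout.ms", "45000");
        // Heartbeats are sent from a background thread at this interval.
        props.put("heartbeat.interval.ms", "15000");
        // Maximum gap between two poll() calls before the consumer leaves the group.
        props.put("max.poll.interval.ms", "300000");
        // Smaller batches make it easier to finish processing within that gap.
        props.put("max.poll.records", "200");
        // Offsets are committed manually after processing.
        props.put("enable.auto.commit", "false");
        return props;
    }

    /** True when the heartbeat interval is at most one third of the session timeout. */
    static boolean heartbeatIsSane(Properties props) {
        long session = Long.parseLong(props.getProperty("session.timeout.ms"));
        long heartbeat = Long.parseLong(props.getProperty("heartbeat.interval.ms"));
        return heartbeat * 3 <= session;
    }

    public static void main(String[] args) {
        System.out.println(heartbeatIsSane(build()));
    }
}
```

If processing one batch can take longer than max.poll.interval.ms, either lower max.poll.records or raise the interval; otherwise a healthy but slow consumer is evicted and the group rebalances.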
7. Why Kafka Is So Fast
Message partitioning – distributes load across many brokers.
Sequential disk I/O – leverages append‑only logs for fast reads/writes.
Page cache – keeps recent data in memory, turning disk access into memory access.
Zero‑copy – reduces context switches and data copying.
Message compression – lowers disk and network I/O.
Batch sending – aggregates messages to reduce network overhead.
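Zero-copy can be illustrated with FileChannel.transferTo, the JDK call that lets the operating system move file bytes to a channel without staging them in an application buffer. The sketch below is a standalone illustration, not Kafka's code; the class and method names are hypothetical. The demo writes to an in-memory sink so it runs anywhere, in which case the JDK falls back to ordinary copying; the kernel-level fast path applies when the target is a socket channel.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Sends a file to a channel with transferTo; hypothetical illustration. */
final class ZeroCopyTransfer {

    /** Transfers the whole file to the (blocking) target and returns the byte count. */
    static long send(Path source, WritableByteChannel target) {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
            long size = in.size();
            long position = 0;
            while (position < size) {
                // transferTo may move fewer bytes than requested, so loop.
                position += in.transferTo(position, size - position, target);
            }
            return position;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Round-trips text through a temporary file standing in for a log segment. */
    static String demo(String text) {
        try {
            Path tmp = Files.createTempFile("segment", ".log");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                ByteArrayOutputStream sink = new ByteArrayOutputStream();
                send(tmp, Channels.newChannel(sink));
                return new String(sink.toByteArray(), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("hello kafka"));
    }
}
```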
Shepherd Advanced Notes
Dedicated to sharing advanced Java technical insights, daily work snippets, and the power of persistent effort.