Why Kafka Messages Get Lost and How to Prevent It
This article explains the three places where Kafka can lose messages—broker, producer, and consumer—detailing the underlying causes such as page cache flushing, acknowledgment settings, buffer handling, and offset commits, and provides practical configuration and design strategies to minimize data loss and improve reliability.
Kafka can lose messages in three components: Broker, Producer, and Consumer.
Broker
Message loss at the broker occurs because Kafka stores data asynchronously in the Linux page cache for performance. Data in the page cache is flushed to disk based on time, size thresholds, or explicit sync/fsync calls. If the system crashes before flushing, the data is lost.
The flushing triggers are:
Explicitly calling sync or fsync
Low available memory
Dirty data reaching a time threshold
Kafka does not provide a synchronous flush mechanism; RocketMQ implements it by blocking until the flush completes:
GroupCommitRequest request = new GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes());
service.putRequest(request);
boolean flushOK = request.waitForFlush(this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout()); // flushTo reduce loss, adjust flush parameters such as interval and batch size, balancing performance and reliability.
The number of acknowledgments the producer requires the leader to have received before considering a request complete. Settings: acks=0 – No acknowledgment, highest throughput, no guarantee. acks=1 – Leader writes to its log and acknowledges without waiting for followers; loss can occur if the leader fails before replication. acks=all (or acks=-1 ) – Leader waits for all in‑sync replicas to acknowledge, providing the strongest durability guarantee.
Kafka also uses the min.insync.replicas setting to require a minimum number of in‑sync replicas; without it, the effective guarantee falls back to acks=1.
Producer
Producer loss happens when buffered messages are not persisted before a crash or when the buffer overflows. Kafka batches requests in memory to improve throughput, but if the process stops abruptly, the buffered data is lost.
Mitigation strategies include:
Switching from asynchronous to synchronous sends or limiting thread pool size to control production rate.
Increasing the buffer size (though this does not eliminate loss).
Writing messages to local storage (e.g., a database or file) before sending, adding an extra durable layer.
Consumer
Consumer loss can occur during message processing and offset committing. Automatic offset commits happen at fixed intervals, which may commit offsets before processing succeeds, leading to loss if processing fails.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
while (true) {
ConsumerRecords<String, String> records = consumer.poll(100);
for (ConsumerRecord<String, String> record : records)
insertIntoDB(record); // may exceed commit interval
}Manual offset commits ensure that offsets are only advanced after successful processing, providing at‑least‑once delivery but potentially causing duplicate consumption.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
props.put("enable.auto.commit", "false");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
final int minBatchSize = 200;
List<ConsumerRecord<String, String>> buffer = new ArrayList<>();
while (true) {
ConsumerRecords<String, String> records = consumer.poll(100);
for (ConsumerRecord<String, String> record : records) {
buffer.add(record);
}
if (buffer.size() >= minBatchSize) {
insertIntoDb(buffer);
consumer.commitSync();
buffer.clear();
}
}Low‑level APIs allow explicit offset control for even finer handling.
In summary, Kafka’s reliability depends on configuring flush policies, acknowledgment levels, buffer sizes, and commit strategies; higher reliability typically reduces throughput.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
