
Preventing Data Loss in Kafka: Message Semantics, Failure Scenarios, and Reliability Solutions

This article explains Kafka's message delivery semantics, analyzes potential data‑loss scenarios across producer, broker, and consumer components, and provides concrete configuration and coding practices—such as idempotent producers, proper ACK settings, replication factors, and manual offset commits—to maximize message durability and reliability.

Wukong Talks Architecture

01 Overview

More and more Internet companies rely on message queues to support core business, so they need the strongest possible guarantee that messages are not lost in transit; data loss means user complaints for the company and performance-review penalties for the team.

Will Kafka lose data? If it does, how can we solve the problem? Understanding these issues helps us handle production‑grade Kafka failures and provide more stable services.

02 Message Delivery Semantics

Before diving into loss scenarios, we need to understand the concept of "message delivery semantics" in Kafka, which provides guarantees between producers and consumers. There are three semantics:

1) When a producer sends data to a broker and the commit succeeds, the replica mechanism ensures the message is not lost. However, if the network fails after the producer sends the data, the producer cannot tell whether the message was committed; resending to be safe yields at-least-once semantics.

2) Before Kafka 0.11.0.0, a producer that did not receive a commit response would resend the message, potentially writing duplicate log entries. Since 0.11.0.0, the producer supports idempotent delivery (enable.idempotence=true), which deduplicates retried sends within a partition; combined with transactions, this enables exactly-once semantics across multiple partitions.

3) On the consumer side, the offset is maintained by the consumer itself. If the consumer commits the offset first and then crashes before processing, a new consumer resumes from the committed offset and the unprocessed message is skipped, giving at-most-once semantics (possible loss, no duplication).

4) If the consumer commits the offset only after processing the message and crashes in between, the new consumer re-processes the same message, giving at-least-once semantics (no loss, possible duplication).

In summary, Kafka provides at‑least‑once semantics by default; exactly‑once requires additional configuration.
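As a minimal sketch of the producer side of that upgrade (the property names are the standard kafka-clients ones; the transactional id is a hypothetical application-chosen value), exactly-once is largely a matter of configuration:

```java
import java.util.Properties;

class ExactlyOnceProducerConfig {
    // Setting a transactional.id turns on transactions; idempotence is
    // required (and implied) whenever transactions are in use.
    static Properties build(String transactionalId) {
        Properties props = new Properties();
        props.put("enable.idempotence", "true");        // dedupe retried sends per partition
        props.put("transactional.id", transactionalId); // atomic writes across partitions
        props.put("acks", "all");                       // required with idempotence
        return props;
    }
}
```

The application would then wrap related sends in `beginTransaction()`/`commitTransaction()` calls on the producer.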

03 Message Loss Scenarios

Producer Side

The producer interacts directly with the leader partition. Loss can occur due to network jitter, oversized messages, or fire-and-forget sends (no callback, or acks=0), where failures go unnoticed. The fixes are to use callbacks, enable idempotence, and tighten the ACK settings.

Network issues: jitter may prevent the broker from receiving the message.

Message size: messages exceeding broker limits are rejected.

For example, the `KafkaProducer.send` overload that accepts a `Callback` (from the Kafka client source) lets the producer react to send failures instead of ignoring them:

```java
public Future<RecordMetadata> send(ProducerRecord<K, V> record, Callback callback) {
    // intercept the record, which can be potentially modified; this method does not throw exceptions
    ProducerRecord<K, V> interceptedRecord = this.interceptors == null ? record : this.interceptors.onSend(record);
    return doSend(interceptedRecord, callback);
}
```

Key producer settings:

Set acks=-1 (or acks=all) so that all in-sync replicas must acknowledge the write.

Enable idempotence: enable.idempotence=true.

Increase retries (e.g., Integer.MAX_VALUE) and set retry.backoff.ms (300 ms is a reasonable value) to ride out transient failures.
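These settings can be collected into a single producer configuration sketch. The property names are the standard kafka-clients ones; the max.request.size value is an illustrative assumption and must stay at or below the broker's message.max.bytes:

```java
import java.util.Properties;

class ReliableProducerConfig {
    // Loss-prevention settings from the list above.
    static Properties build() {
        Properties props = new Properties();
        props.put("acks", "all");                                // all in-sync replicas must ack
        props.put("enable.idempotence", "true");                 // suppress duplicates on retry
        props.put("retries", String.valueOf(Integer.MAX_VALUE)); // keep retrying transient errors
        props.put("retry.backoff.ms", "300");                    // recommended backoff
        props.put("max.request.size", "1048576");                // assumed 1 MB producer-side cap
        return props;
    }
}
```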

Broker Side

The broker stores data in the OS page cache and flushes it to disk asynchronously. If a broker crashes before the cache is flushed, that data is lost. Kafka does not flush synchronously on every write, so loss is always possible at the single-broker level.

Reliability is achieved through multi‑partition, multi‑replica design. Important broker configurations:

unclean.leader.election.enable=false – prevents a lagging follower from becoming leader.

replication.factor>=3 – ensures at least two replicas remain after a leader failure.

min.insync.replicas>1 – requires more than one replica to acknowledge before a write is considered committed.
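Taken together, a broker-side fragment of server.properties reflecting these recommendations might look like the sketch below. Note that the replication factor itself is set per topic at creation time; default.replication.factor only covers auto-created topics:

```properties
# Never elect an out-of-sync follower as leader
unclean.leader.election.enable=false
# Applies to auto-created topics; explicit topics should pass --replication-factor 3
default.replication.factor=3
# A write is committed only once two in-sync replicas have it
min.insync.replicas=2
```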

Consumer Side

Consumers pull messages and then commit offsets. Data loss happens when offsets are committed before processing (the default auto-commit behavior) and the application crashes mid-processing; conversely, committing offsets only after processing turns a crash into duplicate processing rather than loss.

Recommended consumer settings:

Disable automatic offset commits: enable.auto.commit=false and commit offsets manually after successful processing.

Implement idempotent business logic to handle possible duplicate processing.
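The difference between the two commit orderings can be shown without a running broker. Below is a minimal in-memory sketch (the names are hypothetical stand-ins, not Kafka APIs): committing only after processing means a crash between the two steps causes re-delivery, not loss.

```java
import java.util.List;

class CommitOrdering {
    // At-least-once ordering: process first, commit second.
    // If the application crashes after process() but before the commit,
    // the record is delivered again on restart -- duplicated, never lost.
    // (Committing before processing would invert this: a crash skips the
    // record entirely -- at-most-once.)
    static long processThenCommit(List<String> partition, long startOffset) {
        long committed = startOffset;
        for (long offset = startOffset; offset < partition.size(); offset++) {
            process(partition.get((int) offset));
            committed = offset + 1; // manual commit, only after success
        }
        return committed;
    }

    static void process(String record) {
        System.out.println("processed: " + record);
    }
}
```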

04 Solutions

Producer Solutions

Replace fire‑and‑forget calls with Producer.send(record, callback) so failures surface in the callback. Use acks=-1, enable idempotence, set a high retries value, and tune retry.backoff.ms. On the broker side, ensure replication.factor>=3 and min.insync.replicas>1 so an acknowledged write survives a leader failure.

Broker Solutions

Configure unclean.leader.election.enable=false , increase replication.factor (recommended ≥3), and set min.insync.replicas>1 . Ensure replication.factor > min.insync.replicas (e.g., replication.factor = min.insync.replicas + 1 ) to maximize availability.

Consumer Solutions

Use manual offset commits after processing (enable.auto.commit=false). Ensure business logic is idempotent to tolerate the duplicate deliveries this ordering can produce.
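A simple way to make the handler idempotent is to track already-processed message ids. The sketch below uses an in-memory set purely for illustration; a real deployment would use durable storage such as a database unique constraint, and all names here are hypothetical:

```java
import java.util.HashSet;
import java.util.Set;

class IdempotentHandler {
    private final Set<String> seen = new HashSet<>(); // stand-in for a durable dedup store

    // Applies the side effect at most once per message id.
    // Returns true if applied, false if the id was a duplicate.
    boolean handle(String messageId, Runnable businessLogic) {
        if (!seen.add(messageId)) {
            return false; // already processed: skip the duplicate delivery
        }
        businessLogic.run();
        return true;
    }
}
```

With this guard in place, at-least-once delivery from Kafka still yields effectively-once side effects in the application.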

05 Summary

This article covered three main points: (1) an overview of where data loss can occur in Kafka's architecture; (2) a deep dive into message delivery semantics, clarifying that only "committed" messages enjoy the strongest durability guarantees; (3) detailed failure scenarios for producer, broker, and consumer components together with practical configuration and coding recommendations to achieve high reliability.

Tags: Kafka, Reliability, Consumer, Broker, Producer, Data Loss, Message Semantics
Written by

Wukong Talks Architecture

Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.
