How to Prevent Kafka Message Loss in Critical Transaction Systems
This article explains why Kafka can lose messages at the producer, broker, and consumer stages, analyzes root causes such as asynchronous batch sends, JVM crashes, and network failures, and provides practical solutions including callbacks, retry mechanisms, replication settings, and manual offset commits to ensure reliable delivery.
If you list Kafka experience on your résumé, interviewers will likely ask, "How do you guarantee no message loss?" This article examines why Kafka can lose messages and how to prevent it in strict transaction scenarios.
Why Kafka Still Loses Messages
Even though Kafka is powerful, beginners often wonder why messages disappear. The core workflow involves producers, brokers, and consumers, each with potential loss points.
Producer-side Loss and Solutions
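The producer sends messages asynchronously: send() appends the record to an in-memory batch and returns, and a background sender (I/O) thread transmits the batches to the broker. The calling thread therefore returns before delivery is confirmed. Registering a callback lets the sender thread report success or failure, so the application can retry. However, in extreme cases, such as a JVM crash, a long GC pause, or a network failure, the callback may never run or may run too late, so callbacks alone cannot guarantee delivery.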
To handle such cases, implement an external mechanism:
Record each message with a "pending" status in a durable store (e.g., a database).
Use the callback to mark the record as "sent" when the broker acknowledges.
Run a periodic task (e.g., every 5 minutes) to scan for unsent or failed records and resend them.
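This ensures that the send status survives a JVM or network failure and that the message is eventually delivered. Because a resend can repeat a message whose acknowledgment was lost, downstream consumers should tolerate duplicates.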
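The three steps above can be sketched as follows. This is a minimal in-memory illustration, not a production implementation: the SendLog class, its Status enum, and the Transport interface are hypothetical names introduced here. Transport stands in for a Kafka producer's asynchronous send with a callback, and the two maps stand in for a durable database table.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Hypothetical send log; a real system would persist these records in a database. */
class SendLog {
    enum Status { PENDING, SENT }

    /** Hypothetical async transport: calls onDone with null on success, or the error on failure. */
    interface Transport {
        void send(String id, String payload, Consumer<Exception> onDone);
    }

    private final Map<String, String> payloads = new ConcurrentHashMap<>();
    private final Map<String, Status> statuses = new ConcurrentHashMap<>();
    private final Transport transport;

    SendLog(Transport transport) {
        this.transport = transport;
    }

    /** Step 1: record the message as PENDING before attempting the send. */
    void submit(String id, String payload) {
        payloads.put(id, payload);
        statuses.put(id, Status.PENDING);
        trySend(id);
    }

    /** Step 3: intended to be called periodically; resends every record still PENDING. */
    void resendPending() {
        statuses.forEach((id, status) -> {
            if (status == Status.PENDING) {
                trySend(id);
            }
        });
    }

    Status statusOf(String id) {
        return statuses.get(id);
    }

    /** Step 2: the callback marks the record SENT only when the send was acknowledged. */
    private void trySend(String id) {
        transport.send(id, payloads.get(id), error -> {
            if (error == null) {
                statuses.put(id, Status.SENT);
            }
            // On failure the record stays PENDING and is picked up by the next scan.
        });
    }
}
```

In a real service, resendPending() would typically be driven by a scheduler such as ScheduledExecutorService.scheduleAtFixedRate, using the scan interval chosen for the system.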
Broker-side Loss and Solutions
After a broker acknowledges a producer, the message may still reside only in the OS page cache, waiting to be flushed to disk. If only the broker process crashes, the page cache survives and the OS still flushes the data; if the OS or the machine crashes before the flush, the unflushed data on that broker is lost.
Kafka mitigates this with replication:
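Set acks=all on the producer so the leader acknowledges a write only after all in-sync replicas (not merely a majority) have received it.
Configure min.insync.replicas ≥ 2 on the broker or topic so that a write with acks=all succeeds only when at least two in-sync replicas, counting the leader, have it.
Set the replication factor to 3 or more for each topic (default.replication.factor on the broker) so that, with min.insync.replicas=2, a partition keeps accepting writes after one replica fails.
Set unclean.leader.election.enable=false to prevent out-of-sync replicas from becoming leaders.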
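The settings above might be collected in code roughly as follows. This sketch only builds the configuration objects and does not connect to Kafka. The class and method names (DurabilitySettings, producerProps, topicConfigs) are hypothetical, as is the retries value; the string serializers assume string keys and values. The configuration keys themselves are standard Kafka names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical helper that gathers the durability-related settings in one place. */
class DurabilitySettings {
    /** Replication factor to request when the topic is created. */
    static final short REPLICATION_FACTOR = 3;

    /** Producer settings, in the form accepted by the Kafka producer constructor. */
    static Properties producerProps(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("acks", "all");  // wait for all in-sync replicas
        props.setProperty("retries", "5"); // illustrative value, tune per system
        return props;
    }

    /** Topic-level overrides to apply when creating or altering the topic. */
    static Map<String, String> topicConfigs() {
        Map<String, String> configs = new LinkedHashMap<>();
        configs.put("min.insync.replicas", "2");
        configs.put("unclean.leader.election.enable", "false");
        return configs;
    }
}
```

Keeping min.insync.replicas below the replication factor matters here: if both were 3, the loss of a single replica would block all acks=all writes.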
Consumer-side Loss and Solutions
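When enable.auto.commit=true (the default), the consumer commits offsets periodically during poll(), regardless of whether the application has finished processing the records it received. If processing fails or the JVM crashes after an offset is committed but before the message is handled (for example, when records are handed off to another thread), the committed offset skips the message, causing loss.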
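Fix this by disabling auto-commit (enable.auto.commit=false) and committing offsets manually after successful processing:
while (true) {
    // ① pull a batch of messages (java.time.Duration sets the poll timeout)
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        process(record); // application-specific handler
    }
    // ② commit offsets only after the whole batch has been processed
    consumer.commitSync();
}
This guarantees that an offset is committed only after the application has successfully handled the messages it covers. If the consumer crashes between processing and commit, those messages are redelivered after restart rather than lost.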
Summary
Producer retries and callbacks reduce loss but cannot cover JVM crashes or prolonged network failures; adding a persistent send log with scheduled retries closes that gap. On the broker side, replication with acks=all, min.insync.replicas ≥ 2, a replication factor of at least 3, and unclean leader election disabled prevents loss when a single broker fails. On the consumer side, setting enable.auto.commit=false and committing offsets manually after processing prevents messages from being skipped.
MaGe Linux Operations
