Backend Development · 12 min read

Ensuring Reliable Message Delivery with Kafka: Preventing Message Loss

This article explains how to use a message queue like Kafka to decouple systems and control traffic, identifies the three main points where message loss can occur—producer, broker, and consumer—and provides practical detection methods and configuration recommendations to guarantee reliable, loss‑free message delivery.

Top Architect

Adding a message queue (MQ) such as Kafka brings system decoupling and traffic shaping, but it also introduces a new challenge: keeping data consistent end to end.

Why use MQ? It isolates upstream and downstream changes and smooths traffic spikes, but introduces potential message loss across three stages: producer, broker, and consumer.

Three key questions: How to know a message is lost? Which parts may lose messages? How to ensure messages are never lost?

Detection methods: feedback from PM/operations, monitoring alerts (Kafka cluster anomalies, broker crashes, disk issues), and manual log inspection.

Case study: In a sentiment‑analysis pipeline, messages are pulled from Kafka, processed by an NLP named‑entity‑recognition (NER) service, and indexed into Elasticsearch. After running for a while, the pipeline stopped consuming: the logs showed repeated consumer‑group rebalances, and the NER service was returning HTTP 500 errors.

Producer side issues: send() is asynchronous and buffers messages in memory, so network glitches or misconfiguration (an unsafe acks setting, no retries) can silently drop them. Solutions include setting retries=10, using acks=all, and attaching a send callback to log and handle failures.

producer.send(new ProducerRecord<>(topic, messageKey, messageStr), (metadata, exception) -> {
    if (exception != null) {
        // the send failed: log, alert, or re-enqueue the message
    }
});
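The producer settings above can be collected in one place. A minimal sketch, assuming string keys and values; the class name ReliableProducerConfig and the localhost:9092 address are illustrative, not from the original pipeline:

```java
import java.util.Properties;

public class ReliableProducerConfig {

    // Reliability-focused producer settings from the recommendations above.
    static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("acks", "all");                // wait for all in-sync replicas to ack
        props.put("retries", "10");              // retry transient failures instead of dropping
        props.put("enable.idempotence", "true"); // dedupe retried sends (Kafka >= 0.11)
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("localhost:9092");
        System.out.println(props.getProperty("acks"));
        // With kafka-clients on the classpath, these props feed straight into:
        //   KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        //   producer.send(record, (metadata, e) -> { if (e != null) { /* log & alert */ } });
    }
}
```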

Broker side issues: Messages first land in the OS page cache, not on disk. If only the broker process crashes, the data in the page cache survives, but a machine power loss can discard it. Recommendations: use battery‑backed cache, run Kafka ≥ 0.11 (which adds leader‑epoch tracking), set replication.factor ≥ 3 and min.insync.replicas > 1, and set unclean.leader.election.enable=false.
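These broker‑side recommendations map onto a few lines of server.properties. A sketch under the assumptions above; the exact values are the article's recommendations, not universal defaults:

```properties
# server.properties — durability-oriented broker settings
default.replication.factor=3         # each new partition is replicated to 3 brokers
min.insync.replicas=2                # acks=all waits for at least 2 replicas
unclean.leader.election.enable=false # never elect an out-of-sync replica as leader
```

With replication.factor=3 and min.insync.replicas=2, the cluster keeps accepting acks=all writes even if one replica is down, while still guaranteeing every acked message exists on at least two brokers.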

Consumer side issues: Message backlog, auto‑committing offsets before processing completes, and heartbeat/session timeouts that trigger rebalances can all cause loss. Mitigations: set enable.auto.commit=false and commit offsets manually (manual ack) after processing, increase session.timeout.ms, tune max.poll.records and max.poll.interval.ms, and upgrade the client to ≥ 0.10.2.
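The consumer mitigations above can be sketched the same way. The values and the group id sentiment-pipeline are illustrative starting points, not tuned recommendations:

```java
import java.util.Properties;

public class ReliableConsumerConfig {

    // Consumer settings matching the mitigations above.
    static Properties build(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("group.id", groupId);
        props.put("enable.auto.commit", "false");    // commit manually, only after processing
        props.put("session.timeout.ms", "30000");    // tolerate slower heartbeats
        props.put("max.poll.records", "100");        // smaller batches finish within the poll interval
        props.put("max.poll.interval.ms", "600000"); // allow long processing before a rebalance
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("localhost:9092", "sentiment-pipeline");
        System.out.println(props.getProperty("enable.auto.commit"));
        // The poll loop then acknowledges only after the work succeeds:
        //   while (true) {
        //       ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        //       process(records);       // e.g. call NER, index into Elasticsearch
        //       consumer.commitSync();  // manual ack: offsets advance only on success
        //   }
    }
}
```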

Operational commands for monitoring:

# View per-partition end offsets of a topic (their sum approximates the message count)
$ ./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic test_topic
# List consumer groups
$ ./kafka-consumer-groups.sh --list --bootstrap-server 192.168.88.108:9092
# Describe consumer group offsets
$ ./kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group console-consumer-1152 --describe
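One way to act on the --describe output above is to sum its LAG column and alert when the backlog keeps growing. A hedged sketch, assuming the standard column order of the tool's output; the sum_lag helper name is ours:

```shell
# Sum the LAG column from kafka-consumer-groups.sh --describe output.
# Assumes the standard header: GROUP TOPIC PARTITION CURRENT-OFFSET
# LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID (LAG is field 6).
sum_lag() {
  awk 'NR > 1 && $6 ~ /^[0-9]+$/ { lag += $6 } END { print lag + 0 }'
}

# Usage (pipe the describe output through the helper):
# ./kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
#     --group console-consumer-1152 --describe | sum_lag
```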

Broker log‑flush configuration examples:

# Broker-level flush settings (by default Kafka leaves flushing to the OS)
log.flush.interval.messages=10000  # flush after 10,000 messages
log.flush.interval.ms=1000         # flush at least every 1 second
# Topic-level overrides
flush.ms=1000                      # flush this topic every 1 second
flush.messages=1                   # flush after every message (safest, slowest)

Checklist to achieve 100% reliability:

Producer: enable retries, set acks=all, add send callbacks.

Broker: use battery‑backed cache, run Kafka ≥ 0.11, set replication.factor ≥ 3 and min.insync.replicas > 1, disable unclean leader election.

Consumer: upgrade the client to ≥ 0.10.2, disable auto‑commit and use manual acks, increase consumption speed, and tune the poll settings.

By mastering these configurations and monitoring practices, engineers can significantly reduce the risk of message loss in Kafka‑based distributed systems.

Tags: distributed systems, monitoring, Kafka, data consistency, message queue, reliability
Written by Top Architect

Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, as well as architecture evolution with internet technologies. Idea‑driven, sharing‑oriented architects are welcome to exchange ideas and learn together.
