How to Guarantee Zero Message Loss with Kafka: Best Practices and Configurations

This article explains why introducing a message queue like Kafka helps decouple systems and control traffic, then dives into the three key questions of detecting, locating, and preventing message loss, offering concrete monitoring methods, configuration settings, and troubleshooting steps for producers, brokers, and consumers.

IT Architects Alliance

Sep 6, 2022

Introduction

Message queues (e.g., Kafka) decouple systems and shape traffic but introduce data‑consistency risks. Ensuring that no messages are lost requires detection, understanding loss points, and applying reliable configurations.

Key Questions

How to detect lost messages?

Which components can cause loss?

How to guarantee loss‑free delivery?

Detecting Message Loss

Typical signals:

Operational feedback – reports of missing data.

Monitoring alerts – Kafka metrics such as broker failures, disk errors, consumer lag, or offset gaps.
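The two offset-based signals above can be checked with simple arithmetic. The following is a minimal sketch, not a Kafka API: the class and method names are hypothetical, and it assumes the offsets have already been collected (for example from the consumer-group tooling or the consumer's own logs). Gaps can also appear legitimately on compacted or transactional topics, so a gap is a reason to investigate rather than proof of loss.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: plain arithmetic on offsets that were collected elsewhere.
final class OffsetChecks {

    /** Lag of one partition: log end offset minus the group's committed offset. */
    static long lag(long logEndOffset, long committedOffset) {
        return Math.max(0L, logEndOffset - committedOffset);
    }

    /**
     * Returns every offset missing between consecutive consumed offsets.
     * The input must be the offsets of one partition in ascending order.
     */
    static List<Long> missingOffsets(List<Long> consumedOffsets) {
        List<Long> missing = new ArrayList<>();
        for (int i = 1; i < consumedOffsets.size(); i++) {
            long previous = consumedOffsets.get(i - 1);
            long current = consumedOffsets.get(i);
            for (long offset = previous + 1; offset < current; offset++) {
                missing.add(offset);
            }
        }
        return missing;
    }
}
```

A steadily growing lag() value points at a backlog, while a non-empty missingOffsets() result on an uncompacted, non-transactional topic suggests records were skipped.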

Potential Loss Points

Message flow: producer → broker → consumer. Each stage can lose messages.

Producer Side

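send() is asynchronous: it appends records to an in-memory buffer, and a background Sender thread ships them to the broker in batches. A record can therefore still be lost after send() has returned. Common loss scenarios and mitigations:

Network instability: a request that times out leaves records undelivered in the client buffer, and they are dropped unless the client retries. Mitigation: enable retries, e.g. props.put("retries", "10").

Insufficient acknowledgments: with a weak acks setting or no callback, a failed write goes unnoticed. Mitigation: set acks=all (or acks=-1) and provide a Callback to handle failures.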

Typical acks settings:

acks=0: fire‑and‑forget, highest loss risk.

acks=1: the leader writes the record; still vulnerable if the leader fails before followers replicate it.

acks=all (or acks=-1): wait for all in‑sync replicas, safest.
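The producer settings discussed above can be collected in one place. This is a minimal sketch that only builds the java.util.Properties object, so it runs without the Kafka client on the classpath; the class name, the bootstrap address, and the choice of String serializers are assumptions for illustration.

```java
import java.util.Properties;

// Hypothetical helper that assembles the loss-averse producer settings from this section.
final class ReliableProducerConfig {

    static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // Serializer choice is an assumption; use whatever matches your record types.
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");     // wait for all in-sync replicas
        props.put("retries", "10");   // retry transient send failures
        return props;
    }
}

// Usage with the real client (requires kafka-clients on the classpath):
//   Producer<String, String> producer =
//       new KafkaProducer<>(ReliableProducerConfig.build("localhost:9092"));
//   producer.send(record, (metadata, exception) -> {
//       if (exception != null) { /* log, alert, or re-queue the record */ }
//   });
```

The callback is what turns a silent failure into a visible one: once retries are exhausted, the exception argument is the only place the application learns that a record was not written.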

Broker Side

When a broker receives a batch, it first writes it to the OS page cache and flushes it to disk asynchronously later. Risks:

If only the broker process crashes, the data is still held in the OS page cache and the kernel will flush it, so it is not lost.

If the machine loses power, any data still in the page cache is lost.

Mitigation: use UPS‑backed servers and enable replication.

Recommended replication and ISR settings (Kafka ≥ 0.11):

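A replication factor of 3 (default.replication.factor=3 on the broker, or --replication-factor 3 when creating the topic)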

min.insync.replicas=2

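Disable unclean.leader.election.enable

Flush settings: by default Kafka leaves flushing to the OS for performance and relies on replication for durability. The settings below force more frequent flushes at the cost of throughput: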

# Broker-level: flush after 10k messages
log.flush.interval.messages=10000
# Broker-level: flush at least every second
log.flush.interval.ms=1000
# Topic-level override: flush after every message
flush.messages=1
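To make the replication and ISR recommendation above concrete, the sketch below works out what a given pair of settings tolerates when producers use acks=all and unclean leader election is disabled. The class and method names are hypothetical, and the figures describe the general guarantee rather than any measured behavior.

```java
// Hypothetical helper: arithmetic on the replication factor and min.insync.replicas.
final class ReplicationMath {

    private static void validate(int replicationFactor, int minInsyncReplicas) {
        if (minInsyncReplicas < 1 || minInsyncReplicas > replicationFactor) {
            throw new IllegalArgumentException("need 1 <= min.insync.replicas <= replication factor");
        }
    }

    /** Replicas that may be unavailable while acks=all writes are still accepted. */
    static int writeTolerance(int replicationFactor, int minInsyncReplicas) {
        validate(replicationFactor, minInsyncReplicas);
        return replicationFactor - minInsyncReplicas;
    }

    /** Replicas holding an acknowledged record that may be lost while one copy survives. */
    static int dataTolerance(int replicationFactor, int minInsyncReplicas) {
        validate(replicationFactor, minInsyncReplicas);
        return minInsyncReplicas - 1;
    }
}
```

With a replication factor of 3 and min.insync.replicas=2, both values are 1: one broker can be down without blocking producers, and an acknowledged record survives the loss of one replica that held it. Setting min.insync.replicas equal to the replication factor would drop the write tolerance to 0, so any single broker outage would stop acks=all writes.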

Consumer Side

Loss scenarios and remedies:

Message backlog: partitions accumulate unprocessed records, and records that outlive the topic's retention period are deleted before they are ever read. Remedy: increase consumer throughput, e.g. process records in parallel threads.

Auto‑commit: offsets can be committed before processing finishes; a crash then skips the unprocessed records. Remedy: disable enable.auto.commit and commit offsets manually after successful processing.

Heartbeat timeout / rebalance: the client is evicted from the group and consumption stalls. Remedy: tune max.poll.records and max.poll.interval.ms, and upgrade the client to ≥ 0.10.2, where heartbeats are decoupled from poll().
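The auto-commit scenario above comes down to the order of two steps: storing the offset and finishing the work. The following is a small simulation under stated assumptions, not Kafka client code: the class is hypothetical, the partition is modeled as offsets 0 to logSize - 1, and a single crash is injected while one record is being handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of commit ordering; it does not talk to a broker.
final class CommitOrderSimulation {

    /**
     * Consumes offsets [0, logSize) one at a time, crashes once while handling
     * crashAt, then restarts from the committed offset.
     * Returns the offsets that were fully processed.
     */
    static List<Integer> run(int logSize, int crashAt, boolean commitBeforeProcessing) {
        List<Integer> processed = new ArrayList<>();
        int committed = 0;        // next offset to read after a restart
        boolean crashed = false;
        int position = 0;
        while (position < logSize) {
            if (commitBeforeProcessing) {
                committed = position + 1;   // offset stored before the work is done
            }
            if (!crashed && position == crashAt) {
                crashed = true;
                position = committed;       // restart: resume from the committed offset
                continue;
            }
            processed.add(position);
            if (!commitBeforeProcessing) {
                committed = position + 1;   // offset stored only after success
            }
            position++;
        }
        return processed;
    }
}
```

With five records and a crash at offset 2, committing first yields offsets 0, 1, 3, 4 (offset 2 is skipped), while committing after processing yields all five. In a real consumer the second ordering corresponds to setting enable.auto.commit=false and calling commitSync() once the batch has been handled; the trade-off is that a crash between processing and commit redelivers records, so handlers should tolerate duplicates.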

Case Study: Sentiment‑Analysis Pipeline

A pipeline reads Kafka messages, performs named‑entity recognition (NER) via an NLP service, and indexes the results into Elasticsearch. After running for some time, consumption stopped because of repeated rebalance events and HTTP 500 errors from the NLP service. Root cause: client version 0.10.1 coupled heartbeats with poll(), so slow NLP calls delayed the next poll(), heartbeats were missed, and the session timed out.

Remedial actions:

Increase session.timeout.ms to 25000 ms.

Introduce a circuit‑breaker (e.g., Hystrix) to pause consumption after consecutive service failures.
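The circuit-breaker idea in the second remedy can be sketched without any library. This is a hand-rolled illustration, not Hystrix: the class name, the threshold, and the cool-down are assumptions, and the clock is passed in as a parameter so the behavior is easy to test.

```java
// Hypothetical breaker: opens after N consecutive failures, closes after a cool-down.
final class ConsecutiveFailureBreaker {
    private final int failureThreshold;
    private final long coolDownMillis;
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAtMillis = 0L;

    ConsecutiveFailureBreaker(int failureThreshold, long coolDownMillis) {
        if (failureThreshold < 1 || coolDownMillis < 0) {
            throw new IllegalArgumentException("threshold must be >= 1 and cool-down >= 0");
        }
        this.failureThreshold = failureThreshold;
        this.coolDownMillis = coolDownMillis;
    }

    /** True when calls to the downstream service are currently allowed. */
    boolean allowRequest(long nowMillis) {
        if (open && nowMillis - openedAtMillis >= coolDownMillis) {
            open = false;               // cool-down elapsed: let traffic through again
            consecutiveFailures = 0;
        }
        return !open;
    }

    void recordSuccess() {
        consecutiveFailures = 0;
    }

    void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAtMillis = nowMillis;
        }
    }
}
```

In a consumer loop, one way to use this is to call consumer.pause(consumer.assignment()) while allowRequest() returns false and consumer.resume(...) once it returns true, continuing to call poll() in between so the client stays in the group without fetching new records.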

Best‑Practice Checklist

Producer: enable retries, set acks=all, provide a Callback for failures.

Broker: run Kafka ≥ 0.11, use UPS‑backed power, configure a replication factor ≥ 3 and min.insync.replicas > 1, and disable unclean.leader.election.enable.

Consumer: upgrade client to ≥ 0.10.2, disable auto‑commit, manually commit after processing, increase processing parallelism, and tune poll‑related parameters.
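The consumer items of the checklist map onto a handful of client settings. The sketch below only builds the java.util.Properties object; the class name and group id are hypothetical, session.timeout.ms uses the value from the case study, and the max.poll.* values are placeholders to be tuned against the real per-record processing time.

```java
import java.util.Properties;

// Hypothetical helper that assembles the consumer settings named in the checklist.
final class CarefulConsumerConfig {

    static Properties build(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("group.id", groupId);
        // Deserializer choice is an assumption; match it to the producer's serializers.
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");     // commit manually after processing
        props.put("session.timeout.ms", "25000");     // value used in the case study
        props.put("max.poll.records", "50");          // placeholder: smaller batches per poll()
        props.put("max.poll.interval.ms", "300000");  // placeholder: time budget for one batch
        return props;
    }
}
```

A rough sizing rule is that max.poll.records multiplied by the worst-case time to process one record should stay comfortably below max.poll.interval.ms, otherwise the consumer is evicted mid-batch and the group rebalances.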

Useful Kafka Commands

Check topic offsets and consumer group status:

# Show the latest offset of each partition (add --time -2 for the earliest; the difference approximates the message count)
./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic test_topic

# List consumer groups
./kafka-consumer-groups.sh --list --bootstrap-server localhost:9092

# Describe a consumer group
./kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group my-group --describe
