
How to Prevent Duplicate Messages in Kafka and Pulsar: A Practical Guide

This article explains the three message delivery semantics, the common causes of duplicate messages in queue systems, and presents concrete producer‑side, broker‑side, and consumer‑side deduplication techniques for Kafka and Pulsar, including code samples and best‑practice recommendations.


Three Message Delivery Semantics

At Least Once: messages are never lost and are consumed at least once, but duplicates may occur.

Exactly Once: each message is consumed precisely once, without loss or duplication.

At Most Once: messages are never duplicated, but loss may happen.

Different scenarios require different semantics; for example, Exactly Once is the hardest to achieve and usually needs transactional messaging.
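In practice, the three semantics largely follow from client configuration. As a minimal producer-side sketch (using Kafka's string config keys; the `transactional.id` value is a placeholder, and the broker addresses and serializers are omitted), the mapping might look like:

```java
import java.util.Properties;

public class DeliverySemantics {
    // At Most Once: no acknowledgement, no retries — nothing is ever
    // resent, so duplicates are impossible but loss is possible.
    public static Properties atMostOnce() {
        Properties p = new Properties();
        p.put("acks", "0");
        p.put("retries", "0");
        return p;
    }

    // At Least Once: wait for all in-sync replicas and retry on failure —
    // a lost ACK can cause the same message to be stored twice.
    public static Properties atLeastOnce() {
        Properties p = new Properties();
        p.put("acks", "all");
        p.put("retries", Integer.toString(Integer.MAX_VALUE));
        return p;
    }

    // Exactly Once (within Kafka): at-least-once settings plus
    // idempotence and transactions.
    public static Properties exactlyOnce() {
        Properties p = atLeastOnce();
        p.put("enable.idempotence", "true");
        p.put("transactional.id", "my-tx-id"); // placeholder id
        return p;
    }
}
```

The point of the sketch is that Exactly Once is not a single switch: it builds on the at-least-once settings and adds idempotence and a transactional identity on top.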

What Causes Message Duplication?

When a producer sends a message, the broker may store it successfully but fail to return an ACK; the producer assumes the send failed and retries, so the broker stores the same message twice and consumers receive it more than once.

If the consumer processes a message but the ACK back to the broker fails, the broker does not advance the offset, so the same message can be delivered again.
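This consumer-side redelivery path can be simulated without a broker. In the sketch below (plain Java, with an in-memory queue standing in for the broker and a flag standing in for the dropped ACK), the message is processed twice even though it was sent only once:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class RedeliveryDemo {
    public static List<String> consume() {
        Deque<String> queue = new ArrayDeque<>();
        queue.add("order-42");

        List<String> processed = new ArrayList<>();
        boolean ackLost = true;            // simulate the first ACK being dropped

        while (!queue.isEmpty()) {
            String msg = queue.peek();     // broker delivers without advancing the offset
            processed.add(msg);            // consumer does its work
            if (ackLost) {
                ackLost = false;           // the ACK never reaches the broker...
                continue;                  // ...so the same message is delivered again
            }
            queue.poll();                  // ACK received: offset advances
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(consume());     // prints [order-42, order-42]
    }
}
```

Since the broker cannot distinguish "processed but ACK lost" from "never processed", at-least-once delivery forces the duplicate onto the consumer.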

Producer‑Side Deduplication (Kafka Idempotent Producer)

<code>Properties props = new Properties();
// ... other configuration (bootstrap.servers, key/value serializers, etc.)
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
// Idempotence requires acks=all and retries > 0; the client enforces this.
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
</code>

Kafka achieves idempotence by assigning each producer a unique Producer ID (PID) and a monotonically increasing sequence number per partition. The broker stores the (PID, partition, sequence number) for each message; if an incoming message's sequence number does not exceed the stored one, the broker discards it as a duplicate. This guarantee holds only within a single partition (and a single producer session).
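The broker's check can be sketched as a map from (PID, partition) to the highest sequence number seen so far — a simplified model of the mechanism, not Kafka's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotenceCheck {
    // Highest sequence number seen per (producerId, partition) pair.
    private final Map<String, Long> lastSeq = new HashMap<>();

    // Returns true if the message is new and should be stored,
    // false if it is a retry of an already-stored message.
    public boolean accept(long producerId, int partition, long seq) {
        String key = producerId + "/" + partition;
        Long last = lastSeq.get(key);
        if (last != null && seq <= last) {
            return false;                  // duplicate: discard silently
        }
        lastSeq.put(key, seq);             // accept and remember
        return true;
    }

    public static void main(String[] args) {
        IdempotenceCheck broker = new IdempotenceCheck();
        System.out.println(broker.accept(1L, 0, 0L)); // true  — first send
        System.out.println(broker.accept(1L, 0, 0L)); // false — retry, dropped
        System.out.println(broker.accept(1L, 1, 0L)); // true  — different partition
    }
}
```

The last call also illustrates the limitation: the state is keyed per partition, so the scheme cannot detect a duplicate that lands on a different partition.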

Broker‑Side Deduplication (Pulsar)

Pulsar enables deduplication via the brokerDeduplicationEnabled setting in broker.conf (it can also be toggled per namespace or topic). When a producer resends a message the broker has already stored, the broker returns -1:-1 (an invalid ledgerId:entryId) in the send receipt.

Producers include a sequenceId field; the broker records the highest sequenceId per ProducerName. Upon receiving a message, the broker compares the incoming sequenceId with the stored highest value: if it is greater, the message is accepted and the highest value is updated; otherwise the message is discarded and -1:-1 is returned.

Producer disconnects: after reconnection, the locally stored sequenceId remains; sending with an incremented sequenceId works.

Producer crash: after restart, the cached sequenceId is lost; the broker sends the stored highest sequenceId to the producer, which then resumes sending from that value.

Both producer and broker crash: the broker restores state from snapshots, but the latest sequenceId may be missing if the crash occurred before snapshotting.
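These recovery scenarios can be sketched in-process. The toy model below (not Pulsar's code) keeps the broker's highest sequenceId per producer name and lets a restarted producer, whose local cache is gone, resume from the broker's stored value:

```java
import java.util.HashMap;
import java.util.Map;

public class PulsarDedupSketch {
    // Broker side: highest sequenceId recorded per producer name.
    private final Map<String, Long> highestSeq = new HashMap<>();

    // Returns the accepted sequenceId, or -1 (standing in for the
    // "-1:-1" receipt) when the message is a duplicate.
    public long receive(String producerName, long sequenceId) {
        long highest = highestSeq.getOrDefault(producerName, -1L);
        if (sequenceId <= highest) {
            return -1L;                    // duplicate: broker answers -1:-1
        }
        highestSeq.put(producerName, sequenceId);
        return sequenceId;                 // accepted
    }

    // On (re)connect the broker reports the last sequenceId it stored,
    // so a restarted producer can resume from highest + 1.
    public long lastSequenceId(String producerName) {
        return highestSeq.getOrDefault(producerName, -1L);
    }

    public static void main(String[] args) {
        PulsarDedupSketch broker = new PulsarDedupSketch();
        System.out.println(broker.receive("p1", 0));  // 0  — accepted
        System.out.println(broker.receive("p1", 0));  // -1 — duplicate retry
        // Producer crashes and restarts: it asks the broker where to resume.
        long resume = broker.lastSequenceId("p1") + 1;
        System.out.println(broker.receive("p1", resume)); // 1 — accepted
    }
}
```

The third crash scenario maps onto this model as well: if the broker's map itself is rebuilt from a stale snapshot, the stored highest value may lag behind what was actually written, which is exactly the window where a duplicate can slip through.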

Similar to Kafka’s idempotent producer, Pulsar’s broker‑side deduplication is only guaranteed at the Topic/Partition level.

Consumer‑Side Deduplication

Because producer and broker deduplication work only per partition, consumer‑side deduplication is often necessary. Assign a globally unique ID to each message payload; the consumer checks this ID to decide whether the message has been processed.

One approach is to store the ID in a database with a unique index; another is to use Redis SETNX to atomically record the ID.

<code>// SETNX sets the key only if it does not already exist and returns 1 on success.
if (jedis.setnx(messageId, "1") == 1) {
    jedis.expire(messageId, 86400); // optional TTL so dedup keys don't pile up
    // first delivery: process business logic, then return ACK
} else {
    // duplicate detected: return ACK without processing
}
</code>
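Both approaches rely on the same atomic check-and-set: exactly one attempt per message ID may win. An in-process sketch using ConcurrentHashMap.putIfAbsent (standing in for the unique-index INSERT or Redis SETNX; the class and method names are illustrative) shows the pattern:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ConsumerDedup {
    private final ConcurrentMap<String, Boolean> seen = new ConcurrentHashMap<>();

    // putIfAbsent is atomic, like a unique-index INSERT or SETNX:
    // only the first caller for a given message ID runs the logic.
    public boolean processOnce(String messageId, Runnable businessLogic) {
        if (seen.putIfAbsent(messageId, Boolean.TRUE) == null) {
            businessLogic.run();  // first delivery: run the business logic
            return true;          // then ACK
        }
        return false;             // duplicate: ACK without processing
    }

    public static void main(String[] args) {
        ConsumerDedup dedup = new ConsumerDedup();
        System.out.println(dedup.processOnce("msg-1", () -> {})); // true
        System.out.println(dedup.processOnce("msg-1", () -> {})); // false
    }
}
```

An in-memory map only works for a single consumer instance; with multiple consumers the shared store (database or Redis) is what makes the check global.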

Summary

Message queues can provide loss‑avoidance and duplicate‑avoidance features, but they are not foolproof; in scenarios sensitive to duplicate messages, it is best to implement deduplication at the consumer side, using business‑level identifiers such as unique IDs stored in a database or Redis.

Tags: backend, Kafka, message queue, consumer deduplication, duplicate handling, idempotence, Pulsar
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
