
Why Simple Retries Fail in Kafka and How to Build Robust Failure Strategies

This article explains Kafka's core concepts, the challenges of consumer failures in microservice architectures, why naïve retry loops or message skipping are insufficient, and presents a nuanced approach that distinguishes recoverable from unrecoverable errors, using back‑off retries and hidden topics to preserve ordering and data integrity.

Programmer DD

May 14, 2021


Kafka Overview

Apache Kafka has become the dominant platform for asynchronous communication between microservices, offering powerful features for building robust, resilient architectures. However, using Kafka also introduces potential pitfalls that can lead to data loss or corruption if not addressed early.

Key Concepts

Kafka consists of three basic components:

An event log where messages are published.

A publisher that writes messages to the log.

A consumer that reads messages from the log.

Unlike traditional queues such as RabbitMQ, Kafka uses a pull model where each consumer tracks an offset to know which messages have been processed.
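To make the pull model concrete, here is a minimal in-memory sketch. It is not the Kafka client API; `EventLog` and `OffsetConsumer` are hypothetical names used only for illustration. The point it shows is that reading does not remove anything from the log, and each consumer advances its own offset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Toy append-only log: messages are never removed when they are read.
class EventLog {
    private final List<String> messages = new ArrayList<>();

    void publish(String message) {
        messages.add(message);
    }

    // Returns the message at the given offset, or empty if the log ends before it.
    Optional<String> read(int offset) {
        return offset < messages.size() ? Optional.of(messages.get(offset)) : Optional.empty();
    }
}

// Toy consumer: it pulls from the log and remembers its own position.
class OffsetConsumer {
    private final EventLog log;
    private int offset = 0; // next offset to read

    OffsetConsumer(EventLog log) {
        this.log = log;
    }

    // Pulls the next message, advancing the offset only when one was available.
    Optional<String> poll() {
        Optional<String> next = log.read(offset);
        if (next.isPresent()) {
            offset++;
        }
        return next;
    }

    int offset() {
        return offset;
    }
}
```

Two `OffsetConsumer` instances created over the same `EventLog` each see every message, because neither one's offset affects the other.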

Topics

Logs are divided into topics. Each topic defines the type of events it carries and should have a single schema.

Partitions and Partition Keys

Topics are further split into partitions to enable parallel consumption. A partition key (often a UUID or other identifier) deterministically assigns a message to a partition. Because Kafka preserves order only within a partition, using the same key for all events of a given aggregate ensures that they are consumed in the order they were published.
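The key-to-partition mapping can be sketched in a few lines. This is an illustration under an assumption: a plain array hash stands in for whatever hash function a real client applies to the key, and `KeyPartitioner` is a hypothetical name. What matters is only that the mapping is deterministic.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class KeyPartitioner {
    // Maps a key to a partition index in the range [0, partitionCount).
    // Assumption: Arrays.hashCode stands in for the hash a real client would use.
    static int partitionFor(String key, int partitionCount) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        // floorMod keeps the result non-negative even when the hash is negative.
        return Math.floorMod(hash, partitionCount);
    }
}
```

Calling `partitionFor` twice with the same key and partition count always yields the same index, which is why all events for one key end up in one partition. Note that changing the partition count changes the mapping.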

Using Kafka in Microservices

Microservices often start with a centralized model where each piece of data has a single source of truth. Synchronous calls to that source create long call chains, single points of failure, and reduced team autonomy.

Modern architectures separate command handling (usually synchronous) from event handling (asynchronous). Services emit events to Kafka, and other bounded contexts consume them.

Cross‑Boundary Event Publishing

When a service updates an aggregate, it publishes an event with the aggregate ID as the partition key. This guarantees that all changes for the same aggregate land in the same partition, preserving order.

What to Do When Problems Occur?

Consumer failures are inevitable. The article examines common misconceptions:

Can we just keep retrying the same message?

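In a typical consumer setup, a message whose processing fails is not acknowledged, so the consumer keeps retrying the same message until it succeeds. Some errors are permanent, however. Endless retries then block every subsequent message in the partition and can corrupt data.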
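The blocking effect is easy to reproduce in a small simulation. The sketch below is not Kafka client code; `BlockingRetryDemo` is a hypothetical name, a `List` stands in for a partition, and `maxPolls` exists only so the simulation terminates.

```java
import java.util.List;
import java.util.function.Consumer;

class BlockingRetryDemo {
    // Processes a partition in order, re-reading the current message until the handler succeeds.
    // maxPolls bounds the simulation; a real retry loop would spin indefinitely.
    // Returns the offset reached, i.e. the number of messages handled successfully.
    static int consume(List<String> partition, Consumer<String> handler, int maxPolls) {
        int offset = 0;
        for (int poll = 0; poll < maxPolls && offset < partition.size(); poll++) {
            try {
                handler.accept(partition.get(offset));
                offset++; // advance only after success
            } catch (RuntimeException e) {
                // Offset unchanged: the same message is read again on the next poll.
            }
        }
        return offset;
    }
}
```

With a partition of three messages where the second one always fails, `consume` returns 1 no matter how large `maxPolls` is: the third message is never reached even though nothing is wrong with it.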

Can we simply skip the failing message?

Skipping works for commands (which can be retried by the caller) but not for events, which represent facts that have already occurred. Dropping events leads to state divergence between services.
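A small simulation shows how a skipped event leaves two services disagreeing. `EmailChanged` and `DownstreamView` are hypothetical names, and the `fails` predicate stands in for whatever makes processing fail.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical event: a user's email address was changed in the owning service.
class EmailChanged {
    final String userId;
    final String newEmail;

    EmailChanged(String userId, String newEmail) {
        this.userId = userId;
        this.newEmail = newEmail;
    }
}

// Toy downstream service that keeps its own copy of each user's email.
class DownstreamView {
    private final Map<String, String> emailByUser = new HashMap<>();

    // Applies events in order, silently skipping any event for which 'fails' is true.
    void consume(List<EmailChanged> events, Predicate<EmailChanged> fails) {
        for (EmailChanged event : events) {
            if (fails.test(event)) {
                continue; // the fact is dropped and never applied
            }
            emailByUser.put(event.userId, event.newEmail);
        }
    }

    String emailOf(String userId) {
        return emailByUser.get(userId);
    }
}
```

If the latest `EmailChanged` for a user is skipped, the downstream view keeps the old address indefinitely, while the owning service already holds the new one.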

How to Solve the Problem?

When Are Retry Topics Acceptable?

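A retry topic is a separate topic to which a failing message is republished so that it can be retried later while the main consumer moves on to the next message. Because the retried message is then processed after messages that were published later, retry topics are suitable only when ordering is irrelevant, such as processing website activity streams for reporting, adding transactions to a ledger without strict ordering, or ETL jobs that ingest data from another source.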
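The ordering caveat can be illustrated with a simulation in which a failing event is moved to a retry topic and applied later. This is a sketch, not Kafka code: `NameChanged` and `RetryTopicDemo` are hypothetical, and in-memory collections stand in for the topics.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Predicate;

// Hypothetical event: an aggregate's name was changed.
class NameChanged {
    final String aggregateId;
    final String name;

    NameChanged(String aggregateId, String name) {
        this.aggregateId = aggregateId;
        this.name = name;
    }
}

class RetryTopicDemo {
    // Consumes the main topic, diverting events that fail on the first attempt to a retry topic,
    // then drains the retry topic. Returns the resulting name per aggregate.
    static Map<String, String> run(List<NameChanged> mainTopic, Predicate<NameChanged> failsFirstAttempt) {
        Map<String, String> state = new HashMap<>();
        Queue<NameChanged> retryTopic = new ArrayDeque<>();
        for (NameChanged event : mainTopic) {
            if (failsFirstAttempt.test(event)) {
                retryTopic.add(event); // set aside, main consumer moves on
            } else {
                state.put(event.aggregateId, event.name);
            }
        }
        // Later, the retry consumer succeeds and applies the delayed events.
        while (!retryTopic.isEmpty()) {
            NameChanged event = retryTopic.remove();
            state.put(event.aggregateId, event.name);
        }
        return state;
    }
}
```

If the publisher emits "Zoe" and then "Zoey" for the same aggregate and only the first event fails initially, the consumer ends up with "Zoe": the older value overwrites the newer one. For workloads where each event stands alone, this reordering does no harm.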

Improved Approach

Distinguish between recoverable and unrecoverable errors:

Recoverable errors (e.g., temporary database outage) can be retried with exponential back‑off.

Unrecoverable errors (e.g., malformed payload) should be stashed in a hidden topic and the consumer should continue processing subsequent messages.

Example Java pseudocode for error classification:

```java
void processMessage(KafkaMessage km) {
  try {
    Message m = km.getMessage();
    transformAndSave(m);
  } catch (Throwable t) {
    if (isRecoverable(t)) {
      // retry with back-off
    } else {
      // stash to hidden topic
    }
  }
}
```
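The pseudocode leaves `isRecoverable` undefined. One possible sketch is shown below, assuming that a handful of standard exception types signal a temporarily unavailable resource. The choice of types and the `ErrorClassifier` name are assumptions; a real service would list the exceptions its own drivers and clients throw.

```java
import java.net.ConnectException;
import java.sql.SQLTransientException;
import java.util.concurrent.TimeoutException;

class ErrorClassifier {
    // Assumption: these exception types mean "an external resource is temporarily unavailable".
    // Everything else, including parsing and validation errors, is treated as unrecoverable.
    static boolean isRecoverable(Throwable t) {
        // Walk the cause chain, since drivers and frameworks often wrap the root failure.
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof SQLTransientException
                    || cur instanceof TimeoutException
                    || cur instanceof ConnectException) {
                return true;
            }
        }
        return false;
    }
}
```

Defaulting to "unrecoverable" is a deliberate choice in this sketch: an unknown error that is retried forever blocks the partition, whereas an unknown error that is stashed can still be replayed later.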

For recoverable errors, apply a back‑off strategy until the external resource recovers:

```java
void processMessage(KafkaMessage km) {
  try {
    Message m = km.getMessage();
    transformAndSave(m);
  } catch (Throwable t) {
    if (isRecoverable(t)) {
      // m is scoped to the try block, so the retry starts again from km.
      doWithRetry(km, Backoff.EXPONENTIAL, k -> transformAndSave(k.getMessage()));
    } else {
      // stash to hidden topic
    }
  }
}
```
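The `doWithRetry` helper is not defined in the pseudocode. A minimal sketch of an exponential back-off helper follows, with a different, explicit signature: `maxAttempts`, `initialDelayMillis`, and `maxDelayMillis` are assumed parameters, and `BackoffRetry` is a hypothetical class name. Unlike the unbounded retry described above, this version gives up after `maxAttempts` so that it always terminates.

```java
import java.util.function.Consumer;

class BackoffRetry {
    // Calls 'action' on 'message' until it succeeds or maxAttempts is used up.
    // The delay doubles after each failure, capped at maxDelayMillis.
    static <T> void doWithRetry(T message, Consumer<T> action,
                                int maxAttempts, long initialDelayMillis, long maxDelayMillis) {
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                action.accept(message);
                return; // success
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: let the caller decide what to do
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    throw new IllegalStateException("interrupted while backing off", ie);
                }
                delay = Math.min(delay * 2, maxDelayMillis);
            }
        }
    }
}
```

A call such as `BackoffRetry.doWithRetry(msg, this::transformAndSave, 5, 100, 5_000)` would wait 100 ms, 200 ms, 400 ms, and 800 ms between its five attempts. While the helper sleeps, the consumer is not polling, so in a real deployment the total back-off time has to stay within whatever liveness limits the consumer is configured with.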

When an unrecoverable error occurs, move the message directly to a hidden (or DLQ) topic instead of cycling through multiple retry topics.

Preserving Order

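Only messages belonging to the same aggregate need strict ordering. Once a message for an aggregate has been stashed, every later message for that aggregate must be stashed as well; otherwise a newer event would be applied before the older one it follows. After fixing the consumer, process the hidden messages for that aggregate in their original order before the main consumer resumes handling it, ensuring the correct sequence.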
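One way to keep per-aggregate order while stashing is for the consumer to remember which aggregates already have a stashed message. The sketch below assumes that bookkeeping lives in memory and that any exception reaching this point is unrecoverable; `Event` and `StashingConsumer` are hypothetical names, and a `List` stands in for the hidden topic.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical event carrying the ID of the aggregate it belongs to.
class Event {
    final String aggregateId;
    final String payload;

    Event(String aggregateId, String payload) {
        this.aggregateId = aggregateId;
        this.payload = payload;
    }
}

class StashingConsumer {
    private final Set<String> stashedAggregates = new HashSet<>();
    private final List<Event> hiddenTopic = new ArrayList<>(); // stands in for the hidden topic
    private final Consumer<Event> handler;

    StashingConsumer(Consumer<Event> handler) {
        this.handler = handler;
    }

    void onMessage(Event event) {
        // Once an aggregate has a stashed event, later events for it must wait behind it.
        if (stashedAggregates.contains(event.aggregateId)) {
            hiddenTopic.add(event);
            return;
        }
        try {
            handler.accept(event);
        } catch (RuntimeException unrecoverable) {
            // Assumption: recoverable errors were already retried inside the handler.
            stashedAggregates.add(event.aggregateId);
            hiddenTopic.add(event);
        }
    }

    List<Event> hiddenTopic() {
        return hiddenTopic;
    }
}
```

Events for unaffected aggregates keep flowing through the handler, while the affected aggregate's events accumulate in the hidden topic in their original order. In a real deployment the set of stashed aggregate IDs would need to survive a consumer restart, which this in-memory sketch does not address.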

Handling Hidden Messages

Deploy a dedicated hidden‑topic consumer that processes stashed messages after the consumer code has been fixed. Once the hidden consumer finishes, switch back to the main consumer.
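The replay step can be sketched as a simple drain loop. `HiddenTopicDrainer` is a hypothetical name and a `Queue` stands in for the hidden topic; the design point is that a message is removed only after the fixed handler has accepted it, so a second failure does not lose it or let later messages overtake it.

```java
import java.util.Queue;
import java.util.function.Consumer;

class HiddenTopicDrainer {
    // Replays stashed messages in the order they were stashed, using the fixed handler.
    // An exception from the handler propagates and leaves the failing message at the head.
    // Returns the number of messages replayed successfully.
    static <T> int drain(Queue<T> hiddenTopic, Consumer<T> fixedHandler) {
        int replayed = 0;
        while (!hiddenTopic.isEmpty()) {
            T next = hiddenTopic.peek();
            fixedHandler.accept(next); // may throw; 'next' then stays queued
            hiddenTopic.remove();
            replayed++;
        }
        return replayed;
    }
}
```

Once `drain` has emptied the queue, the main consumer can be switched back on for the affected aggregates.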

Should We Accept Some Inconsistency?

Complex solutions may be hard to build, test, and maintain. Organizations must assess their tolerance for temporary data inconsistency and consider eventual‑consistency mechanisms where appropriate.

Conclusion

Retry handling in Kafka is inherently complex because it must reconcile the platform’s elegance with the realities of distributed failure. Effective solutions, whether retry topics, hidden topics, or internal back‑off retries, must respect ordering, differentiate error types, and match the requirements of the specific use case.

Key Takeaways

Understand Kafka’s topics, partitions, and partition keys.

Distinguish recoverable from unrecoverable errors.

Apply design patterns such as bounded contexts and aggregates.

Evaluate whether your workload can tolerate data inconsistency.

Choose a failure‑handling strategy that matches your ordering and consistency needs.



Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
