Why Simple Kafka Retries Fail and How to Build a Robust Message‑Failure Strategy
This article analyzes common Kafka consumer failure scenarios, explains why naïve retry‑topic or message‑skip approaches can break ordering and data consistency, and presents practical patterns—including error classification, in‑consumer backoff, hidden topics, and DLQ handling—to design resilient asynchronous microservice communication.
Cross‑Bounded‑Context Message Passing
Traditional microservice designs often start with a centralized model where each piece of data lives in a single service and other services make synchronous calls to retrieve it, leading to long call chains, single points of failure, and reduced team autonomy.
Modern architectures split communication into command handling (usually synchronous within a bounded context) and event handling (asynchronously published to Kafka for other bounded contexts to consume).
In the asynchronous model, a service (e.g., UserAccount) publishes an event after creating or updating a user. Other contexts (e.g., Login) consume the event to keep local state in sync.
When Things Go Wrong
Because Kafka is a distributed system, consumer failures are inevitable. The most common pain point is a consumer that cannot successfully process a message.
Identifying the Problem
Teams often overlook that message‑processing failures will happen; they must proactively design a strategy rather than reacting after data loss.
Why Unlimited Retries Aren’t Viable
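The simplest strategy is to keep retrying the same message until it succeeds. Because a consumer reads each partition in offset order, nothing behind that message is processed in the meantime. If the message is permanently unprocessable, the consumer blocks forever and halts all downstream processing for that partition.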
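The blocking effect can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions: BlockingRetryDemo, its consume method, and the attemptBudget parameter are hypothetical names, the list stands in for a single partition, and the budget exists only so that the demo terminates where a retry-forever consumer would not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** In-memory stand-in for a consumer that retries in place; no Kafka client is involved. */
final class BlockingRetryDemo {
    /**
     * Processes messages strictly in order and re-reads a failing message until it
     * succeeds or the attempt budget runs out. Returns the messages that were processed.
     */
    static List<String> consume(List<String> partition, Predicate<String> handler, int attemptBudget) {
        List<String> processed = new ArrayList<>();
        int attempts = 0;
        int offset = 0;
        while (offset < partition.size() && attempts < attemptBudget) {
            attempts++;
            String message = partition.get(offset);
            if (handler.test(message)) {
                processed.add(message);
                offset++; // advance only on success
            }
            // On failure the offset stays put, so the same message is read again.
        }
        return processed;
    }
}
```

With the partition ["a", "poison", "b"] and a handler that rejects "poison", consume returns only ["a"]. The message "b" is never reached, however large the budget is.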
Why Skipping the Message Is Dangerous
Skipping works for commands (e.g., a failed HTTP POST) because the operation never occurred, but events represent facts that have already happened. Dropping them desynchronizes upstream and downstream services.
Popular Solution: Retry Topics
The retry‑topic pattern creates a chain of topics (retry‑1, retry‑2, …) with increasing back‑off delays. If a consumer fails, it publishes the message to the next retry topic and commits the offset, allowing the main consumer to continue.
Eventually, after the final retry attempt, the message is sent to a Dead Letter Queue (DLQ) for manual inspection.
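The routing decision made at each hop of the chain can be sketched as follows. This is a minimal sketch, assuming a hypothetical RetryChain class, an illustrative naming scheme (base topic plus "-retry-N" or "-dlq"), and a fixed delay table. It models only the choice of the next topic, not the producer call or the delayed consumption.

```java
import java.time.Duration;
import java.util.List;

/** Hypothetical routing table for a retry-topic chain; names and delays are illustrative. */
final class RetryChain {
    private final String baseTopic;
    // delays.get(i) is the back-off applied before consuming from retry topic i + 1.
    private final List<Duration> delays;

    RetryChain(String baseTopic, List<Duration> delays) {
        this.baseTopic = baseTopic;
        this.delays = List.copyOf(delays);
    }

    /**
     * @param failedAttempts attempts that have failed so far (1 after the main topic fails)
     * @return the topic the message should be published to next
     */
    String nextTopic(int failedAttempts) {
        if (failedAttempts < 1) {
            throw new IllegalArgumentException("failedAttempts must be >= 1");
        }
        // Once every retry topic has been tried, park the message in the DLQ.
        return failedAttempts <= delays.size()
                ? baseTopic + "-retry-" + failedAttempts
                : baseTopic + "-dlq";
    }

    /** Back-off to wait before consuming from the given retry topic (1-based). */
    Duration delayFor(int retryIndex) {
        return delays.get(retryIndex - 1);
    }
}
```

For a chain built with two delays, nextTopic(1) and nextTopic(2) select the two retry topics, and nextTopic(3) selects the DLQ.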
Drawbacks of the Retry‑Topic Pattern
It treats all failures the same, ignoring the distinction between recoverable (e.g., temporary DB outage) and non‑recoverable (e.g., malformed payload) errors.
It can break ordering because messages for the same aggregate may be processed out of sequence when some are delayed in retry topics.
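For example, suppose the event that sets a user’s name to “Zoë” fails and is parked in a retry topic, while a later event changing the name to “Zoiee” succeeds on the main topic. When the retry finally succeeds, it overwrites the newer value, and downstream services are left with the stale name “Zoë”.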
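The reordering can be reproduced with a small in-memory simulation. This is a sketch only: ReorderDemo, NameChanged, and failingOffsets are hypothetical names, a list stands in for the main topic, and a queue drained afterwards stands in for a delayed retry topic.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** In-memory simulation of a main topic plus one retry topic; no Kafka client is involved. */
final class ReorderDemo {
    /** Hypothetical event: the user's name was set to newName. */
    static final class NameChanged {
        final String userId;
        final String newName;

        NameChanged(String userId, String newName) {
            this.userId = userId;
            this.newName = newName;
        }
    }

    /**
     * Applies events in main-topic order. Events at the given offsets fail on their first
     * attempt and are republished to a retry topic that is consumed after a delay.
     * Returns the downstream view of each user's name.
     */
    static Map<String, String> run(List<NameChanged> mainTopic, Set<Integer> failingOffsets) {
        Map<String, String> nameByUser = new HashMap<>();
        Deque<NameChanged> retryTopic = new ArrayDeque<>();
        for (int offset = 0; offset < mainTopic.size(); offset++) {
            NameChanged event = mainTopic.get(offset);
            if (failingOffsets.contains(offset)) {
                retryTopic.add(event); // publish to retry topic, commit, move on
            } else {
                nameByUser.put(event.userId, event.newName);
            }
        }
        // The retry consumer catches up later and applies the delayed events last.
        while (!retryTopic.isEmpty()) {
            NameChanged event = retryTopic.poll();
            nameByUser.put(event.userId, event.newName);
        }
        return nameByUser;
    }
}
```

Running the two name events for one user with offset 0 marked as failing leaves the downstream view holding the older name. With no failures, the newer name wins as intended.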
When Retry Topics Are Acceptable
They work well for immutable record streams where ordering is not critical, such as website activity logs, ledger entries that do not require strict ordering, or ETL pipelines pulling from external sources.
Improving the Pattern
Classify Errors: Use a whitelist function isRecoverable(Throwable t) to separate transient from permanent failures.
In‑Consumer Retries for Recoverable Errors: Apply exponential back‑off and alert when a threshold is reached.
void processMessage(KafkaMessage km) {
    // Declared outside the try block so that the catch block can still see it.
    Message m = km.getMessage();
    try {
        transformAndSave(m);
    } catch (Throwable t) {
        if (isRecoverable(t)) {
            // Transient failure: retry in place with exponential back-off.
            doWithRetry(m, Backoff.EXPONENTIAL, this::transformAndSave);
        } else {
            // handle non-recoverable case
        }
    }
}
Hidden Topics for Non‑Recoverable Errors: Immediately move such messages to a dedicated “hidden” topic (similar to a DLQ), and reprocess them later, after the consumer has been fixed.
Preserve Ordering per Aggregate: Track which aggregates have hidden messages so that subsequent messages for the same aggregate are also hidden until the issue is resolved.
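The four steps above can be sketched together in one helper. This is a minimal sketch, not the article's implementation. FailureHandling, Route, backoffMillis, markHidden, and release are hypothetical names, and this doWithRetry takes a Callable and explicit limits rather than the signature used in processMessage. The whitelist of IOException and TimeoutException is an assumption about which errors are transient, and the hidden-aggregate set lives in memory, whereas a real consumer would likely need it to survive restarts.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

/** Sketch only: every name here is hypothetical, and no Kafka client is involved. */
final class FailureHandling {
    enum Route { PROCESS, HIDE }

    // Whitelist of exception types this sketch assumes to be transient.
    private static final Set<Class<? extends Throwable>> RECOVERABLE =
            Set.of(IOException.class, TimeoutException.class);

    // Aggregates that already have at least one message parked in the hidden topic.
    private final Set<String> hiddenAggregates = new HashSet<>();

    /** Anything not explicitly whitelisted is treated as permanent. */
    static boolean isRecoverable(Throwable t) {
        return RECOVERABLE.stream().anyMatch(type -> type.isInstance(t));
    }

    /** Delay before the given retry (1-based): base, 2x base, 4x base, ... up to the cap. */
    static long backoffMillis(int retry, long baseMillis, long capMillis) {
        int shift = Math.max(0, Math.min(retry - 1, 30)); // keep the shift in a sane range
        return Math.min(capMillis, baseMillis * (1L << shift));
    }

    /** Retries recoverable failures in place; rethrows permanent ones or gives up after maxRetries. */
    static <T> T doWithRetry(Callable<T> action, int maxRetries, long baseMillis, long capMillis)
            throws Exception {
        for (int retry = 0; ; retry++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (!isRecoverable(e) || retry >= maxRetries) {
                    throw e; // the caller hides the message or raises an alert
                }
                Thread.sleep(backoffMillis(retry + 1, baseMillis, capMillis));
            }
        }
    }

    /** Messages that follow a hidden message for the same aggregate must be hidden too. */
    Route beforeProcessing(String aggregateId) {
        return hiddenAggregates.contains(aggregateId) ? Route.HIDE : Route.PROCESS;
    }

    /** Record that a message for this aggregate was moved to the hidden topic. */
    void markHidden(String aggregateId) {
        hiddenAggregates.add(aggregateId);
    }

    /** Call once the hidden messages for the aggregate have been replayed in order. */
    void release(String aggregateId) {
        hiddenAggregates.remove(aggregateId);
    }
}
```

A consumer built on this would call beforeProcessing first, wrap the handler in doWithRetry, and call markHidden when a permanent failure or an exhausted retry budget forces the message into the hidden topic. Checking beforeProcessing on every message is what keeps later events for the same aggregate from overtaking the hidden one.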
Assessing Tolerance for Inconsistency
Complex failure‑handling mechanisms increase system complexity and operational burden. Organizations must decide whether occasional data inconsistency is acceptable and, if not, invest in coordination mechanisms to achieve eventual consistency.
Key Takeaways
Understand Kafka’s guarantees around topics, partitions, and partition keys.
Distinguish between recoverable and non‑recoverable errors.
Apply design patterns such as bounded contexts and aggregates correctly.
Evaluate whether your use case tolerates out‑of‑order processing or data inconsistency.
When using retry topics, ensure they are limited to scenarios where ordering is irrelevant and the data is immutable.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
