Kafka Deep Dive: Core Concepts Every Architect Must Master to Prevent Outages

The article explains why merely “knowing how to use” Kafka is insufficient, detailing how offset commits, consumer acknowledgments, producer acks, and rebalance behavior affect reliability, and provides concrete code examples, risk scenarios, and configuration recommendations to prevent message loss and duplicate processing in production systems.

LuTiao Programming

Incident illustration

A service restart caused downstream systems to re-consume historical messages, which rolled order states back and flooded the alerting system. Kafka itself never went down; the failure came from misunderstanding how Kafka works.

Reliability factors

How offsets are committed

How consumers confirm consumption

What happens during a rebalance

How the producer’s acknowledgment (acks) setting influences data durability

Producer acknowledgment modes

acks=0   // fire‑and‑forget, no response
acks=1   // leader writes successfully, then returns
acks=all // all ISR replicas must write before returning

Mode trade‑offs: acks=0 gives lowest latency but no safety; acks=1 offers moderate latency but data can be lost if the leader crashes before replication; acks=all provides the highest safety at the cost of higher latency.

Payment system example

ProducerRecord<String, String> record = new ProducerRecord<>("payment-topic", orderId, payload);
producer.send(record);

With acks=1, the leader can write the record and acknowledge it, then crash before any follower has replicated it. The newly elected leader does not have the record, so the producer holds a success response for a message that no longer exists.
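The failure window can be made concrete with a toy model. The sketch below is not Kafka code: `ToyPartition`, its two in-memory logs, and `failLeader` are hypothetical names invented for illustration, and replication is reduced to copying a record into a second list. It only shows why an acknowledgment sent before replication can refer to a record the next leader never received.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one partition with a leader and a single follower (not real Kafka code). */
class ToyPartition {
    private List<String> leaderLog = new ArrayList<>();
    private List<String> followerLog = new ArrayList<>();

    /**
     * Appends a record and returns true to signal an acknowledgment to the producer.
     * waitForFollower=false mimics acks=1: the ack is sent before replication, and in
     * this model the follower never catches up before the crash.
     * waitForFollower=true mimics acks=all: the follower copies the record first.
     */
    boolean send(String record, boolean waitForFollower) {
        leaderLog.add(record);
        if (waitForFollower) {
            followerLog.add(record);
        }
        return true;
    }

    /** Simulates a leader crash: the follower is promoted with whatever it has replicated. */
    void failLeader() {
        leaderLog = followerLog;
        followerLog = new ArrayList<>();
    }

    /** Records visible to consumers, i.e. a copy of the current leader's log. */
    List<String> visibleRecords() {
        return new ArrayList<>(leaderLog);
    }
}
```

Sending a record with `waitForFollower=false` and then calling `failLeader()` leaves `visibleRecords()` empty even though `send` returned true; with `waitForFollower=true` the record survives the failover.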

Safer producer configuration

acks=all
retries=3
enable.idempotence=true

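This configuration waits for every in-sync replica to write the record, retries transient send failures automatically, and lets the broker discard the duplicates those retries would otherwise create. Because acks=all only covers the replicas currently in the ISR, it is usually paired with min.insync.replicas of at least 2 on the topic or broker.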
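In a Java client these settings are typically supplied as a `java.util.Properties` object passed to the producer's constructor. The following is a minimal sketch of building that object: the class name `SafeProducerConfig` is hypothetical, the broker address is supplied by the caller, and the String serializers are an assumption chosen to match the `ProducerRecord<String, String>` used in the payment example.

```java
import java.util.Properties;

/** Builds the producer settings discussed above as a java.util.Properties object. */
class SafeProducerConfig {
    static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        // Connection and serialization (assumed: String keys and values).
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Durability settings from the article.
        props.setProperty("acks", "all");
        props.setProperty("retries", "3");
        props.setProperty("enable.idempotence", "true");
        return props;
    }
}
```

A caller would use something like `SafeProducerConfig.build("localhost:9092")`, where the address is a placeholder for a real broker list.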

Offset management

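Kafka only tracks the offset that the application commits; it has no knowledge of whether the application finished processing a message. With the default enable.auto.commit=true, the client commits offsets on a timer (auto.commit.interval.ms, 5 seconds by default), independent of business-logic processing.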

Risk scenario with automatic commits

Consumer pulls messages

Offset is auto‑committed

Business processing fails

Service restarts

Result: after the restart the consumer resumes from the committed offset, so the failed message is never re-processed and is effectively lost.
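The sequence above can be replayed in a toy model. This sketch is in-memory only and does not use the Kafka client: `CommitOrderDemo`, its `run` method, and the idea of a record that fails exactly once are all assumptions made for illustration. It compares committing before processing with committing after processing when a restart follows a failure.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy replay of the commit-then-fail scenario (in-memory only, not real Kafka code). */
class CommitOrderDemo {
    /**
     * Consumes the log one record at a time. The record equal to failsOnce throws on its
     * first attempt, after which the "service" restarts and resumes from the committed
     * offset. Returns the records that were processed successfully, in order.
     */
    static List<String> run(List<String> log, String failsOnce, boolean commitBeforeProcessing) {
        List<String> processed = new ArrayList<>();
        boolean alreadyFailed = false;
        int committedOffset = 0; // what the broker remembers across restarts
        int position = 0;        // the consumer's in-memory read position
        while (position < log.size()) {
            String record = log.get(position);
            if (commitBeforeProcessing) {
                committedOffset = position + 1; // offset committed before the work is done
            }
            if (record.equals(failsOnce) && !alreadyFailed) {
                alreadyFailed = true;
                position = committedOffset;     // restart: resume from the committed offset
                continue;
            }
            processed.add(record);
            position++;
            committedOffset = position;         // commit after successful processing
        }
        return processed;
    }
}
```

With the log `order-1, order-2, order-3` and `order-2` failing once, committing first yields `order-1, order-3` (the failed record is skipped), while committing after processing yields all three records because `order-2` is read again after the restart.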

Manual offset commit (production‑grade)

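while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    try {
        for (ConsumerRecord<String, String> record : records) {
            process(record);
        }
        // Commit only after the whole batch has been processed.
        consumer.commitSync();
    } catch (Exception e) {
        // Nothing was committed, but the in-memory position is already past this batch.
        // Rewind each partition to the batch start so the next poll() returns it again.
        for (TopicPartition tp : records.partitions()) {
            consumer.seek(tp, records.records(tp).get(0).offset());
        }
    }
}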

Rebalance handling

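A rebalance is triggered when a consumer joins the group, a consumer leaves or crashes, or the partition count of a subscribed topic changes. Before its partitions are revoked, the consumer should commit the offsets of the records it has already processed; otherwise the next owner resumes from an older committed offset and processes those records again.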

consumer.subscribe(List.of("order-topic"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // commit current offsets
        consumer.commitSync();
    }
    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // optionally adjust offsets per business needs
    }
});
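The effect of committing in onPartitionsRevoked can be illustrated with simple offset arithmetic. The sketch below is a toy model rather than Kafka client code; `RebalanceDemo` and its parameters are hypothetical, and it assumes the new owner always starts from the last committed offset.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a partition moving from consumer A to consumer B (not real Kafka code). */
class RebalanceDemo {
    /**
     * Consumer A has processed offsets [0, processedByA) but its last periodic commit
     * was at lastCommitted. After the rebalance, consumer B starts from the committed
     * offset. Returns the offsets B will process that A had already handled.
     */
    static List<Integer> duplicatesAfterRebalance(int processedByA, int lastCommitted,
                                                  boolean commitOnRevoke) {
        // Committing on revoke moves the committed offset up to A's actual progress.
        int committed = commitOnRevoke ? processedByA : lastCommitted;
        List<Integer> duplicates = new ArrayList<>();
        for (int offset = committed; offset < processedByA; offset++) {
            duplicates.add(offset);
        }
        return duplicates;
    }
}
```

If A has processed offsets 0 through 5 but last committed at 3, B re-processes offsets 3, 4, and 5 without the commit on revoke, and none with it.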

Consumer semantics

At most once : commit offset before processing; possible message loss.

At least once : process then commit offset; possible duplicate processing.

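Exactly once : use an idempotent producer and transactions; no loss and no duplication within Kafka read-process-write flows. Side effects in external systems such as databases still need idempotent handling.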

Recommended pattern

Combine “at least once” processing with an idempotent design.

public void process(String orderId) {
    if (repository.exists(orderId)) {
        return;
    }
    repository.save(orderId);
}
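The exists-then-save check above leaves a gap when two threads or consumers handle the same redelivered message at once: both can pass `exists` before either calls `save`. A minimal sketch of closing that gap follows, assuming an in-memory set stands in for the repository; `IdempotentOrderProcessor` and `appliedCount` are hypothetical names. In a real service the equivalent is usually a unique constraint or conditional insert in the data store, which this sketch does not show.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** In-memory idempotency guard; a stand-in for a database unique constraint. */
class IdempotentOrderProcessor {
    private final Set<String> seenOrderIds = ConcurrentHashMap.newKeySet();
    private final AtomicInteger applied = new AtomicInteger();

    /**
     * Applies the order at most once. Set.add on this concurrent set is atomic and
     * returns false when the id is already present, so check and insert cannot interleave.
     * Returns true if this call applied the order, false for a duplicate delivery.
     * Note: the id is marked as seen before the side effect runs, so a real
     * implementation must also handle a side effect that fails after marking.
     */
    boolean process(String orderId) {
        if (!seenOrderIds.add(orderId)) {
            return false;
        }
        applied.incrementAndGet(); // placeholder for the real business side effect
        return true;
    }

    /** Number of orders whose side effect was actually executed. */
    int appliedCount() {
        return applied.get();
    }
}
```

Calling `process` twice with the same order id applies the side effect once: the first call returns true and the second returns false.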

Project layout example

/opt/app/kafka-consumer/
├── bin/
├── config/
│   └── application.yml
├── logs/
└── lib/

Unified package name:

package com.icoderoad.kafka.consumer;

Illustrative diagrams

Figure: Offset commit flow diagram

Figure: Rebalance process diagram

Takeaway

Kafka’s difficulty lies in mastering offset commits, rebalance handling, producer acknowledgment settings, and idempotent design. Correctly handling these mechanisms under failure conditions yields true system resilience.
