Comprehensive Kafka FAQ: Uses, Architecture, Offsets, and Partition Management

This article provides an extensive overview of Apache Kafka, covering its use cases, key concepts such as ISR, AR, HW, LEO, and LW, message ordering, the roles of partitioners, serializers and interceptors, producer and consumer client architecture, offset handling, multithreaded consumption, and topic partition management.

Big Data Technology & Architecture

Jul 23, 2020

Kafka Use Cases and Scenarios

Kafka functions as a messaging system offering decoupling, redundancy, traffic smoothing, buffering, asynchronous communication, scalability, and recoverability. It also guarantees message ordering within a partition and supports replaying previously consumed messages. Because messages are persisted to disk, Kafka can serve as a long-term data store when a topic is configured with unlimited retention or log compaction.

Kafka also acts as a streaming platform, providing reliable sources for popular stream‑processing frameworks and offering a rich library of operations such as windowing, joins, transformations, and aggregations.

ISR, AR, and Their Scaling

All replicas of a partition form the Assigned Replicas (AR). The In‑Sync Replicas (ISR) are the subset of AR that stay sufficiently synchronized with the leader. The leader tracks follower lag and removes lagging followers from ISR; followers that catch up are added back. Only ISR members are eligible for leader election by default.

Key configuration parameters:

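replica.lag.time.max.ms – maximum time a follower may go without catching up to the leader before it is removed from ISR (default 10 s before Kafka 2.5, 30 s from 2.5 onward).

unclean.leader.election.enable – whether a replica outside ISR may be elected leader (default false); enabling it improves availability at the risk of data loss.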
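
The shrink and expand behaviour described above can be sketched as a small model. This is a simplified illustration, not the broker's implementation: the class `IsrModel`, the method `inSyncReplicas`, and the idea of passing in a map of "last caught up" timestamps are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Simplified model of ISR membership; names are hypothetical, not Kafka classes.
class IsrModel {
    /**
     * Returns the ids of replicas treated as in sync at time nowMs.
     * lastCaughtUpMs maps a follower id to the last time it had fully
     * caught up with the leader. The leader itself is always in the ISR.
     */
    static Set<Integer> inSyncReplicas(int leaderId,
                                       Map<Integer, Long> lastCaughtUpMs,
                                       long nowMs,
                                       long replicaLagTimeMaxMs) {
        Set<Integer> isr = new TreeSet<>();
        isr.add(leaderId);
        for (Map.Entry<Integer, Long> e : lastCaughtUpMs.entrySet()) {
            long lagMs = nowMs - e.getValue();
            if (lagMs <= replicaLagTimeMaxMs) {
                isr.add(e.getKey()); // follower is close enough to the leader
            }
        }
        return isr;
    }

    public static void main(String[] args) {
        Map<Integer, Long> lastCaughtUp = new LinkedHashMap<>();
        lastCaughtUp.put(2, 95_000L); // 5 s behind at nowMs = 100_000
        lastCaughtUp.put(3, 60_000L); // 40 s behind, exceeds the 30 s limit
        System.out.println(inSyncReplicas(1, lastCaughtUp, 100_000L, 30_000L)); // prints [1, 2]
    }
}
```

Calling the method again with a later timestamp for replica 3 would put it back into the returned set, which mirrors a follower rejoining ISR after it catches up.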

HW, LEO, LSO, LW Definitions

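HW (High Watermark) marks the boundary of what consumers can read: only messages with offsets below HW are visible to them. LEO (Log End Offset) is the offset of the next message to be written to a replica's log; the smallest LEO among ISR replicas equals the partition's HW. LSO here refers to the log start offset (logStartOffset), the first offset still available in the partition's log; it advances when old segments are deleted or when records are removed via delete-records operations. Note that Kafka's transaction-related documentation also uses the abbreviation LSO for Last Stable Offset, which is a different concept. LW (Low Watermark) is the smallest logStartOffset among AR replicas.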
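
A short worked example of the two "minimum" definitions above. The class and method names (`Watermarks`, `highWatermark`, `lowWatermark`) are hypothetical helpers for illustration, and the offsets are made-up values.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative arithmetic only; not part of the Kafka client API.
class Watermarks {
    /** HW is the smallest log end offset among the in-sync replicas. */
    static long highWatermark(List<Long> isrLogEndOffsets) {
        return min(isrLogEndOffsets, "ISR");
    }

    /** LW is the smallest log start offset among all assigned replicas. */
    static long lowWatermark(List<Long> arLogStartOffsets) {
        return min(arLogStartOffsets, "AR");
    }

    private static long min(List<Long> offsets, String what) {
        return offsets.stream().mapToLong(Long::longValue).min()
                .orElseThrow(() -> new IllegalArgumentException(what + " must not be empty"));
    }

    public static void main(String[] args) {
        long hw = highWatermark(Arrays.asList(12L, 10L, 11L)); // LEOs of three ISR replicas
        long lw = lowWatermark(Arrays.asList(3L, 0L, 3L));     // logStartOffsets of three AR replicas
        System.out.println("HW=" + hw + " LW=" + lw);          // prints HW=10 LW=0
    }
}
```

With HW at 10, consumers in this example can read offsets 0 through 9, even though the leader may already hold offsets 10 and 11.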

Message Ordering in Kafka

Kafka guarantees ordering only within a single partition, not across a whole topic. Which partition a record lands in depends on the partitioning strategy: round-robin, random, or key-based. When a key is set, all messages with the same key are routed to the same partition, preserving their relative order within that partition.

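// Body of a custom Partitioner.partition(...) implementation.
List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
// Mask the sign bit instead of using Math.abs, which stays negative for Integer.MIN_VALUE.
return (key.hashCode() & 0x7fffffff) % partitions.size();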
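
A self-contained sketch of key-based routing, runnable without a Kafka cluster. It uses `Object.hashCode()` as a stand-in hash, which is an assumption for illustration: Kafka's built-in partitioner hashes the serialized key bytes with murmur2 instead. The class name `KeyPartitioning` is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Stand-alone illustration of "same key, same partition".
class KeyPartitioning {
    /** Maps a key to a partition index in [0, numPartitions). */
    static int partitionFor(Object key, int numPartitions) {
        // Clearing the sign bit keeps the result non-negative for every hash value.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("user-1", "user-2", "user-1");
        for (String key : keys) {
            System.out.println(key + " -> partition " + partitionFor(key, 3));
        }
    }
}
```

The first and third lines of output always show the same partition, because both records carry the key `user-1`.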

Partitioner, Serializer, Interceptor and Their Processing Order

Serializer – converts objects to byte arrays (required).

Partitioner – assigns a partition when none is specified.

Interceptor – optional hooks before sending (producer) or after receiving (consumer).

Processing order on the producer side: Interceptor → Serializer → Partitioner.
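
The ordering can be illustrated with a toy pipeline that records each stage as it runs. Everything here is an assumption for the sake of the sketch: the class `SendPipeline`, the `[audited]` prefix added by the pretend interceptor, and the hash used by the pretend partitioner are not Kafka APIs.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the producer-side stages; not the Kafka client implementation.
class SendPipeline {
    /** Runs the three stages in order and returns a trace of what happened. */
    static List<String> send(String key, String value, int numPartitions) {
        List<String> trace = new ArrayList<>();

        // 1. Interceptor: sees (and may modify) the record before serialization.
        String intercepted = "[audited] " + value;
        trace.add("interceptor");

        // 2. Serializer: turns key and value into byte arrays.
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        byte[] valueBytes = intercepted.getBytes(StandardCharsets.UTF_8);
        trace.add("serializer");

        // 3. Partitioner: picks a partition from the serialized key.
        int partition = (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
        trace.add("partitioner: partition " + partition + ", " + valueBytes.length + " value bytes");

        return trace;
    }

    public static void main(String[] args) {
        send("user-1", "hello", 3).forEach(System.out::println);
    }
}
```

The partitioner runs last because it needs the serialized key, and the interceptor runs first so that any changes it makes are reflected in the serialized bytes.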

Kafka Producer Client Structure

The producer runs two threads: the main thread creates records and passes them through interceptors, serializers, and partitioners into the RecordAccumulator; the Sender thread drains the accumulator and sends batches to brokers.
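
The two-thread hand-off can be modelled with a queue standing in for the RecordAccumulator. This is a deliberately reduced sketch: the class `TwoThreadProducer`, the fixed `batchSize`, and the stop sentinel are assumptions for illustration, and the real accumulator batches per partition and by size and time rather than by a simple count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Toy model: main thread appends records, a sender thread drains them in batches.
class TwoThreadProducer {
    // Hypothetical sentinel, compared by identity, that tells the sender to stop.
    private static final String STOP = new String("stop");

    static List<List<String>> run(List<String> records, int batchSize) {
        BlockingQueue<String> accumulator = new LinkedBlockingQueue<>();
        List<List<String>> sentBatches = new ArrayList<>();

        Thread sender = new Thread(() -> {
            List<String> batch = new ArrayList<>();
            try {
                while (true) {
                    String record = accumulator.take();
                    if (record == STOP) break;
                    batch.add(record);
                    if (batch.size() == batchSize) { // batch full: "send" it
                        sentBatches.add(batch);
                        batch = new ArrayList<>();
                    }
                }
                if (!batch.isEmpty()) sentBatches.add(batch); // flush the remainder
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "sender");
        sender.start();

        for (String record : records) accumulator.add(record); // main thread side
        accumulator.add(STOP);
        try {
            sender.join(); // join makes the sender's writes visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for sender", e);
        }
        return sentBatches;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a", "b", "c", "d", "e"), 2)); // prints [[a, b], [c, d], [e]]
    }
}
```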

Consumer Thread Model

KafkaConsumer is not thread‑safe; each thread must own its own consumer instance. Common patterns include one consumer per thread or a pool of consumers with a separate processing thread pool.
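
The second pattern, where a polling thread hands records to a separate processing pool, might look roughly like the model below. It does not use the Kafka client: the list of batches stands in for successive `poll` results, and `PollAndDispatch` and `consume` are hypothetical names. The point it illustrates is that only one thread touches the consumer while the workers handle processing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of "one polling thread + worker pool"; no Kafka dependency.
class PollAndDispatch {
    /** Processes every record of every batch on a worker pool; returns the count processed. */
    static int consume(List<List<String>> polledBatches, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicInteger processed = new AtomicInteger();
        try {
            for (List<String> batch : polledBatches) { // each batch stands in for one poll result
                List<Future<?>> inFlight = new ArrayList<>();
                for (String record : batch) {
                    Runnable task = () -> processed.incrementAndGet(); // placeholder for real work
                    inFlight.add(pool.submit(task));
                }
                // Wait for the whole batch before polling again, so offsets for it
                // could be committed safely at this point.
                for (Future<?> f : inFlight) f.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("record processing failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        List<List<String>> batches = Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c"));
        System.out.println(consume(batches, 2)); // prints 3
    }
}
```

Waiting for each batch to finish is one possible design choice; it keeps offset handling simple at the cost of some parallelism between batches.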

Consumer Groups and Offset Management

Consumer groups consist of one or more consumer instances sharing a group ID. Each partition is assigned to a single consumer within the group. Offsets are stored in the internal __consumer_offsets topic. The committed offset is the offset of the last processed message plus one, that is, the position of the next message to consume.
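
The "plus one" rule can be made concrete with a small helper that derives the offsets to commit from the offsets already processed. The class `CommitOffsets` and its method are hypothetical and only illustrate the arithmetic; partitions are identified by plain integers here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative helper; not part of the Kafka client API.
class CommitOffsets {
    /** For each partition, the offset to commit is the highest processed offset + 1. */
    static Map<Integer, Long> offsetsToCommit(Map<Integer, List<Long>> processedOffsets) {
        Map<Integer, Long> toCommit = new TreeMap<>();
        for (Map.Entry<Integer, List<Long>> e : processedOffsets.entrySet()) {
            if (e.getValue().isEmpty()) continue; // nothing processed, nothing to commit
            long lastProcessed = Collections.max(e.getValue());
            toCommit.put(e.getKey(), lastProcessed + 1);
        }
        return toCommit;
    }

    public static void main(String[] args) {
        Map<Integer, List<Long>> processed = new TreeMap<>();
        processed.put(0, Arrays.asList(5L, 6L, 7L));
        processed.put(1, Arrays.asList(42L));
        System.out.println(offsetsToCommit(processed)); // prints {0=8, 1=43}
    }
}
```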

Duplicate and Lost Consumption Scenarios

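Duplicate consumption can arise when a rebalance or crash happens after messages have been processed but before their offsets are committed (with either manual or automatic commits), and from producer retries. Lost consumption may occur when offsets are auto-committed before processing finishes and the consumer then crashes, leaving messages unprocessed; on the producer side, messages can be lost with fire-and-forget sends or insufficient acks.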
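
The consumer-side cases come down to whether the offset is committed before or after processing. The simulation below is a sketch under simple assumptions: a single partition with offsets `0..messageCount-1`, one crash at a chosen offset, and a restart that resumes from the committed offset. `CommitOrdering` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation of a crash between "commit" and "process"; no Kafka dependency.
class CommitOrdering {
    /** Returns the offsets that were actually processed, across the crash and the restart. */
    static List<Integer> run(boolean commitBeforeProcessing, int messageCount, int crashAtOffset) {
        List<Integer> processed = new ArrayList<>();
        int committed = 0;

        // First run: crashes at crashAtOffset, between the two steps.
        for (int offset = committed; offset < messageCount; offset++) {
            if (commitBeforeProcessing) {
                committed = offset + 1;
                if (offset == crashAtOffset) break; // committed, never processed
                processed.add(offset);
            } else {
                processed.add(offset);
                if (offset == crashAtOffset) break; // processed, never committed
                committed = offset + 1;
            }
        }

        // Restart: resume from the last committed offset and finish without crashing.
        for (int offset = committed; offset < messageCount; offset++) {
            processed.add(offset);
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(run(true, 5, 2));  // prints [0, 1, 3, 4]: offset 2 is lost
        System.out.println(run(false, 5, 2)); // prints [0, 1, 2, 2, 3, 4]: offset 2 is duplicated
    }
}
```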

Topic Partition Management

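Partitions can be increased using the kafka-topics.sh --alter command, which triggers a rebalance for the consumer groups subscribed to the topic. For keyed messages, adding partitions also changes which partition a given key maps to, so per-key ordering holds only among records written after the change. Decreasing partitions is not supported because it would require complex data reshuffling and could break ordering and timestamp semantics.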
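
One practical consequence of increasing the partition count is that hash-based key routing changes. The sketch below shows which keys would land on a different partition after growing a topic from 3 to 4 partitions. It assumes a simple `hashCode`-modulo partitioner for illustration (Kafka's built-in partitioner uses murmur2 over the serialized key), and `RepartitionEffect` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrates how key-to-partition mapping shifts when partitions are added.
class RepartitionEffect {
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    /** Returns the keys whose partition differs between the two partition counts. */
    static List<String> movedKeys(List<String> keys, int before, int after) {
        List<String> moved = new ArrayList<>();
        for (String key : keys) {
            if (partitionFor(key, before) != partitionFor(key, after)) {
                moved.add(key);
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        // "a".."d" hash to 97..100, so the arithmetic is easy to check by hand.
        System.out.println(movedKeys(Arrays.asList("a", "b", "c", "d"), 3, 4)); // prints [c, d]
    }
}
```

Records for keys `c` and `d` written before the change stay in their old partitions, which is why ordering across the change cannot be assumed for those keys.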

When creating a topic, choose the partition count based on performance testing (e.g., kafka-producer-perf-test.sh, kafka-consumer-perf-test.sh) and consider the impact on throughput, latency, and fault tolerance.
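
One way to turn perf-test numbers into a starting partition count is a commonly cited rule of thumb: divide the target throughput by the measured per-partition producer and consumer throughput and take the larger result. This rule is not from the article itself, and the class `PartitionCount` and the sample figures are assumptions for illustration; the outcome should still be validated against latency and fault-tolerance requirements.

```java
// Rule-of-thumb estimate only; measured results should drive the final choice.
class PartitionCount {
    /**
     * Estimates a partition count from a target throughput and the per-partition
     * throughput measured for producers and consumers (all in the same unit, e.g. MB/s).
     */
    static int estimate(double target, double producerPerPartition, double consumerPerPartition) {
        double forProducers = target / producerPerPartition;
        double forConsumers = target / consumerPerPartition;
        return (int) Math.ceil(Math.max(forProducers, forConsumers));
    }

    public static void main(String[] args) {
        // Target 100 MB/s; one partition sustains 20 MB/s producing and 12.5 MB/s consuming.
        System.out.println(estimate(100, 20, 12.5)); // prints 8
    }
}
```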

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.