Understanding Kafka Architecture: Topics, Partitions, Producers, Consumers, Offsets, and Transactions
Kafka is a high‑performance distributed message queue that scales through topics, partitions, and replicas. Producers write messages and consumers read them via consumer groups, with advanced features such as offset management, configurable delivery semantics, rebalancing, and exactly‑once transactional processing, all coordinated through ZooKeeper.
Kafka is a distributed message queue designed for high throughput, persistence, replication, and horizontal scalability. Producers write messages to topics, which are divided into multiple partitions; each partition can have several replicas and a designated leader that handles all reads and writes.
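A keyed producer chooses the target partition deterministically from the message key, so all messages with the same key stay ordered within one partition. Kafka's default partitioner uses murmur2 hashing; the sketch below substitutes MD5 purely to illustrate the idea:

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition deterministically.
    Kafka's default partitioner uses murmur2; MD5 here is only a stand-in."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always maps to the same partition, preserving per-key order.
assert partition_for(b"user-42", 6) == partition_for(b"user-42", 6)
```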
Consumers subscribe to topics via consumer groups. Within a group, each partition is consumed by only one consumer, ensuring that a partition’s messages are processed by a single consumer at a time while allowing multiple groups to read the same data.
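The one-partition-per-consumer rule can be illustrated with a toy round-robin assignor (Kafka actually ships range, round-robin, and sticky assignors; this is a simplification):

```python
def assign_round_robin(partitions, consumers):
    """Toy assignor: each partition goes to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Six partitions spread over a two-consumer group: no partition is shared.
group = assign_round_robin(range(6), ["c1", "c2"])
```

A second group running the same assignment would receive its own full copy of every partition, which is how multiple groups read the same data independently.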
Kafka stores metadata such as broker information, topics, and partitions in ZooKeeper. A special component called the Controller, elected via ZooKeeper, manages partition assignment and leader election. When a broker fails, the Controller re‑elects leaders for affected partitions.
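A minimal sketch of the re-election decision, assuming the Controller prefers the first surviving replica that is still in the in-sync replica set (ISR):

```python
def elect_leader(replicas, isr, failed_broker):
    """Pick the first surviving replica that is still in the ISR.
    Returns None when no eligible replica remains (the partition goes
    offline unless unclean leader election is allowed)."""
    for broker in replicas:
        if broker != failed_broker and broker in isr:
            return broker
    return None

# Broker 1 (the leader) fails; broker 2 is in sync and takes over.
new_leader = elect_leader(replicas=[1, 2, 3], isr={1, 2}, failed_broker=1)
```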
Offset tracking originally used ZooKeeper, but since version 0.10 offsets have been stored in the internal __consumer_offsets topic. Offsets are keyed by group ID, topic, and partition, and the topic uses the compact cleanup policy to retain only the latest committed offset per key.
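The effect of the compact cleanup policy can be sketched as a keep-latest-per-key pass (the data here is made up; real compaction runs incrementally over log segments in the background):

```python
def compact(log):
    """Log compaction: keep only the newest record for each key
    (later entries overwrite earlier ones)."""
    latest = {}
    for key, value in log:
        latest[key] = value
    return list(latest.items())

# Two commits for the same (group, topic, partition) key: only the
# latest committed offset survives compaction.
log = [(("grp", "orders", 0), 10), (("grp", "orders", 0), 25)]
```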
Kafka provides three delivery semantics: at‑most‑once (messages may be lost but never duplicated), at‑least‑once (messages are never lost but may be duplicated), and exactly‑once (no loss and no duplication, supported end‑to‑end when both the input and the output are Kafka topics).
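On the consumer side, the difference between at-most-once and at-least-once comes down to whether the offset is committed before or after processing. A toy simulation of a crash between the two steps, with invented records:

```python
def run_consumer(records, start, commit_first, crash):
    """Process records from `start`; if `crash` is set, die between the
    process step and the commit step of the first record handled."""
    processed, committed = [], start
    for offset in range(start, len(records)):
        if commit_first:                      # at-most-once ordering
            committed = offset + 1
            if crash and offset == start:
                return processed, committed   # crashed before processing
            processed.append(records[offset])
        else:                                 # at-least-once ordering
            processed.append(records[offset])
            if crash and offset == start:
                return processed, committed   # crashed before committing
            committed = offset + 1
    return processed, committed

records = ["a", "b"]
# At-most-once: offset committed first, so the crash loses record "a".
lost, committed = run_consumer(records, 0, commit_first=True, crash=True)
resumed, _ = run_consumer(records, committed, commit_first=True, crash=False)
# At-least-once: record processed first, so "a" is reprocessed after restart.
dup, committed2 = run_consumer(records, 0, commit_first=False, crash=True)
resumed2, _ = run_consumer(records, committed2, commit_first=False, crash=False)
```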
Exactly‑once semantics rely on idempotent producers and transactional processing. Each producer is assigned a unique PID and maintains a monotonically increasing sequence number per partition; the broker accepts a message only if its sequence equals broker_seq + 1. Transactions are identified by a user‑provided transactional ID (TID) and coordinated by a Transaction Coordinator, which records transaction states in a compacted internal log.
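The broker-side sequence check can be sketched as follows (a simplification: real brokers track sequences per PID, partition, and producer epoch, and accept whole batches rather than single messages):

```python
class PartitionState:
    """Broker-side dedup state for one (PID, partition) pair."""
    def __init__(self):
        self.last_seq = -1

    def try_append(self, seq: int) -> bool:
        """Accept only the next expected sequence (last_seq + 1); a retried
        write re-sends an old sequence and is rejected as a duplicate."""
        if seq == self.last_seq + 1:
            self.last_seq = seq
            return True
        return False

state = PartitionState()
assert state.try_append(0)        # first write accepted
assert not state.try_append(0)    # duplicate retry rejected
assert state.try_append(1)        # next write accepted
```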
During a transaction, the producer writes data to the target topics, and the coordinator records a Prepare‑Commit or Prepare‑Abort state. Once all participating partitions acknowledge, a final Commit or Abort marker makes the messages visible or discards them, ensuring atomicity across multiple topics and offset updates.
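The effect of the final markers on a read_committed consumer can be illustrated with a toy log in which control markers decide which data records become visible (the record layout here is invented for illustration, not Kafka's wire format):

```python
def read_committed(log):
    """Return the data messages whose transaction ended with a COMMIT marker.
    `log` interleaves ("data", tid, message) records with ("COMMIT", tid)
    or ("ABORT", tid) control markers."""
    outcome = {e[1]: e[0] for e in log if e[0] in ("COMMIT", "ABORT")}
    return [msg for kind, tid, *rest in log
            if kind == "data" and outcome.get(tid) == "COMMIT"
            for msg in rest]

log = [("data", "t1", "m1"), ("data", "t2", "m2"),
       ("COMMIT", "t1"), ("ABORT", "t2")]
visible = read_committed(log)  # only t1's message is visible
```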
Kafka stores data on disk as log segments within each partition. Each segment has a .log file plus accompanying .index (offset index) and .timeindex (time index) files. Indexes are sparse, storing pairs of base offset and file position to enable fast lookups without loading entire logs into memory.
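A sparse-index lookup is a binary search for the last indexed entry at or below the target offset, followed by a short forward scan in the log file from that position. A minimal sketch of the search step:

```python
import bisect

def locate(index, target_offset):
    """Find the file position to start scanning from, given a sparse index
    of sorted (base_offset, file_position) pairs."""
    offsets = [base for base, _ in index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    if i < 0:
        return None  # target precedes the first indexed entry
    return index[i][1]

# Index entries every ~100 offsets: offset 150 starts scanning at byte 4096.
index = [(0, 0), (100, 4096), (200, 8192)]
assert locate(index, 150) == 4096
```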
Key configuration parameters include broker settings (e.g., replication factor, log segment size), topic settings (e.g., cleanup policy, retention size/time), and consumer settings (e.g., heartbeat interval, session timeout). Proper tuning of these parameters is essential for performance, durability, and resource utilization.
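An illustrative (not prescriptive) sample of such settings; the keys are standard Kafka configuration names, but the values are placeholders that should be tuned per workload:

```properties
# Broker settings
default.replication.factor=3
log.segment.bytes=1073741824        # 1 GiB per log segment

# Topic settings (per-topic overrides)
cleanup.policy=delete               # or "compact" for changelog-style topics
retention.ms=604800000              # 7 days
retention.bytes=-1                  # no size-based limit

# Consumer settings
session.timeout.ms=45000
heartbeat.interval.ms=3000
```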
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.