Kafka Overview: Architecture, Advantages, Disadvantages, and Core Concepts
This article provides a comprehensive introduction to Apache Kafka, covering its distributed publish‑subscribe architecture, its key components such as brokers, topics, partitions, producers, consumers, and ZooKeeper, as well as its advantages, drawbacks, storage mechanisms, partition assignment strategies, and reliability guarantees for high‑throughput big‑data streaming.
Kafka Overview
Kafka is a distributed publish/subscribe messaging system written in Scala and Java. It was originally developed at LinkedIn and is now maintained by the Apache Software Foundation as a high-throughput, low-latency platform for real-time data processing.
Key Concepts
Kafka's core concepts are brokers, topics, partitions, producers, consumers, consumer groups, leaders, followers, replication, and offsets. A topic is a logical category of messages; each topic is split into ordered partitions that are distributed across brokers. Each partition has one leader replica that handles writes (and, by default, reads), while follower replicas copy its data for fault tolerance. An offset identifies a record's position within its partition.
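The relationship between keys, partitions, and offsets can be sketched with a small in-memory model. This is only an illustration: ToyTopic and partition_for are hypothetical names, and CRC32 is a stand-in hash (Kafka's Java client uses murmur2 for keyed records).

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in hash for illustration; the real Java client uses murmur2.
    return zlib.crc32(key) % num_partitions


class ToyTopic:
    """Hypothetical in-memory model: one append-only list per partition."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key: bytes, value: str):
        p = partition_for(key, len(self.partitions))
        log = self.partitions[p]
        log.append(value)
        # The offset is the record's position within its partition log.
        return p, len(log) - 1


topic = ToyTopic(num_partitions=3)
p1, o1 = topic.append(b"user-42", "login")
p2, o2 = topic.append(b"user-42", "click")
print(p1 == p2, o1, o2)
```

Because both records share a key, they land in the same partition with consecutive offsets, which is why per-key ordering holds within a partition.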
Advantages
Kafka supports multiple producers and consumers, horizontal broker scaling, data replication, topic-based categorisation, batching with compression, disk-based persistence, low CPU/memory/network overhead, cross-data-center replication, and high throughput with sub-second latency.
Disadvantages
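Batching adds latency, so delivery is near real-time rather than strictly real-time; ordering is guaranteed only within a partition, not across a whole topic; Kafka has no built-in monitoring console, so monitoring relies on external tools that read the metrics it exposes; transactions were not available before version 0.11; consumers may see duplicates under at-least-once delivery; and topics usually need to be created deliberately, because automatic creation (auto.create.topics.enable) applies only default partition and replication settings.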
ZooKeeper Role
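In ZooKeeper-based deployments, ZooKeeper registers brokers and topics, supports controller election, and stores cluster metadata. Older clients also relied on it to balance producer load and to track consumer group offsets and partition ownership; newer consumers store committed offsets in the internal __consumer_offsets topic instead. Recent Kafka releases can run without ZooKeeper by using KRaft mode.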
Message Flow
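Producers send records to the leader of a partition. The leader appends them to its log, and followers fetch and replicate the data. The acks setting controls when the producer receives an acknowledgement: 0 means it does not wait, 1 means it waits for the leader's write, and all means it waits until all in-sync replicas have the record. Consumers pull data in batches, commit offsets, and pass a timeout to poll so that they block briefly instead of spinning in an empty loop when no data is available.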
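The pull, process, and commit cycle on the consumer side can be sketched as follows. This is a toy simulation under stated assumptions, not the Kafka client API: ToyPartition, consume, and crash_after_batches are hypothetical names, and the offset is committed only after a batch has been processed.

```python
class ToyPartition:
    """Hypothetical in-memory partition with one committed offset."""

    def __init__(self, records):
        self.records = list(records)
        self.committed = 0  # next offset the group resumes from

    def poll(self, position, max_records):
        # Return up to max_records starting at position (may be empty).
        return self.records[position:position + max_records]


def consume(partition, max_records=2, crash_after_batches=None):
    processed = []
    position = partition.committed
    batches = 0
    while True:
        batch = partition.poll(position, max_records)
        if not batch:
            # A real consumer would block up to its poll timeout here.
            break
        processed.extend(batch)
        position += len(batch)
        batches += 1
        if crash_after_batches is not None and batches == crash_after_batches:
            return processed  # simulated crash before the commit
        partition.committed = position  # commit after processing
    return processed


part = ToyPartition(["a", "b", "c", "d", "e"])
first = consume(part, crash_after_batches=2)
second = consume(part)
print(first, second)
```

The second run resumes from the last committed offset, so the batch that was processed but not committed is delivered again. This is the duplicate consumption that at-least-once delivery allows.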
Storage Mechanism
Each partition is stored as a directory of log segments rather than a single file. Each segment has a .log file holding the messages and a sparse .index file mapping offsets to byte positions, both named after the offset of the first message in the segment.
00000000000000000000.index
00000000000000000000.log
00000000000000170410.index
00000000000000170410.log
00000000000000239430.index
00000000000000239430.log
Kafka uses sequential disk writes, memory-mapped index files, and zero-copy transfer (sendfile) to reach throughput that can be on the order of millions of messages per second.
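Because segment files are named after their first offset, finding the segment that holds a given offset reduces to a binary search over the sorted base offsets. The following is a minimal sketch of that idea; segment_for is a hypothetical helper, not part of Kafka, and the base offsets are taken from the file names above.

```python
from bisect import bisect_right
from typing import Sequence


def segment_for(offset: int, base_offsets: Sequence[int]) -> str:
    """Return the .log file name of the segment containing offset.

    base_offsets must be sorted in ascending order.
    """
    i = bisect_right(base_offsets, offset) - 1
    if i < 0:
        raise ValueError("offset precedes the first segment")
    # Segment names are the base offset zero-padded to 20 digits.
    return f"{base_offsets[i]:020d}.log"


bases = [0, 170410, 239430]
print(segment_for(200000, bases))  # prints 00000000000000170410.log
```

Within the chosen segment, the sparse .index file can then be searched in the same way to find the nearest byte position before scanning the .log file forward.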
Partition Assignment Strategies
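RangeAssignor works topic by topic: it sorts the consumers and the topic's partitions, then gives each consumer a contiguous range, with the first consumers receiving one extra partition when the count does not divide evenly. RoundRobinAssignor lists all partitions of all subscribed topics and deals them out to the sorted consumers one at a time; the result is even when every consumer has the same subscriptions and can be uneven when subscription sets differ.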
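The two strategies can be compared with a simplified sketch. It assumes every consumer subscribes to every topic; range_assign and round_robin_assign are hypothetical functions that mimic the behaviour described above and are not the Kafka implementation.

```python
def range_assign(consumers, topics):
    """topics maps topic name -> partition count."""
    members = sorted(consumers)
    assignment = {c: [] for c in members}
    for topic in sorted(topics):
        per, extra = divmod(topics[topic], len(members))
        start = 0
        for i, c in enumerate(members):
            # The first `extra` consumers get one additional partition.
            count = per + (1 if i < extra else 0)
            assignment[c] += [(topic, p) for p in range(start, start + count)]
            start += count
    return assignment


def round_robin_assign(consumers, topics):
    members = sorted(consumers)
    assignment = {c: [] for c in members}
    all_partitions = [(t, p) for t in sorted(topics) for p in range(topics[t])]
    for i, tp in enumerate(all_partitions):
        # Deal partitions out one at a time across the sorted consumers.
        assignment[members[i % len(members)]].append(tp)
    return assignment


topics = {"t0": 3, "t1": 3}
r = range_assign(["c1", "c0"], topics)
rr = round_robin_assign(["c1", "c0"], topics)
print(r)
print(rr)
```

With two consumers and two three-partition topics, the range strategy gives c0 four partitions and c1 two, because c0 receives the extra partition of each topic. Round robin gives each consumer three.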
Reliability Guarantees
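Kafka supports at-most-once, at-least-once, and exactly-once delivery semantics. An idempotent producer (enable.idempotence=true, which requires acks=all) lets the broker discard duplicates caused by retries, giving exactly-once writes per partition within a producer session; exactly-once across multiple partitions or in consume-process-produce pipelines additionally requires transactions.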
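The deduplication idea behind idempotent producers can be sketched with sequence numbers. This is a deliberately simplified model: ToyBroker is hypothetical, and a real broker tracks sequences per producer and per partition and rejects out-of-order sequences rather than only dropping repeats.

```python
class ToyBroker:
    """Hypothetical broker that drops retried records by sequence number."""

    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> highest sequence appended

    def append(self, producer_id, seq, value):
        if seq <= self.last_seq.get(producer_id, -1):
            return False  # already appended: treat as a duplicate retry
        self.log.append(value)
        self.last_seq[producer_id] = seq
        return True


broker = ToyBroker()
broker.append("p1", 0, "order-created")
retry_accepted = broker.append("p1", 0, "order-created")  # retry after a lost ack
broker.append("p1", 1, "order-paid")
print(broker.log)
```

The retried record is acknowledged but not appended a second time, so at-least-once retries on the producer side no longer produce duplicates in the log.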
Full-Stack Internet Architecture
Introducing full-stack Internet architecture technologies centered on Java
