
Kafka Overview: Architecture, Advantages, Disadvantages, and Core Concepts

This article provides a comprehensive introduction to Apache Kafka, covering its distributed publish‑subscribe architecture, its key components such as brokers, topics, partitions, producers, consumers, and ZooKeeper, as well as its advantages, drawbacks, storage mechanisms, partition assignment strategies, and reliability guarantees for high‑throughput big‑data streaming.

Full-Stack Internet Architecture

Feb 1, 2021

Kafka Overview

Kafka is a distributed publish/subscribe messaging system written in Scala and Java. It was originally developed at LinkedIn and is now maintained by the Apache Software Foundation as a high-throughput, low-latency platform for real-time data processing.

Key Concepts

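Kafka's core concepts are brokers, topics, partitions, producers, consumers, consumer groups, leaders, followers, replication, and offsets. A topic is a logical queue; each topic is split into partitions, each of which is an ordered log, and the partitions are spread across multiple brokers. Every partition has one leader replica that handles reads and writes, while follower replicas copy its data for fault tolerance. An offset identifies a record's position within its partition.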
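To make the topic/partition relationship concrete, here is a minimal sketch of how a keyed record could be mapped to a partition. The function name `choose_partition` is hypothetical, and CRC32 is only a stand-in hash: Kafka's Java client uses a murmur2-based hash for keyed records, so the actual partition numbers would differ.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index.

    Assumption: CRC32 stands in for the real hash. Kafka's Java client
    hashes keys with murmur2, so real partition numbers will differ.
    """
    return zlib.crc32(key) % num_partitions


# Records with the same key always land in the same partition, which is
# what makes per-key ordering possible.
keys = [b"user-1", b"user-2", b"user-1", b"user-3", b"user-1"]
placement = [choose_partition(k, 3) for k in keys]
```

Because the mapping depends on the partition count, changing the number of partitions of a topic generally changes where a given key lands.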

Advantages

Kafka supports multiple producers and consumers, horizontal scaling by adding brokers, data replication, topic-based categorisation of messages, batch compression, disk-based persistence, low CPU, memory, and network overhead, cross-data-center replication, and high throughput with sub-second latency.

Disadvantages

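Batching trades some latency for throughput, so delivery is near real-time rather than strictly real-time. Ordering is guaranteed only within a partition, not across a whole topic. Monitoring relies on external tools, versions before 0.11 have no transactions, consumers may see duplicate messages, and topics are often created manually.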

ZooKeeper Role

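In ZooKeeper-based deployments, ZooKeeper registers brokers and topics and stores cluster metadata. Older clients also relied on it to balance producer load, track consumer group offsets, and record which consumer owns which partition; since Kafka 0.9, consumer offsets are stored in the internal __consumer_offsets topic instead.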

Message Flow

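Producers push records to the leader of a partition. The leader appends them to its log, and followers replicate from the leader. The acks setting controls when the producer receives an acknowledgement: 0 means it does not wait at all, 1 means the leader has written the record, and all means every in-sync replica has it. Consumers pull data in batches and commit offsets; a poll timeout lets a consumer wait for data instead of looping on empty responses.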
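The consumer side of this flow can be sketched without a broker. The example below is a simplified in-memory model, not the Kafka client API: `PartitionLog` and `consume_all` are hypothetical names, and the "commit" is just a local counter.

```python
class PartitionLog:
    """In-memory stand-in for one partition (hypothetical, not a Kafka API)."""

    def __init__(self, records):
        self.records = list(records)

    def poll(self, offset, max_records):
        # Return up to max_records starting at the given offset.
        return self.records[offset:offset + max_records]


def consume_all(log, committed_offset, max_records=3):
    """Pull batches until the log is drained, committing after each batch."""
    processed = []
    while True:
        batch = log.poll(committed_offset, max_records)
        if not batch:
            # A real consumer would block here for up to its poll timeout
            # instead of spinning on empty responses.
            break
        processed.extend(batch)         # "process" the batch
        committed_offset += len(batch)  # commit only after processing
    return processed, committed_offset


log = PartitionLog(f"msg-{i}" for i in range(7))
processed, next_offset = consume_all(log, committed_offset=2)
```

Committing only after processing is what gives at-least-once behaviour: if the consumer fails between processing and committing, the same batch is pulled again on restart.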

Storage Mechanism

Each partition is stored as a directory of log segments. Each segment consists of a .log file holding the messages and a .index file mapping offsets to positions in that file; both are named after the offset of the first message in the segment.

00000000000000000000.index
00000000000000000000.log
00000000000000170410.index
00000000000000170410.log
00000000000000239430.index
00000000000000239430.log
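
Because segment files are named by their first offset, locating the segment that holds a given offset amounts to a binary search over the base offsets. The sketch below uses the three base offsets from the listing above; `segment_for` is a hypothetical helper, not part of Kafka.

```python
from bisect import bisect_right

# Base offsets taken from the segment file names listed above.
SEGMENT_BASE_OFFSETS = [0, 170410, 239430]


def segment_for(offset: int, base_offsets=SEGMENT_BASE_OFFSETS) -> str:
    """Return the .log file whose base offset is the largest one <= offset."""
    if offset < base_offsets[0]:
        raise ValueError("offset precedes the first segment")
    i = bisect_right(base_offsets, offset) - 1
    # File names are the base offset zero-padded to 20 digits.
    return f"{base_offsets[i]:020d}.log"
```

For example, `segment_for(200000)` selects `00000000000000170410.log`. The sketch does not check whether the offset lies beyond the end of the last segment; inside the chosen segment, the .index file would then narrow the search to a file position.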

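Kafka relies on sequential disk writes, memory-mapped index files together with the OS page cache, and zero-copy transfer (sendfile, where DMA moves data from the page cache to the network without copying it through user space). Together these allow a cluster to handle millions of messages per second.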

Partition Assignment Strategies

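RangeAssignor works topic by topic: it sorts the consumers and the topic's partitions, then gives each consumer a contiguous range, with the first consumers receiving one extra partition when the division is uneven. RoundRobinAssignor sorts all partitions of all subscribed topics and deals them out to consumers one at a time, which is even when every consumer has the same subscriptions and can become uneven when subscription sets differ.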
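The difference between the two strategies is easiest to see on a small case. The following is a simplified sketch for a single topic whose consumers all share the same subscription; `range_assign` and `round_robin_assign` are hypothetical names, not Kafka classes.

```python
def range_assign(partitions, consumers):
    """Give each consumer a contiguous block; early consumers absorb the remainder."""
    consumers = sorted(consumers)
    base, extra = divmod(len(partitions), len(consumers))
    result, start = {}, 0
    for i, consumer in enumerate(consumers):
        count = base + (1 if i < extra else 0)
        result[consumer] = partitions[start:start + count]
        start += count
    return result


def round_robin_assign(partitions, consumers):
    """Deal partitions out one at a time across the sorted consumers."""
    consumers = sorted(consumers)
    result = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        result[consumers[i % len(consumers)]].append(partition)
    return result


partitions = list(range(7))  # one topic with partitions 0..6
members = ["c0", "c1", "c2"]
by_range = range_assign(partitions, members)
by_round_robin = round_robin_assign(partitions, members)
```

With 7 partitions and 3 consumers, range assignment gives c0 the block 0, 1, 2, while round-robin gives it 0, 3, 6. Since range assignment is applied per topic, the consumer that sorts first collects the extra partition of every unevenly divided topic, which can skew load when a group subscribes to many topics.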

Reliability Guarantees

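Kafka provides at-most-once, at-least-once, and exactly-once delivery semantics. On the producer side, exactly-once can be approximated by combining an idempotent producer (enable.idempotence=true) with at-least-once delivery (acks=all plus retries), so that retried sends are not written to the log twice.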
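The idea behind the idempotent producer can be sketched as sequence-number deduplication on the partition leader. This is a simplified model under stated assumptions: `PartitionLeader` is a hypothetical class, and it ignores details of the real protocol such as producer epochs and the rejection of out-of-order sequence gaps.

```python
class PartitionLeader:
    """Hypothetical broker-side view of one partition (not a Kafka API)."""

    def __init__(self):
        self.log = []
        self.last_sequence = {}  # producer id -> last sequence number appended

    def append(self, producer_id, sequence, record):
        """Append unless this (producer, sequence) pair was already written."""
        if sequence <= self.last_sequence.get(producer_id, -1):
            return False  # duplicate caused by a retry: do not append again
        self.log.append(record)
        self.last_sequence[producer_id] = sequence
        return True


leader = PartitionLeader()
sends = [(0, "a"), (1, "b"), (1, "b"), (2, "c")]  # sequence 1 is retried
results = [leader.append("producer-1", seq, rec) for seq, rec in sends]
```

The retried send of sequence 1 is recognised and skipped, so the log ends up with three records even though four sends were made. This covers duplicates introduced by producer retries only; duplicates on the consumer side still depend on how offsets are committed.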

