
What Is Kafka? A Deep Dive into Distributed Streaming and Messaging

Kafka is a distributed streaming platform hosted by the Apache Software Foundation and originally developed at LinkedIn. It provides high‑throughput, durable, publish‑subscribe messaging. This article explains its core concepts, message system classifications, architecture components, APIs, replication, consumer groups, and guarantees, and compares it with other messaging solutions.

MaGe Linux Operations

What is Kafka?

Kafka is a distributed streaming platform hosted by the Apache Software Foundation: a high‑throughput, durable, publish‑subscribe messaging system. It was originally developed at LinkedIn, is written in Scala and Java, was open‑sourced in early 2011, and became an Apache top‑level project in October 2012. It is mainly used to process activity‑stream data from large‑scale consumer websites.

Message System Classification

Common message systems include Kafka, RabbitMQ, ActiveMQ, etc., which use two patterns: Peer‑to‑Peer (Queue) and Publish/Subscribe (Topic).

Peer‑to‑Peer (Queue)

In the PTP queue model, a producer sends a message to a queue; a single consumer retrieves and consumes the message, after which the queue no longer stores it.

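Terminology:

Producer: the application that creates and sends messages.
Queue: the buffer that holds messages until one consumer retrieves them.
Consumer: the application that receives and processes messages.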

How PTP works:

Producer1 produces a message to a Queue, Consumer1 consumes it.

After consumption, the Queue no longer stores the message; other consumers cannot consume it.

Multiple producers can write to the same Queue, but each message is consumed by only one consumer.

If no consumer exists, the Queue retains the message until a consumer appears.
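The queue behaviour listed above can be sketched in a few lines of Python. This is a minimal in-memory illustration built on the standard library's `collections.deque`, not any broker's API; the class name `PointToPointQueue` and its methods are hypothetical.

```python
from collections import deque


class PointToPointQueue:
    """In-memory stand-in for a PTP queue (hypothetical, for illustration only)."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        # Any number of producers may append to the same queue.
        self._messages.append(message)

    def receive(self):
        # The message is removed on delivery, so no other consumer can get it.
        # Returns None when the queue is empty.
        return self._messages.popleft() if self._messages else None


queue = PointToPointQueue()
queue.send("order-1")   # producer 1
queue.send("order-2")   # producer 2

# No consumer was connected while the messages arrived; the queue kept them.
first = queue.receive()   # consumer 1 gets "order-1"
second = queue.receive()  # consumer 2 gets "order-2", never "order-1" again
third = queue.receive()   # nothing left, so None
```

Removing the message inside `receive` is what separates this model from a retained log: once delivered, the message is gone for everyone else.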

Publish/Subscribe (Topic)

In the publish/subscribe model, a publisher posts a message to a topic and all subscribers can consume it.

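Terminology:

Publisher: the application that posts messages to a topic.
Topic: the named channel that messages are published to.
Subscriber: an application that registers interest in a topic and receives its messages.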

How Pub/Sub works:

The Publisher publishes a message to a Topic; multiple Subscribers can consume it.

All Subscribers receive the message.

The Publisher does not receive an error if no subscriber exists.

A Publisher must exist before Subscribers.
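A minimal in-memory sketch of the fan-out described above, assuming subscribers are plain callbacks. The `Topic` class here is hypothetical and unrelated to any real client library.

```python
class Topic:
    """In-memory stand-in for a pub/sub topic (hypothetical, for illustration only)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        # Every current subscriber receives its own copy of the message.
        for callback in self._subscribers:
            callback(message)
        # Returning the delivery count shows that zero subscribers is not an error.
        return len(self._subscribers)


topic = Topic()
delivered_before = topic.publish("nobody is listening")  # no subscribers, no error

inbox_a, inbox_b = [], []
topic.subscribe(inbox_a.append)
topic.subscribe(inbox_b.append)
delivered_after = topic.publish("price-update")  # both subscribers get a copy
```

This sketch drops messages that arrive before anyone subscribes. Kafka behaves differently on that point because it retains records on disk, as the Topic and Log section explains.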

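Note: Kafka uses the publish/subscribe model. Its consumer groups also give queue‑style load balancing within a single group, as described in the Consumers section.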

Common Message System Comparison

RabbitMQ: Erlang‑based, supports multiple protocols (AMQP, XMPP, SMTP, STOMP), supports both PTP and Pub/Sub.

Redis: Key‑Value NoSQL database with lightweight MQ capabilities; better performance for short messages.

ZeroMQ: Lightweight library, PTP style, requires custom integration.

ActiveMQ: JMS implementation, supports both PTP (queues) and Pub/Sub (topics), with persistence and XA transactions.

Kafka/Jafka: High‑performance, cross‑language distributed Pub/Sub system with persistence and both online and offline processing.

MetaQ/RocketMQ: Pure Java Pub/Sub system, supports local and XA distributed transactions.

Kafka Introduction

Three main characteristics:

High throughput: can handle millions of messages per second.

Durability: robust storage mechanism ensures data persistence.

Distributed: data is replicated across multiple servers for fault tolerance.

Key concepts:

Kafka runs as a cluster on one or more servers, possibly across data centers.

Records are stored in Topics, each divided into Partitions.

Each record consists of a key, value, and timestamp.

Core APIs (four):

Producer API – publishes records to one or more Topics.

Consumer API – subscribes to Topics and processes record streams.

Streams API – processes input streams from Topics and produces output streams.

Connector API – builds reusable producers or consumers to connect Topics with external systems.

Kafka Architecture Overview

Key components:

Producer: sends messages (Push) to a Broker’s Topic.

Broker: a Kafka node that creates Topics, stores messages, and persists them to disk.

Topic: logical category; a Topic contains one or more Partitions.

Partition: ordered, immutable sequence of records; each record has an offset that is unique within its Partition.

Consumer: pulls messages from subscribed Topics.

ZooKeeper: maintains cluster state and coordinates high availability. Newer Kafka releases can run without ZooKeeper by using the built‑in KRaft mode.

Producers do not interact directly with ZooKeeper; they obtain cluster metadata from the Brokers.

Topic and Log

Each Topic consists of one or more Partition logs. Partitions store ordered records; new records are appended with increasing offsets. Kafka retains all records on disk for a configurable retention period, regardless of consumption.
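The append-only log and its time-based retention can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions: the `Record` and `PartitionLog` names are hypothetical, timestamps are passed in by the caller, and records are expired one at a time, whereas Kafka deletes whole log segments.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    key: str
    value: str
    timestamp: float  # seconds; supplied by the caller in this sketch


@dataclass
class PartitionLog:
    """Append-only log for one partition (hypothetical, in-memory only)."""
    retention_seconds: float
    base_offset: int = 0            # offset of the oldest retained record
    records: list = field(default_factory=list)

    def append(self, record):
        # New records always go at the end and get the next offset.
        offset = self.base_offset + len(self.records)
        self.records.append(record)
        return offset

    def read(self, offset):
        # Reading removes nothing; any consumer can re-read an offset.
        index = offset - self.base_offset
        if 0 <= index < len(self.records):
            return self.records[index]
        return None

    def enforce_retention(self, now):
        # Drop records older than the retention period, consumed or not.
        while self.records and now - self.records[0].timestamp > self.retention_seconds:
            self.records.pop(0)
            self.base_offset += 1


log = PartitionLog(retention_seconds=60)
o0 = log.append(Record("user-1", "login", timestamp=0))
o1 = log.append(Record("user-2", "click", timestamp=50))

read_twice = (log.read(o0).value, log.read(o0).value)  # still there after a read
log.enforce_retention(now=100)  # the record written at t=0 is now too old
```

Offsets keep increasing after old records are dropped, so a consumer's stored position stays meaningful even when the start of the log has been trimmed.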

Distribution

Partitions are distributed across Brokers; each Partition has a leader that handles all reads and writes, while followers replicate the leader. If a leader fails, a follower is promoted.
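Leader failover can be sketched as follows. The model is deliberately simplified and hypothetical: followers are updated synchronously inside `write`, and the next surviving replica is promoted, whereas in Kafka followers fetch from the leader and only in-sync replicas are eligible for election.

```python
class Partition:
    """One partition replicated on several brokers (hypothetical sketch)."""

    def __init__(self, replica_brokers):
        # Each broker id maps to its copy of the log.
        self.replicas = {broker: [] for broker in replica_brokers}
        self.leader = replica_brokers[0]

    def write(self, record):
        # Writes go to the leader; followers then copy the record.
        self.replicas[self.leader].append(record)
        for broker, log in self.replicas.items():
            if broker != self.leader:
                log.append(record)

    def read_all(self):
        # Reads are served by the leader.
        return list(self.replicas[self.leader])

    def fail(self, broker):
        # Remove the failed broker; if it was the leader, promote a follower.
        del self.replicas[broker]
        if broker == self.leader:
            if not self.replicas:
                raise RuntimeError("no replica left to promote")
            self.leader = next(iter(self.replicas))


partition = Partition(replica_brokers=["broker-1", "broker-2", "broker-3"])
partition.write("event-a")
partition.write("event-b")

partition.fail("broker-1")     # the leader goes down
new_leader = partition.leader  # a surviving follower takes over
partition.write("event-c")     # writes continue on the new leader
```

Because the promoted follower already holds a full copy of the log, clients see the same records before and after the failure.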

Geo‑Replication

MirrorMaker provides cross‑data‑center replication, enabling active/passive backup and reducing latency for users.

Producers

Producers publish data to chosen Topics and decide which Partition to write to, using round‑robin or key‑based partitioning for load balancing.
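The two partitioning strategies mentioned above might look like this in outline. The `Partitioner` class is hypothetical, and `zlib.crc32` is only a stand-in hash chosen because it is deterministic and in the standard library; it is not the hash the Kafka client uses.

```python
import itertools
import zlib


class Partitioner:
    """Chooses a partition for each record (hypothetical sketch)."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition_for(self, key=None):
        if key is None:
            # No key: spread records evenly across partitions.
            return next(self._round_robin)
        # With a key: the same key always maps to the same partition.
        # crc32 is a stand-in; it is not the hash Kafka's client uses.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions


partitioner = Partitioner(num_partitions=3)
keyless = [partitioner.partition_for() for _ in range(6)]  # cycles 0, 1, 2, 0, 1, 2
first = partitioner.partition_for("user-42")
second = partitioner.partition_for("user-42")  # same key, same partition
```

Key-based partitioning matters for ordering: since order is only guaranteed within a partition, sending all records for one key to the same partition keeps that key's records in sequence.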

Consumers

Consumers belong to Consumer Groups; records are load‑balanced among group members. Multiple groups can consume the same Topic independently.
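Independent consumption by several groups comes down to each group keeping its own read position. A minimal sketch, assuming a single partition held as a Python list; the `GroupOffsets` class and the group names are hypothetical.

```python
class GroupOffsets:
    """Tracks one read position per consumer group for a single partition
    (hypothetical sketch)."""

    def __init__(self, log):
        self.log = log
        self.positions = {}  # group name -> next offset to read

    def poll(self, group, max_records=1):
        start = self.positions.get(group, 0)
        batch = self.log[start:start + max_records]
        # Advancing one group's position leaves every other group untouched.
        self.positions[group] = start + len(batch)
        return batch


log = ["r0", "r1", "r2"]
offsets = GroupOffsets(log)

billing_first = offsets.poll("billing", max_records=2)   # ["r0", "r1"]
analytics_first = offsets.poll("analytics")              # ["r0"]
billing_second = offsets.poll("billing", max_records=2)  # ["r2"]
```

The "analytics" group starts from the beginning even though "billing" has already read past those records, which is the pub/sub side of Kafka's model.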

Consumer Group

Consumer groups enable scaling and fault tolerance; partitions are dynamically assigned to group members, and rebalancing occurs when members join or leave.
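Rebalancing can be pictured as recomputing the partition-to-member mapping whenever membership changes. The function below is a simplified round-robin stand-in for a group assignment strategy, with hypothetical consumer names; it recomputes everything from scratch and makes no attempt to minimise partition movement.

```python
def assign_partitions(partitions, members):
    """Spread partitions over group members round-robin.

    A simplified stand-in for a group assignment strategy; returns
    {member: [partition, ...]}.
    """
    assignment = {member: [] for member in sorted(members)}
    if not assignment:
        return assignment
    ordered_members = list(assignment)
    for index, partition in enumerate(sorted(partitions)):
        owner = ordered_members[index % len(ordered_members)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

before = assign_partitions(partitions, {"c1", "c2"})
# A third consumer joins the group: partitions are reassigned.
after_join = assign_partitions(partitions, {"c1", "c2", "c3"})
# c1 crashes or leaves: its partitions move to the remaining members.
after_leave = assign_partitions(partitions, {"c2", "c3"})
```

Each partition has exactly one owner within the group at any time, so adding members beyond the number of partitions would leave the extra members idle in this model.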

Guarantees

Messages sent by a producer to a specific Partition are appended in order.

Consumers read records in the order stored in the log.

With a replication factor of N, Kafka tolerates up to N‑1 broker failures without losing any records that have been committed to the log.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
