
What Is Kafka? A Beginner’s Guide to Distributed Streaming and Messaging

Kafka is an open‑source, distributed streaming platform that uses a publish/subscribe message queue architecture to provide high‑throughput, fault‑tolerant real‑time data processing, featuring topics, partitions, replicas, consumer groups, and multiple APIs for producers, consumers, streams, connectors, and administration.

MaGe Linux Operations

Jun 20, 2023

1. Overview of Kafka

1.1 Definition

Kafka is an open-source streaming platform originally developed at LinkedIn and now maintained by the Apache Software Foundation.

Kafka is a distributed, publish/subscribe message queue primarily used in real‑time big‑data processing.

1.2 Message Queues

1.2.1 Traditional Message Queue Use Cases

1.2.2 Why Use a Message Queue

Decoupling: Producers and consumers can be scaled or modified independently as long as both adhere to the same interface contract.

Redundancy: Messages are persisted until they are fully processed, which prevents data loss; a message is deleted only after the consumer explicitly acknowledges it.

Scalability: Processing capacity grows simply by adding more consumers.

Flexibility & Peak Handling: The queue absorbs burst traffic, so the system does not have to be provisioned for peak load at all times.

Recoverability: The failure of one component does not bring down the whole system; queued messages can still be processed once the component recovers.

Order Guarantee: Messages are processed in the order they were written; Kafka guarantees this within a partition.

Buffering: The queue evens out differences between the rate at which producers write and the rate at which consumers read.

Asynchronous Communication: Producers can enqueue messages without waiting for them to be processed.
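
The buffering and asynchronous-communication points can be illustrated with a small sketch. It uses only Python's standard library `queue.Queue` as a stand-in for a message queue; the names `buffer`, `consumer`, and `processed` are hypothetical, and nothing here talks to Kafka.

```python
import queue
import threading

buffer = queue.Queue()   # in-memory stand-in for a message queue
processed = []

def consumer():
    # Drain the queue at the consumer's own pace until the sentinel arrives.
    while True:
        item = buffer.get()
        if item is None:          # sentinel: no more messages
            break
        processed.append(item * 2)

worker = threading.Thread(target=consumer)
worker.start()

# The producer enqueues a burst without waiting for any message to be processed.
for i in range(5):
    buffer.put(i)
buffer.put(None)

worker.join()
print(processed)
```

The producer loop finishes as soon as the messages are queued; the consumer thread works through them independently, which is the decoupling the list above describes.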

1.2.3 Two Message Queue Models

Point‑to‑Point (one‑to‑one, consumer pulls data, message removed after consumption). Producers send to a queue; a single consumer retrieves and consumes each message.

Publish/Subscribe (one‑to‑many, data published to a topic and delivered to all subscribers). Producers publish to a topic; multiple consumers subscribe and receive the same messages.
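
The difference between the two models can be sketched with plain Python containers. This is a toy illustration under assumed names (`work_queue`, `subscribers`, `publish`), not how any real broker is implemented.

```python
from collections import deque

# Point-to-point: one shared queue; a message is removed once a consumer takes it.
work_queue = deque(["m1", "m2", "m3"])
consumer_a = [work_queue.popleft()]                         # takes m1
consumer_b = [work_queue.popleft(), work_queue.popleft()]   # takes m2 and m3

# Publish/subscribe: every subscriber of the topic receives its own copy.
subscribers = {"billing": [], "audit": []}

def publish(message):
    for inbox in subscribers.values():
        inbox.append(message)

for message in ["m1", "m2", "m3"]:
    publish(message)

print(consumer_a, consumer_b)   # each message went to exactly one consumer
print(subscribers)              # each subscriber saw all three messages
```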

1.3 Kafka Architecture Diagram

Producer: client that sends messages to Kafka brokers.

Consumer: client that reads messages from Kafka brokers.

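Consumer Group (CG): a set of consumers that share the work of reading a topic; each partition is read by only one consumer in the group, so consumers in the same group read from different partitions.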

Broker: a Kafka server; a cluster consists of multiple brokers, each hosting many topics.

Topic: a logical queue that categorizes messages.

Partition: an ordered, immutable sequence of messages within a topic; each message gets a unique offset.

Replica: a copy of a partition kept for fault tolerance; each partition has one leader replica and zero or more followers.

Leader: the replica that handles all reads and writes for its partition.

Follower: replicas that sync from the leader.

Offset: the numeric identifier of a message within a partition. On disk, each log segment file is named after the offset of its first message (for example, 00000000000000000000.log).

2. Hello Kafka

2.1 Getting Started

Quickstart

Chinese quick‑start guide

2.2 Core Concepts (Official Translation)

Kafka is a distributed streaming platform that is partitioned and replicated, with cluster coordination handled by ZooKeeper (newer releases can instead run in KRaft mode, without ZooKeeper). Its key strength is processing massive data streams in real time, which suits it to use cases such as feeding Hadoop batch jobs, low-latency stream processing with Storm or Spark, web log collection, and messaging services.

Three Key Capabilities

Publish and subscribe to record streams, similar to a message queue.

Persist received record streams, providing fault tolerance.

Process received record streams.

Two Main Application Types

Build reliable real‑time data pipelines between systems.

Build real‑time stream applications that transform or react to data streams.

Key concepts:

Kafka runs as a cluster on one or more servers.

Messages are stored in topics, each record containing a key, value, and timestamp.

Five Core Kafka APIs

Producer API: publish record streams to one or more topics.

Consumer API: subscribe to one or more topics and process the record streams they deliver.

Streams API: act as a stream processor that consumes input streams from topics and writes transformed output streams to other topics.

Connector API: build reusable producers and consumers that connect Kafka topics to external systems (e.g., relational databases).

Admin API: manage and inspect topics, brokers, and other Kafka objects (available in newer versions).

Kafka clients communicate with servers over a simple, high-performance, language-agnostic TCP protocol that is versioned and kept backward compatible with older versions. A Java client ships with Kafka, and clients are available for many other languages.

Topics and Logs

A topic is a collection of records of the same category. Kafka maintains a partitioned log for each topic.

Each partition is an ordered, immutable sequence of messages. Every message receives a sequential offset. Kafka guarantees order only within a partition, not across partitions.

Kafka retains all published records, whether or not they have been consumed, until a configurable retention policy (time-based or size-based) deletes the old data. Kafka's performance is effectively constant with respect to the amount of data stored, so retaining data for a long time is not a problem.
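
As a rough sketch of time-based and size-based retention, the function below trims a list of records by age and by total size. It is a simplification under stated assumptions: `apply_retention` and its parameters are hypothetical names that loosely mirror the topic settings `retention.ms` and `retention.bytes`, and real Kafka deletes whole log segments rather than individual records.

```python
def apply_retention(records, now_ms, retention_ms=None, retention_bytes=None):
    """records: list of (timestamp_ms, payload_bytes) tuples, oldest first."""
    kept = list(records)
    if retention_ms is not None:
        # Time-based: drop records older than the retention window.
        kept = [r for r in kept if now_ms - r[0] <= retention_ms]
    if retention_bytes is not None:
        # Size-based: drop the oldest records until the total fits the limit.
        while kept and sum(len(payload) for _, payload in kept) > retention_bytes:
            kept.pop(0)
    return kept

records = [(1000, b"aaaa"), (2000, b"bbbb"), (3000, b"cccc")]
by_time = apply_retention(records, now_ms=3500, retention_ms=2000)
by_size = apply_retention(records, now_ms=3500, retention_bytes=4)
print(by_time)
print(by_size)
```

Note that neither rule looks at whether a record has been consumed, which matches the retention behavior described above.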

Consumers track their position in the log via offsets, which they can reset to replay data or skip ahead.
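
A minimal in-memory sketch of a partition log and a consumer position follows. The class `PartitionLog` and its methods are hypothetical and only model the idea of sequential offsets; they are not part of any Kafka client.

```python
class PartitionLog:
    """Toy append-only log; the list index plays the role of the offset."""

    def __init__(self):
        self._records = []

    def append(self, value):
        self._records.append(value)
        return len(self._records) - 1   # offset assigned to the new record

    def read(self, offset, max_records=10):
        return self._records[offset:offset + max_records]

log = PartitionLog()
offsets = [log.append(v) for v in ["a", "b", "c", "d"]]

position = 0                     # the consumer's current offset
batch = log.read(position, 2)    # read two records from the start
position += len(batch)           # advance past what was read

position = 0                     # reset the offset to replay everything
replayed = log.read(position)

position = 3                     # or skip ahead to a later offset
latest = log.read(position)
print(offsets, batch, replayed, latest)
```

Because the log itself never changes when it is read, replaying or skipping is just a matter of changing the consumer's own position.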

Distribution

Partitions are spread across the brokers in the cluster, and each partition can be replicated on several brokers for fault tolerance. One replica acts as the leader and handles all reads and writes for the partition, while the followers passively replicate the leader. If the leader fails, one of the followers is automatically promoted to leader.
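
The leader/follower idea can be sketched as a toy model. Everything here is an assumption made for illustration: the `Partition` class is hypothetical, followers are updated synchronously on every write, and the next surviving replica simply becomes leader, whereas real Kafka replicates asynchronously and elects leaders through its controller.

```python
class Partition:
    """Toy replicated partition: one leader, the remaining replicas follow."""

    def __init__(self, replica_brokers):
        self.logs = {broker: [] for broker in replica_brokers}
        self.leader = replica_brokers[0]

    def write(self, record):
        leader_log = self.logs[self.leader]
        leader_log.append(record)
        # Followers copy the leader's log (done synchronously in this sketch).
        for broker, log in self.logs.items():
            if broker != self.leader:
                log[:] = leader_log

    def read(self):
        return list(self.logs[self.leader])

    def fail(self, broker):
        del self.logs[broker]
        if broker == self.leader:
            if not self.logs:
                raise RuntimeError("no replicas left for this partition")
            self.leader = next(iter(self.logs))   # promote a surviving follower

partition = Partition(["broker-1", "broker-2", "broker-3"])
partition.write("x")
partition.write("y")
partition.fail("broker-1")        # the leader goes down
leader_after_failure = partition.leader
data_after_failure = partition.read()
print(leader_after_failure, data_after_failure)
```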

Producer

Producers publish data to chosen topics and decide which partition each record belongs to, using round‑robin or key‑based partitioning.
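
A small sketch of the two partitioning strategies follows. It assumes a topic with four partitions and uses `zlib.crc32` as a stand-in hash; Kafka's Java client hashes keys with murmur2, so the partition numbers here will not match a real cluster. The function name `choose_partition` is hypothetical.

```python
import itertools
import zlib

NUM_PARTITIONS = 4   # assumed partition count for the example topic
_round_robin = itertools.cycle(range(NUM_PARTITIONS))

def choose_partition(key):
    """key is bytes, or None for a record without a key."""
    if key is None:
        # No key: spread records evenly across partitions.
        return next(_round_robin)
    # Keyed: the same key always hashes to the same partition.
    return zlib.crc32(key) % NUM_PARTITIONS

keyless = [choose_partition(None) for _ in range(5)]
same_key = {choose_partition(b"user-42") for _ in range(3)}
print(keyless)
print(same_key)
```

Key-based partitioning is what lets all records for one key stay in order, since they all land in the same partition.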

Consumer

Consumers belong to a consumer group; each message is delivered to one consumer instance within the group. Instances can run on separate processes or machines.

If all instances belong to the same group, the records are load-balanced among them; if instances belong to different groups, every group receives every record (broadcast).

Example: a two‑node Kafka cluster with a four‑partition topic and two consumer groups (A with two instances, B with four). Within a group, each partition is consumed by only one instance, ensuring exclusive offset tracking.
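
The example above can be sketched in a few lines. The `assign` function is a hypothetical round-robin assignment written for illustration; Kafka clients ship several assignment strategies, and this is not a reimplementation of any of them. Partition and consumer names are made up.

```python
def assign(partitions, members):
    """Give each partition exactly one owner within the group, round-robin."""
    assignment = {member: [] for member in members}
    for index, partition in enumerate(partitions):
        owner = members[index % len(members)]
        assignment[owner].append(partition)
    return assignment

partitions = ["P0", "P1", "P2", "P3"]
group_a = assign(partitions, ["A1", "A2"])
group_b = assign(partitions, ["B1", "B2", "B3", "B4"])
print(group_a)
print(group_b)
```

Each group covers all four partitions independently, so both groups see every record, while inside a group no partition has two owners. If a group had more instances than partitions, the extra instances would receive nothing.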

Kafka guarantees ordering only within a partition; to achieve global ordering, a topic must have a single partition.

Guarantees

Messages sent to a specific partition are appended in send order.

Consumers see records in the order they are stored in the log.

For a topic with a replication factor of N, Kafka tolerates up to N-1 broker failures without losing any records that were committed to the log.

2.3 Kafka Use Cases

Messaging

Kafka can replace traditional message middleware, offering higher throughput, built‑in partitioning, replication, and fault tolerance for large‑scale messaging.

Website Behavior Tracking

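Kafka's original use case was rebuilding a user-activity tracking pipeline as a set of real-time publish/subscribe feeds. Site activity such as page views and searches is published to topics, which can then be subscribed to for monitoring, real-time processing, or loading into Hadoop or an offline data warehouse.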

Metrics

Kafka aggregates statistics from distributed applications into centralized data feeds for monitoring.

Log Aggregation

Kafka serves as a log aggregation solution, centralizing server logs for low‑latency processing and supporting multiple data sources.

Stream Processing

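Many Kafka users process data in multi-stage pipelines: raw data is consumed from topics, then aggregated, enriched, or otherwise transformed and published to new topics for further consumption. Since version 0.10.0.0, Kafka has included Kafka Streams, a lightweight but powerful stream-processing library, for this purpose. Other open-source stream processors such as Apache Storm and Apache Samza can also be used.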
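
The consume, transform, publish pattern can be sketched with plain Python lists standing in for topics. This is not the Kafka Streams API (which is a Java library); `input_topic`, `output_topic`, and `process` are hypothetical names chosen for the illustration.

```python
input_topic = ["error: disk full", "info: started", "error: timeout"]
output_topic = []

def process(records):
    # One pipeline stage: keep only error records and normalize them.
    for record in records:
        if record.startswith("error"):
            yield record.upper()

# Results are published to a second "topic" for downstream consumers.
output_topic.extend(process(input_topic))
print(output_topic)
```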

Event Sourcing

Kafka’s durable log makes it an excellent backend for event‑sourced applications that record state changes over time.
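
Event sourcing can be illustrated with a short sketch in which state is never stored directly but rebuilt by replaying an ordered list of events. The account example, the event names, and the functions `apply_event` and `replay` are all hypothetical; the list stands in for a durable log.

```python
from functools import reduce

def apply_event(balance, event):
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrew":
        return balance - amount
    raise ValueError(f"unknown event type: {kind}")

def replay(events, initial=0):
    # Current state is the result of folding every event over the initial state.
    return reduce(apply_event, events, initial)

event_log = [("deposited", 100), ("withdrew", 30), ("deposited", 5)]
current_balance = replay(event_log)
balance_after_two = replay(event_log[:2])   # state at an earlier point in time
print(current_balance, balance_after_two)
```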

Commit Log

Kafka can act as an external commit log for a distributed system. The log helps replicate data between nodes and lets a failed node re-sync and restore its data; Kafka's log compaction feature supports this use.
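
The idea behind log compaction, keeping only the most recent record for each key, can be sketched as follows. This is a simplification with a hypothetical `compact` function: real Kafka compacts in the background, preserves the original offsets of surviving records, and leaves the most recent portion of the log uncompacted.

```python
def compact(log):
    """Keep only the latest record per key, preserving log order."""
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset          # later records overwrite earlier ones
    return [record for offset, record in enumerate(log)
            if latest_offset[record[0]] == offset]

log = [("k1", "v1"), ("k2", "v1"), ("k1", "v2"), ("k3", "v1"), ("k2", "v2")]
compacted = compact(log)
restored_state = dict(compacted)   # a recovering node rebuilds its state from this
print(compacted)
```

A node that replays the compacted log ends up with the same final state as one that replays the full log, which is why compaction suits the recovery use case.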
