
Mastering Kafka: High‑Throughput Distributed Messaging Explained

This comprehensive guide introduces Kafka as a high‑throughput, distributed, publish‑subscribe messaging system, detailing its core concepts, architecture, features, replication, log management, reliability guarantees, and typical use cases such as log collection, real‑time analytics, and cross‑cluster mirroring.

Programmer DD

Mar 29, 2021

Kafka Introduction

Kafka is an open‑source, high‑throughput distributed messaging system originally developed by LinkedIn.

It provides a publish‑subscribe model that can be deployed on inexpensive PC servers to build large‑scale messaging platforms.

Kafka Overview

Kafka offers high throughput and low latency (hundreds of thousands of messages per second at millisecond latency), online scaling without downtime, persistence, fault tolerance, and support for thousands of concurrent clients.

It supports both real‑time stream processing (e.g., Storm) and offline batch processing (e.g., Hadoop).

Kafka Features

High throughput, low latency: each topic can be split into multiple partitions; consumer groups consume partitions in parallel.

Scalability: clusters can be expanded without downtime.

Persistence and reliability: messages are persisted to local disk and replicated across brokers.

Fault tolerance: a partition with n replicas can tolerate up to n-1 replica failures without losing committed messages.

High concurrency: thousands of clients can read and write simultaneously.

Supports both online and offline processing scenarios.

Kafka Use Cases

Typical scenarios include:

Log collection: centralize logs from various services and expose them to consumers like Hadoop, HBase, or Solr.

Message system: decouple producers and consumers, provide buffering.

User activity tracking: capture web or app events for real‑time monitoring or offline analysis.

Operational metrics: collect distributed application data for alerts and reporting.

Stream processing: integrate with Spark Streaming or Storm.

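Event sourcing: record state changes as an ordered, replayable log of events.

Platform integration: Kafka serves as the messaging layer in platforms such as FusionInsight.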

Kafka Architecture and Features

Kafka Architecture

Kafka clusters consist of one or more brokers that handle data storage and request processing. Topics are divided into partitions, each backed by a log file on disk. Producers publish messages to brokers, while consumers read from them.

Broker : a server instance that stores data and serves client requests; clusters can scale horizontally.

Topic : a logical category for messages.

Partition : a sub‑division of a topic; each partition is an ordered, immutable sequence of messages stored in a log.

Producer : publishes messages to a broker.

Consumer : reads messages from a broker.

Consumer Group : a set of consumers that share the load of reading partitions.

ZooKeeper : stores metadata, performs leader election, and coordinates the cluster.

Kafka Topics

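A topic behaves like a queue, but ordering is guaranteed only within a partition: each partition stores its messages in FIFO order, and there is no global order across the partitions of a topic. Multiple partitions enable parallelism and high throughput.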

Kafka Partition

Partitions are stored as a series of segment files (data and index). Only the active segment is writable; older segments become read‑only. Offsets uniquely identify messages within a partition.

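The number of partitions is set when the topic is created; it can be increased later but not decreased.

Partition count determines the maximum parallelism for a consumer group: within a group, each partition is read by at most one consumer, so consumers beyond the partition count sit idle.

Example: for a topic with four partitions, consumer group A with two consumers assigns two partitions to each consumer, while group B with four consumers assigns one partition to each. Both groups read all four partitions independently of each other.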
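The example above can be sketched as a round-robin assignment of partitions to the consumers of a group. This is a simplified illustration, not Kafka's actual assignor; the helper assign_partitions and the consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin.

    Simplified stand-in for a Kafka partition assignor (hypothetical helper).
    """
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]
group_a = assign_partitions(partitions, ["a1", "a2"])
group_b = assign_partitions(partitions, ["b1", "b2", "b3", "b4"])
group_c = assign_partitions(partitions, ["c1", "c2", "c3", "c4", "c5"])

print(group_a)  # each of the two consumers owns two partitions
print(group_b)  # each of the four consumers owns one partition
print(group_c)  # the fifth consumer owns no partition and sits idle
```

Group C shows why adding consumers beyond the partition count does not increase parallelism.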

Kafka Partition Offset

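Messages are appended to the end of the log; each message's position (offset) is a long integer that uniquely identifies it within its partition. Consumers track progress by topic, partition, and offset. Reads are sequential from a chosen offset: a consumer can seek to a different offset, but messages cannot be looked up by key or content.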
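A minimal in-memory sketch of this model may help: appending returns the new offset, and a consumer's progress is just the next offset to read for a given topic and partition. The class PartitionLog and the topic name "events" are assumptions made for illustration, not Kafka APIs.

```python
class PartitionLog:
    """In-memory stand-in for one partition's append-only log."""

    def __init__(self):
        self._messages = []

    def append(self, value):
        """Append a message and return the offset it was assigned."""
        self._messages.append(value)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return (offset, value) pairs starting at `offset`, in log order."""
        batch = self._messages[offset:offset + max_messages]
        return [(offset + i, value) for i, value in enumerate(batch)]


log = PartitionLog()
for event in ["login", "click", "logout"]:
    log.append(event)

# A consumer's position is the next offset to read for (topic, partition).
position = {("events", 0): 0}  # hypothetical topic name
batch = log.read(position[("events", 0)], max_messages=2)
position[("events", 0)] = batch[-1][0] + 1  # advance past the last message read
```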

Kafka Partition Replicas

Replication is performed per partition. Each partition has one leader replica and zero or more follower replicas. Followers pull data from the leader, and the replicas that are caught up with the leader (including the leader itself) form the In-Sync Replica (ISR) set. If the leader fails, a follower from the ISR is promoted to leader.
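The failover rule can be sketched as follows. This is a simplified model that assumes only in-sync replicas are eligible; the function elect_leader and the broker ids are hypothetical, and the real controller logic handles more cases.

```python
def elect_leader(replicas, isr, failed_broker):
    """Pick a new leader after `failed_broker` goes down.

    replicas: broker ids hosting the partition, in preference order.
    isr: set of broker ids that are currently in sync.
    Returns (new_leader, new_isr); new_leader is None when no
    in-sync replica survives the failure.
    """
    new_isr = isr - {failed_broker}
    for broker in replicas:
        if broker in new_isr:
            return broker, new_isr
    return None, new_isr


# Broker 3 is lagging and therefore not part of the ISR.
leader, isr = elect_leader(replicas=[1, 2, 3], isr={1, 2}, failed_broker=1)
no_leader, empty_isr = elect_leader(replicas=[1, 3], isr={1}, failed_broker=1)
```

In the second call the only surviving replica is out of sync, so this sketch leaves the partition without a leader rather than risk losing messages.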

Leader and Follower Data Synchronization

Followers use ReplicaFetcher threads to pull data from the leader in batches, which greatly improves throughput. By default, producers and consumers interact only with the leader.
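To illustrate why batching matters, the sketch below counts how many fetch requests a follower needs to catch up with the leader at different batch sizes. The logs are plain lists and the function names are hypothetical; real fetches are bounded by bytes rather than message counts.

```python
def fetch_batch(leader_log, fetch_offset, max_batch):
    """Leader side: return up to `max_batch` messages starting at `fetch_offset`."""
    return leader_log[fetch_offset:fetch_offset + max_batch]


def replicate(leader_log, follower_log, max_batch):
    """Follower side: pull until caught up; return the number of fetch requests."""
    requests = 0
    while len(follower_log) < len(leader_log):
        # The follower's log length is the next offset it needs.
        batch = fetch_batch(leader_log, len(follower_log), max_batch)
        follower_log.extend(batch)
        requests += 1
    return requests


leader_log = list("abcdefg")
batched_follower, single_follower = [], []
batched_requests = replicate(leader_log, batched_follower, max_batch=3)
single_requests = replicate(leader_log, single_follower, max_batch=1)
```

Both followers end up with identical logs, but the batched one needs far fewer round trips to the leader.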

Kafka Logs

Each partition's log is split into segments. When the active segment reaches a configured size or age, a new segment is created. Each segment consists of a data file (.log) and an index file (.index). The index is sparse: it holds an entry only for every few kilobytes of log data rather than for every message, which keeps it small enough to be held in memory (it is memory-mapped) for fast look-ups.
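The look-up path can be sketched as follows: find the segment by base offset, use the sparse index to jump close to the target, then scan forward. This is an illustration under simplifying assumptions: segments roll after a fixed number of records and the index gets an entry every few records, whereas Kafka uses byte sizes and time. The classes Segment and SegmentedLog are hypothetical.

```python
import bisect


class Segment:
    def __init__(self, base_offset, index_interval):
        self.base_offset = base_offset
        self.index_interval = index_interval
        self.records = []  # stand-in for the .log file: (offset, value)
        self.index = []    # stand-in for the .index file: (offset, position)

    def append(self, offset, value):
        position = len(self.records)
        if position % self.index_interval == 0:
            self.index.append((offset, position))  # sparse: not every record
        self.records.append((offset, value))

    def find(self, offset):
        # Largest index entry at or below the target, then scan forward.
        i = bisect.bisect_right(self.index, (offset, float("inf"))) - 1
        if i < 0:
            return None
        _, position = self.index[i]
        for record_offset, value in self.records[position:]:
            if record_offset == offset:
                return value
        return None


class SegmentedLog:
    def __init__(self, max_records_per_segment=4, index_interval=2):
        self.max_records = max_records_per_segment
        self.index_interval = index_interval
        self.segments = [Segment(0, index_interval)]
        self.next_offset = 0

    def append(self, value):
        active = self.segments[-1]
        if len(active.records) >= self.max_records:
            # Roll: the old segment becomes read-only, a new one becomes active.
            active = Segment(self.next_offset, self.index_interval)
            self.segments.append(active)
        active.append(self.next_offset, value)
        self.next_offset += 1
        return self.next_offset - 1

    def read(self, offset):
        bases = [segment.base_offset for segment in self.segments]
        i = bisect.bisect_right(bases, offset) - 1
        if i < 0:
            return None
        return self.segments[i].find(offset)


log = SegmentedLog(max_records_per_segment=4, index_interval=2)
for n in range(10):
    log.append(f"m{n}")
```

With these settings the ten messages land in three segments, and reading offset 5 touches only the second segment and scans at most two records.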

Kafka Log Cleanup

Two cleanup policies exist: delete (based on age or total size) and compact (retain only the latest value for each key).
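Both policies can be sketched on simplified data. The delete sketch drops whole closed segments by age and then by total size, always keeping the active segment; the compact sketch keeps the newest record per key. The dictionary fields and function names are assumptions for illustration, not Kafka's internal structures.

```python
def apply_delete_policy(segments, now, retention_seconds=None, retention_bytes=None):
    """Drop whole old segments (oldest first); the last (active) one is always kept."""
    *closed, active = segments
    if retention_seconds is not None:
        closed = [s for s in closed if now - s["last_timestamp"] <= retention_seconds]
    if retention_bytes is not None:
        while closed and sum(s["size"] for s in closed) + active["size"] > retention_bytes:
            closed.pop(0)
    return closed + [active]


def compact(records):
    """Keep only the latest (offset, key, value) per key, in offset order."""
    latest = {}
    for offset, key, value in records:
        latest[key] = (offset, key, value)  # later offsets overwrite earlier ones
    return sorted(latest.values())


segments = [
    {"base_offset": 0, "last_timestamp": 100, "size": 50},
    {"base_offset": 10, "last_timestamp": 500, "size": 50},
    {"base_offset": 20, "last_timestamp": 900, "size": 10},  # active segment
]
by_age = apply_delete_policy(segments, now=1000, retention_seconds=600)
by_size = apply_delete_policy(segments, now=1000, retention_bytes=70)

records = [(0, "user1", "a"), (1, "user2", "b"), (2, "user1", "c")]
compacted = compact(records)
```

Note that compaction preserves the original offsets of the surviving records, so offsets in a compacted log are no longer contiguous.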

Kafka Data Reliability

All messages are persisted to disk, and replicating each partition across brokers ensures durability. Delivery guarantees include:

At most once: possible loss, no duplicates.

At least once: no loss, possible duplicates.

Exactly once: no loss, no duplicates.
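On the consumer side, the difference between the first two guarantees comes down to whether the offset is committed before or after a message is processed. The simulation below crashes a consumer once between those two steps and restarts it from the committed offset. It is a toy model with hypothetical function names. The last lines show one common way to get an exactly-once effect, an idempotent sink keyed by offset, which is an assumption of this sketch and not a description of Kafka's own transactional mechanism.

```python
def consume(log, committed, commit_first, crash_at=None):
    """Process messages from `committed` onward; return (processed, new_committed).

    commit_first=True  -> commit the offset, then process (at most once).
    commit_first=False -> process, then commit the offset (at least once).
    crash_at: offset at which the consumer dies between the two steps.
    """
    processed = []
    offset = committed
    while offset < len(log):
        if commit_first:
            committed = offset + 1
            if offset == crash_at:
                return processed, committed  # committed but never processed
            processed.append((offset, log[offset]))
        else:
            processed.append((offset, log[offset]))
            if offset == crash_at:
                return processed, committed  # processed but never committed
            committed = offset + 1
        offset += 1
    return processed, committed


def run_with_one_crash(log, commit_first, crash_at):
    """Run until the crash, then restart from the last committed offset."""
    before, committed = consume(log, 0, commit_first, crash_at)
    after, _ = consume(log, committed, commit_first)
    return before + after


log = ["a", "b", "c"]
at_most_once = run_with_one_crash(log, commit_first=True, crash_at=1)
at_least_once = run_with_one_crash(log, commit_first=False, crash_at=1)

# Idempotent sink: writing the same offset twice has no extra effect.
sink = {}
for offset, value in at_least_once:
    sink[offset] = value
```

With commit-first the message at offset 1 is lost; with process-first it is handled twice; the idempotent sink absorbs the duplicate.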

Key Kafka Processes

Write Process

Producers look up the leader for a given topic-partition by requesting metadata from a broker (the brokers track leadership through ZooKeeper), then send messages directly to the leader's broker.

Custom partitioning functions can route messages based on keys.
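A key-based partitioning function usually boils down to hashing the key and taking the result modulo the partition count, so that all messages with the same key land in the same partition and keep their relative order. The sketch below uses CRC32 as a stand-in hash; that choice, the function name choose_partition, and the sample keys are assumptions for illustration and do not reproduce Kafka's built-in partitioner.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a message key to a partition: hash(key) mod partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


orders = [("user-1", "created"), ("user-2", "created"), ("user-1", "paid")]
partitions = {p: [] for p in range(4)}
for key, value in orders:
    partitions[choose_partition(key, 4)].append((key, value))
```

Changing the number of partitions changes the mapping, which is one reason to choose the partition count carefully for keyed topics.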

Read Process

Consumers connect to the leader broker of the assigned partition and pull messages.

Kafka Directory Structure in ZooKeeper

Role of ZooKeeper in Kafka

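ZooKeeper stores metadata for the cluster and its clients, such as broker registrations, topic and partition state, and (for older consumers) group membership and offsets, and keeps it highly available.

It provides coordination for leader election and load balancing.

Kafka components can remain stateless because ZooKeeper holds the shared state, including the subscription relationships between consumers and topics.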

ZooKeeper Shell

Use zkCli.sh to connect to a running ZooKeeper instance and issue commands such as ls and get to inspect Kafka metadata, for example ls /brokers/ids.

Kafka in ZooKeeper

Kafka organizes its metadata in ZooKeeper as a hierarchy of znodes, with paths such as /brokers/ids, /brokers/topics, /controller, and /config.
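To give a feel for that hierarchy, the sketch below models a few znodes as an in-memory dictionary and mimics the ls and get commands of the ZooKeeper shell. The paths follow Kafka's ZooKeeper layout, but the values are trimmed, made-up examples, and the functions are local stand-ins that never contact a real ZooKeeper.

```python
import json

# Illustrative znode tree: real paths, made-up and heavily trimmed values.
ZNODES = {
    "/brokers/ids/0": json.dumps({"host": "broker0.example", "port": 9092}),
    "/brokers/ids/1": json.dumps({"host": "broker1.example", "port": 9092}),
    "/brokers/topics/events/partitions/0/state": json.dumps({"leader": 0, "isr": [0, 1]}),
    "/controller": json.dumps({"brokerid": 0}),
}


def ls(path):
    """Return the names of the direct children of `path`, like the shell's ls."""
    prefix = path.rstrip("/") + "/"
    children = {p[len(prefix):].split("/")[0] for p in ZNODES if p.startswith(prefix)}
    return sorted(children)


def get(path):
    """Return the data stored at `path`, or None if the znode has no data here."""
    return ZNODES.get(path)


broker_ids = ls("/brokers/ids")
state = json.loads(get("/brokers/topics/events/partitions/0/state"))
```

Here ls("/brokers/ids") lists the registered brokers, and the partition state node shows which broker leads the partition and which replicas are in sync.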

Kafka Cluster Mirroring

Kafka’s MirrorMaker tool enables cross‑cluster data replication by consuming from a source cluster and producing to a target cluster.
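Conceptually, mirroring is a consume-then-produce loop. The sketch below models both clusters as plain lists and copies messages in batches while tracking the position in the source. It is a toy model with a hypothetical mirror function, not MirrorMaker's implementation or configuration.

```python
def mirror(source_log, target_log, position, max_batch=100):
    """Copy one batch from source to target; return the new source position."""
    batch = source_log[position:position + max_batch]  # "consume" from the source
    target_log.extend(batch)                           # "produce" to the target
    return position + len(batch)                       # position to commit on the source


source_cluster = ["e0", "e1", "e2", "e3", "e4"]
target_cluster = []
position = 0
while position < len(source_cluster):
    position = mirror(source_cluster, target_cluster, position, max_batch=2)
```

Because the position advances only after a batch has been written to the target, a crash between the two steps would re-send that batch on restart, which corresponds to at-least-once delivery.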

