Why Kafka Dominates Modern Data Pipelines: Architecture, Benefits, and Guarantees

Kafka, the open‑source distributed messaging system from LinkedIn, offers O(1) persistence, high throughput, partitioned topics, and flexible delivery guarantees, making it a cornerstone for modern big‑data pipelines and real‑time processing alongside Hadoop, Spark, and Storm.

21CTO

Feb 23, 2016

Abstract

Kafka is an open‑source distributed messaging system originally developed at LinkedIn. It offers high throughput, O(1) persistence, partitioned topics, and strong delivery guarantees, and integrates with Hadoop, Storm, and Spark.

Background

Creation Background

Kafka was built at LinkedIn to support activity streams and operational data pipelines, handling massive page‑view logs and server metrics that require scalable, low‑latency infrastructure.

Overview

Kafka is a distributed publish/subscribe messaging system. Its design goals are constant-time (O(1)) message persistence, high throughput (more than 100,000 messages per second on inexpensive commodity hardware), message ordering within each partition, support for both offline and real-time processing, and horizontal scalability.

Why Use a Message System

Decoupling – Allows independent evolution of producers and consumers.

Redundancy – Persists messages until they are fully processed.

Scalability – Simple to increase ingestion and processing rates.

Flexibility & Peak Handling – Handles traffic spikes without over‑provisioning.

Recoverability – Failure of a component does not halt the whole system.

Ordering Guarantees – Preserves order within a partition.

Buffering – Smooths differences in processing speeds.

Asynchronous Communication – Producers can fire‑and‑forget.

Comparison with Other Message Queues

RabbitMQ – Heavyweight, broker‑based, supports many protocols.

Redis – Key‑value store with lightweight queue capabilities; excels with small messages.

ZeroMQ – Fast, broker‑less, but lacks persistence.

ActiveMQ – Apache project offering broker and peer‑to‑peer modes.

Kafka / Jafka – High‑performance, O(1) persistence, horizontal scaling, integrates with Hadoop.

Kafka Architecture

Terminology

Broker – A server in a Kafka cluster.

Topic – A category of messages.

Partition – A physical subdivision of a topic; each partition is an ordered log stored on disk as a series of segment files.

Producer – Publishes messages to brokers.

Consumer – Reads messages from brokers.

Consumer Group – A set of consumers that share the consumption of a topic.

Topology

A typical Kafka cluster consists of multiple producers, brokers, consumer groups, and a ZooKeeper ensemble that manages metadata and leader election.

Topic & Partition

Topics are logical queues; each topic is split into one or more partitions, each stored as a set of log segments. Every message receives a 64‑bit offset that determines its position.

Log entries consist of a 4‑byte length, a 1‑byte magic value, a 4‑byte CRC, and the payload. Segments are named by the first offset and have accompanying index files.
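Because each segment is named after its first offset, the segment that holds a given offset can be found with a floor lookup over the base offsets. The following is a minimal sketch of that idea, not Kafka's own code: the class SegmentIndexSketch and its methods are hypothetical, and the 20-digit zero-padded .log file name is an assumption based on how Kafka 0.8 and later lay out segment files.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Kafka: maps base offsets to segment file names.
final class SegmentIndexSketch {
    // Base offset of each segment -> file name, kept sorted by base offset.
    private final TreeMap<Long, String> segments = new TreeMap<>();

    // Assumed naming scheme: first offset in the segment, zero-padded to 20 digits.
    static String segmentFileName(long baseOffset) {
        return String.format(Locale.ROOT, "%020d.log", baseOffset);
    }

    void addSegment(long baseOffset) {
        segments.put(baseOffset, segmentFileName(baseOffset));
    }

    // The segment holding an offset is the one with the largest base offset <= that offset.
    String segmentFor(long offset) {
        Map.Entry<Long, String> entry = segments.floorEntry(offset);
        if (entry == null) {
            throw new IllegalArgumentException("offset " + offset + " precedes the first segment");
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        SegmentIndexSketch log = new SegmentIndexSketch();
        log.addSegment(0L);
        log.addSegment(368769L);
        log.addSegment(737337L);
        // Offset 500000 lies between base offsets 368769 and 737337.
        System.out.println(log.segmentFor(500000L));
    }
}
```

A sorted map keeps the lookup logarithmic in the number of segments; the accompanying index file would then narrow the search to a position inside the chosen segment.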

Kafka retains all messages (subject to time‑ or size‑based retention policies) rather than deleting consumed messages.
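The retention rule can be illustrated with a small sketch in which whole segments, oldest first, become eligible for deletion once they are too old or the log is too large, regardless of whether anyone has consumed them. This is a simplified model under stated assumptions: RetentionSketch, Segment, and the parameter names are hypothetical, and the real broker applies its own checks driven by settings such as log.retention.hours and log.retention.bytes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of time- and size-based retention; not Kafka's implementation.
final class RetentionSketch {
    // Minimal segment descriptor assumed for this sketch.
    static final class Segment {
        final long baseOffset;
        final long sizeBytes;
        final long lastModifiedMs;

        Segment(long baseOffset, long sizeBytes, long lastModifiedMs) {
            this.baseOffset = baseOffset;
            this.sizeBytes = sizeBytes;
            this.lastModifiedMs = lastModifiedMs;
        }
    }

    // Returns the segments to delete, oldest first. The last (active) segment is
    // never deleted. Consumption state plays no part in the decision.
    static List<Segment> expired(List<Segment> segments, long nowMs,
                                 long retentionMs, long retentionBytes) {
        List<Segment> toDelete = new ArrayList<>();
        long totalBytes = 0;
        for (Segment s : segments) {
            totalBytes += s.sizeBytes;
        }
        for (int i = 0; i < segments.size() - 1; i++) {
            Segment s = segments.get(i);
            boolean tooOld = nowMs - s.lastModifiedMs > retentionMs;
            boolean tooBig = totalBytes > retentionBytes;
            if (!tooOld && !tooBig) {
                break; // newer segments are neither older nor over the size limit
            }
            toDelete.add(s);
            totalBytes -= s.sizeBytes;
        }
        return toDelete;
    }
}
```

Deleting whole segments instead of individual messages keeps cleanup cheap, which is one reason the log is split into segments in the first place.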

Producer Message Routing

Producers assign each message to a partition based on its key and the configured partitioner. The default number of partitions for a new topic is set by num.partitions in $KAFKA_HOME/config/server.properties. A custom partitioner class must implement kafka.producer.Partitioner:

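import kafka.producer.Partitioner;
import kafka.utils.VerifiableProperties;

// Routes numeric string keys by value and all other keys by hash code.
public class JasonPartitioner implements Partitioner {
    // The old producer API expects a constructor that takes VerifiableProperties.
    public JasonPartitioner(VerifiableProperties verifiableProperties) {}

    @Override
    public int partition(Object key, int numPartitions) {
        if (key == null) {
            return 0; // defensive: avoid a NullPointerException when no key is given
        }
        try {
            // Numeric keys such as "42" map to 42 % numPartitions.
            return Math.abs(Integer.parseInt(key.toString()) % numPartitions);
        } catch (NumberFormatException e) {
            // Non-numeric keys fall back to their hash code.
            return Math.abs(key.hashCode() % numPartitions);
        }
    }
}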

When the above partitioner is used, messages with the same key are sent to the same partition.
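To see the routing rule without a running cluster, the same arithmetic can be exercised in isolation. The sketch below is a standalone illustration: KeyRoutingSketch and partitionFor are hypothetical names, the method only mirrors the key-to-partition rule of the partitioner shown above, and it does not depend on any Kafka class.

```java
// Standalone illustration of the key-to-partition rule; no Kafka dependency.
final class KeyRoutingSketch {
    // Numeric keys are routed by value, all other keys by hash code.
    static int partitionFor(String key, int numPartitions) {
        try {
            return Math.abs(Integer.parseInt(key) % numPartitions);
        } catch (NumberFormatException e) {
            return Math.abs(key.hashCode() % numPartitions);
        }
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        String[] keys = {"7", "12", "7", "user-42", "user-42"};
        for (String key : keys) {
            // Repeated keys always print the same partition number.
            System.out.println(key + " -> partition " + partitionFor(key, numPartitions));
        }
    }
}
```

Since the result depends only on the key and the partition count, ordering is preserved per key as long as the number of partitions does not change.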

Consumer Group

With the high‑level API, a message in a topic can be consumed by only one consumer within a group, while multiple groups can read the same message, enabling both broadcast and unicast semantics.
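One way to picture these semantics is that every group gets its own assignment of partitions to consumers, and within a group each partition has exactly one owner. The sketch below assumes a simple round-robin rule chosen for brevity; it is not claimed to be Kafka's assignment algorithm, and ConsumerGroupSketch and assign are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of partition ownership inside one consumer group.
final class ConsumerGroupSketch {
    // Gives each partition exactly one owner within the group (round-robin assumption).
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        if (consumers.isEmpty()) {
            throw new IllegalArgumentException("a group needs at least one consumer");
        }
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        // Two independent groups read the same topic: every message reaches both groups
        // (broadcast), but only one consumer inside each group (unicast).
        System.out.println("billing: " + assign(Arrays.asList("b1", "b2"), numPartitions));
        System.out.println("audit:   " + assign(Arrays.asList("a1"), numPartitions));
    }
}
```

Putting all consumers in one group yields queue-like unicast delivery, while giving each consumer its own group yields broadcast.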

Push vs. Pull

Producers push messages to brokers, and consumers pull messages from brokers. Pulling lets each consumer control its own consumption rate, which avoids the overload that can occur when a broker pushes messages faster than a consumer can process them.
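The pull side can be sketched with an in-memory stand-in for one partition: the consumer keeps its own offset and asks for at most a fixed number of messages per request, so it is never handed more than it requested. PullSketch, append, and fetch are hypothetical names for this illustration and do not correspond to Kafka's client API.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory stand-in for one partition; the list index plays the role of the offset.
final class PullSketch {
    private final List<String> log = new ArrayList<>();

    // Producer side: messages are pushed (appended) as they are produced.
    void append(String message) {
        log.add(message);
    }

    // Consumer side: returns at most maxMessages starting at the given offset.
    List<String> fetch(long offset, int maxMessages) {
        int from = (int) Math.max(0, Math.min(offset, log.size()));
        int to = Math.min(from + Math.max(0, maxMessages), log.size());
        return new ArrayList<>(log.subList(from, to));
    }

    public static void main(String[] args) {
        PullSketch partition = new PullSketch();
        for (int i = 0; i < 5; i++) {
            partition.append("msg-" + i);
        }
        long offset = 0; // the consumer, not the broker, tracks its position
        List<String> batch;
        while (!(batch = partition.fetch(offset, 2)).isEmpty()) {
            System.out.println("pulled " + batch);
            offset += batch.size(); // advance only after handling the batch
        }
    }
}
```

A slow consumer simply calls fetch less often or with a smaller batch size; nothing upstream has to know about its processing speed.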

Delivery Guarantees

At most once – Messages may be lost but are never duplicated.

At least once – Messages are never lost but may be duplicated.

Exactly once – Each message is processed once and only once; requires external coordination.

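By default Kafka provides at-least-once delivery. At-most-once delivery is possible by disabling producer retries and committing consumer offsets before processing a batch of messages. End-to-end exactly-once semantics depend on how the application commits offsets and processes data, for example by storing the offset atomically together with the processing result.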
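The effect of commit order on the consumer side can be shown with a small simulation. It assumes a single consumer that crashes once, between its two steps, and then restarts from the last committed offset; CommitOrderSketch and run are hypothetical names, and the model ignores producers, replication, and batching.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simulates one consumer that crashes once between "commit" and "process".
final class CommitOrderSketch {
    // Returns the messages that were actually processed, in order.
    // Pass crashAtOffset = -1 to simulate a run without any crash.
    static List<String> run(List<String> log, boolean commitBeforeProcess, int crashAtOffset) {
        List<String> processed = new ArrayList<>();
        int committed = 0;       // last committed position, survives the crash
        boolean crashed = false; // the crash happens at most once
        int offset = committed;
        while (offset < log.size()) {
            boolean crashNow = !crashed && offset == crashAtOffset;
            if (commitBeforeProcess) {
                committed = offset + 1;
                if (crashNow) {          // crash after commit: the message is lost
                    crashed = true;
                    offset = committed;
                    continue;
                }
                processed.add(log.get(offset));
            } else {
                processed.add(log.get(offset));
                if (crashNow) {          // crash before commit: the message is re-read
                    crashed = true;
                    offset = committed;
                    continue;
                }
                committed = offset + 1;
            }
            offset++;
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("a", "b", "c");
        System.out.println("commit first:  " + run(log, true, 1));  // at most once
        System.out.println("process first: " + run(log, false, 1)); // at least once
    }
}
```

Committing first loses message b, while processing first handles b twice; avoiding both outcomes requires making the offset commit and the processing result a single atomic step, which is the external coordination mentioned above.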

