What Makes Kafka the Backbone of Real‑Time Big Data Processing?

This article provides a comprehensive overview of Apache Kafka, covering its distributed architecture, key advantages and drawbacks, the role of ZooKeeper, message delivery semantics, partitioning strategies, storage mechanisms, and performance optimizations such as zero‑copy and batch processing, all essential for high‑throughput real‑time data pipelines.

Su San Talks Tech

Dec 28, 2021

1 Kafka Introduction

1.1 Overview

Apache Kafka is a distributed publish/subscribe messaging system written in Scala and Java. It was originally developed at LinkedIn and later donated to the Apache Software Foundation, and it is designed to provide a high‑throughput, low‑latency platform for real‑time data processing.

1.2 Advantages

Supports multiple producers and consumers.

Horizontal scalability of brokers.

Replication ensures data redundancy and prevents loss.

Topic‑based data classification.

Batch compression reduces transmission overhead.

Persistent storage on disk.

Millisecond‑level latency, even under large‑scale workloads.

Consumers can subscribe to multiple topics.

Low CPU, memory, and network consumption.

Cross‑data‑center replication and mirroring.

1.3 Disadvantages

Batch sending adds a small delay, so delivery is near real‑time rather than strictly real‑time.

Only intra‑partition ordering is guaranteed.

Monitoring requires additional plugins.

Data can be lost under weak settings (for example acks=0 or acks=1), and transactions are only available from version 0.11 onward.

Possible duplicate consumption and out‑of‑order messages.

1.4 Architecture

Broker : a Kafka server; a cluster consists of multiple brokers.

Producer : client that publishes messages to brokers.

Consumer : client that pulls messages from brokers.

Topic : logical queue that producers write to and consumers read from.

Partition : ordered sub‑log of a topic, enabling scalability.

Replication : each partition has a leader and one or more followers for fault tolerance.

Leader : the replica that handles reads and writes.

Follower : replicates data from the leader.

Consumer Group : a set of consumers sharing the consumption of a topic.

Offset : the position of a message within a partition; a consumer tracks its progress by the offset it has read up to.

1.5 ZooKeeper Role

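ZooKeeper stores cluster metadata for Kafka: broker registration, topic and partition state, and controller election. Older clients also relied on it for producer load balancing and consumer offset tracking; since Kafka 0.9 the consumer commits offsets to the internal __consumer_offsets topic instead.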

2 Kafka Production Process

2.1 Write Method

Producers use a push model and append each message to the end of a partition's log. Because these writes are sequential, disk throughput can be orders of magnitude higher than with random writes.

2.2 Partition

2.2.1 Partition Overview

Each topic consists of multiple ordered partition logs; every message receives a unique offset.
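
As a rough illustration of how offsets fall out of an append‑only log, here is a minimal in‑memory sketch. The class name PartitionLog and its methods are hypothetical and are not part of any Kafka API; a real partition lives on disk and is replicated.

```python
class PartitionLog:
    """Hypothetical in-memory stand-in for one partition's append-only log."""

    def __init__(self):
        self._records = []

    def append(self, value):
        # The offset is simply the record's position in the log,
        # so it is unique and increases with every append.
        offset = len(self._records)
        self._records.append(value)
        return offset

    def read(self, offset, max_records=10):
        # Consumers read forward from an offset; nothing is removed by reading.
        return self._records[offset:offset + max_records]


log = PartitionLog()
offsets = [log.append(v) for v in ("a", "b", "c")]
```

Reading from offset 1 returns the records appended after the first one, which is how a consumer resumes from a stored position.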

2.2.2 Partition Assignment Principles

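If a partition is specified, the producer uses it directly.

If no partition is specified but a key is provided, the hash of the key modulo the number of partitions determines the target.

If neither is provided, the producer spreads messages across partitions: older clients increment a round‑robin counter and take it modulo the partition count, while clients from version 2.4 onward fill one batch per partition at a time (the sticky partitioner).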
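
The three rules above can be sketched as follows. This is an illustration only: the class SimplePartitioner is hypothetical, and zlib.crc32 stands in for the murmur2 hash that the Java client applies to the key bytes, so the partition numbers will not match a real producer.

```python
import itertools
import zlib


class SimplePartitioner:
    """Hypothetical sketch of the partition-selection rules."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = itertools.count()  # round-robin counter

    def choose(self, partition=None, key=None):
        if partition is not None:
            return partition  # rule 1: explicit partition wins
        if key is not None:
            # rule 2: hash(key) mod partition count (crc32 is an assumption)
            return zlib.crc32(key) % self.num_partitions
        # rule 3: round-robin counter mod partition count
        return next(self._counter) % self.num_partitions


partitioner = SimplePartitioner(num_partitions=3)
same_key = {partitioner.choose(key=b"user-42") for _ in range(5)}
```

Because the key rule is deterministic, same_key contains a single partition number, which is what makes per‑key ordering possible.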

2.3 File Storage Mechanism

Each partition is a directory on disk. To avoid oversized files, Kafka splits the partition's log into segments; each segment has a .log file holding the messages and a .index file mapping offsets to positions in that log. Both files are named after the offset of the first message in the segment.

00000000000000000000.index
00000000000000000000.log
00000000000000170410.index
00000000000000170410.log
00000000000000239430.index
00000000000000239430.log
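
Since each segment is named after its first offset, locating the segment that holds a given offset reduces to a binary search over the sorted base offsets. The sketch below assumes the 20‑digit zero‑padded naming shown in the listing; the function find_segment is hypothetical.

```python
import bisect


def find_segment(base_offsets, target_offset):
    """Return the .log file name of the segment containing target_offset.

    base_offsets must be sorted ascending (one entry per segment).
    """
    # Rightmost segment whose base offset is <= target_offset.
    i = bisect.bisect_right(base_offsets, target_offset) - 1
    if i < 0:
        raise ValueError("offset precedes the first segment")
    return f"{base_offsets[i]:020d}.log"


segments = [0, 170410, 239430]
segment_for_200000 = find_segment(segments, 200000)
```

With the segments from the listing, offset 200000 falls in the segment whose base offset is 170410; the matching .index file would then be used to find the position inside that .log file.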

2.4 Ensuring Message Order

Give related messages the same key so that they all go to the same partition.

Consume each partition from a single thread.

4 Data Reliability

4.1 Message Delivery Semantics

at most once : messages may be lost but never duplicated.

at least once : messages are never lost but may be duplicated.

exactly once : messages are delivered once without loss or duplication.

4.2 Producer‑to‑Broker Flow

Producer looks up the leader of the target partition in the cluster metadata, which current clients fetch from the brokers rather than directly from ZooKeeper.

Producer sends the message to the leader.

Leader appends the message to its local log.

Followers fetch the message from the leader, append it to their own logs, and acknowledge it to the leader.

Leader replies to the producer once the number of replicas required by acks has the message.

The acks configuration determines reliability:

acks=0 : fire‑and‑forget; lowest latency, but messages can be lost.

acks=1 : waits for the leader's acknowledgment only; the default before Kafka 3.0.

acks=-1 or acks=all : waits for all in‑sync replicas to acknowledge; the default since Kafka 3.0.
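
To make the three acks levels concrete, the following sketch models when a producer would treat a send as acknowledged. The function is_acknowledged and its parameters are hypothetical; it is a simplified model, not broker code.

```python
def is_acknowledged(acks, leader_appended, in_sync_acks, in_sync_replicas):
    """Simplified model of the acks setting.

    in_sync_acks counts in-sync replicas (leader included) that hold the record.
    """
    if acks == 0:
        return True  # fire-and-forget: the producer never waits
    if acks == 1:
        return leader_appended  # only the leader's write matters
    if acks in (-1, "all"):
        # every in-sync replica must hold the record
        return leader_appended and in_sync_acks == in_sync_replicas
    raise ValueError(f"unsupported acks value: {acks!r}")


# Leader has written the record, but only 2 of 3 in-sync replicas have it.
partial = {a: is_acknowledged(a, True, 2, 3) for a in (0, 1, "all")}
```

In this state acks=0 and acks=1 already count the send as successful, while acks=all still waits, which is the trade‑off between latency and durability.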

4.2.1 Idempotence

Setting enable.idempotence=true assigns the producer a unique producer ID (PID) and attaches a per‑partition sequence number to what it sends, which lets the broker discard retried messages it has already written to that partition.
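
A minimal sketch of the deduplication idea, under the simplifying assumption that the broker remembers only the last sequence number per producer ID for a partition. The class DedupingPartition is hypothetical, and real brokers track more state than this.

```python
class DedupingPartition:
    """Hypothetical partition that drops retried records by (PID, sequence)."""

    def __init__(self):
        self.log = []
        self._last_seq = {}  # producer id -> last sequence number appended

    def append(self, pid, seq, value):
        last = self._last_seq.get(pid, -1)
        if seq <= last:
            return False  # a retry of something already in the log: drop it
        if seq != last + 1:
            raise ValueError("sequence gap: a record is missing")
        self.log.append(value)
        self._last_seq[pid] = seq
        return True


partition = DedupingPartition()
results = [
    partition.append(pid=7, seq=0, value="a"),
    partition.append(pid=7, seq=1, value="b"),
    partition.append(pid=7, seq=1, value="b"),  # producer retry after a lost ack
]
```

The retried record is rejected, so the log holds each value once even though the producer sent it twice.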

4.3 Broker Persistence Modes

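sync : data is flushed to disk before the write is acknowledged.

async : the write is acknowledged once the data reaches the OS page cache, so a crash before the flush can lose it. This is how Kafka behaves by default; durability then comes mainly from replication.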

4.4 Consumer Offset Management

Consumers track their progress by committing offsets. Committing before processing risks losing messages if the consumer fails mid‑processing (at most once), while committing after processing can cause duplicate consumption if the consumer fails before the commit succeeds (at least once).
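
The effect of commit order can be seen in a small simulation. Everything here is hypothetical scaffolding (the functions run and consume_with_one_crash are not Kafka APIs): a consumer crashes once, between the two steps for one record, and then restarts from its committed offset.

```python
def run(records, start, commit_first, crash_on=None):
    """Consume records[start:]; optionally crash between the two steps of one record.

    Returns (processed values, committed offset) at the end of the run.
    """
    processed, committed = [], start
    for offset in range(start, len(records)):
        steps = ("commit", "process") if commit_first else ("process", "commit")
        for i, step in enumerate(steps):
            if step == "commit":
                committed = offset + 1  # next offset to read after a restart
            else:
                processed.append(records[offset])
            if i == 0 and offset == crash_on:
                return processed, committed  # simulated crash after the first step
    return processed, committed


def consume_with_one_crash(records, commit_first, crash_on):
    first, committed = run(records, 0, commit_first, crash_on)
    second, _ = run(records, committed, commit_first)  # restart from committed offset
    return first + second


records = ["m0", "m1", "m2"]
at_most_once = consume_with_one_crash(records, commit_first=True, crash_on=1)
at_least_once = consume_with_one_crash(records, commit_first=False, crash_on=1)
```

Committing first skips m1 after the restart, while committing last processes m1 twice, so downstream handling usually needs to be idempotent in the second case.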

5 Partition Assignment Strategies

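RangeAssignor : the default; for each topic, partitions are split into contiguous ranges, one per consumer, so the first consumers receive an extra partition when the count does not divide evenly.

RoundRobinAssignor : lists all partitions of the subscribed topics and deals them out to consumers one at a time, which spreads them more evenly.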
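
The difference between the two strategies shows up when a group subscribes to several topics. The sketch below is a simplified model that assumes every consumer subscribes to every topic; the functions range_assign and round_robin_assign are hypothetical and not the client's implementation.

```python
def range_assign(topics, consumers):
    """topics maps topic name -> partition count; ranges are computed per topic."""
    consumers = sorted(consumers)
    result = {c: [] for c in consumers}
    for topic in sorted(topics):
        per, extra = divmod(topics[topic], len(consumers))
        start = 0
        for i, c in enumerate(consumers):
            size = per + (1 if i < extra else 0)  # first consumers get the remainder
            result[c] += [(topic, p) for p in range(start, start + size)]
            start += size
    return result


def round_robin_assign(topics, consumers):
    """Deal all partitions of all topics to the consumers one at a time."""
    consumers = sorted(consumers)
    result = {c: [] for c in consumers}
    all_partitions = [(t, p) for t in sorted(topics) for p in range(topics[t])]
    for i, tp in enumerate(all_partitions):
        result[consumers[i % len(consumers)]].append(tp)
    return result


topics = {"t0": 3, "t1": 3}
by_range = range_assign(topics, ["c0", "c1"])
by_round_robin = round_robin_assign(topics, ["c0", "c1"])
```

With two topics of three partitions each, the range strategy gives c0 four partitions and c1 two, because c0 receives the extra partition of every topic; round‑robin gives each consumer three.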

6 High‑Performance Read/Write

6.1 Sequential I/O

Sequential disk access avoids most seek time and rotational latency, giving throughput that can be orders of magnitude higher than random I/O.

6.2 Memory‑Mapped Files

A memory‑mapped file maps file pages directly into the process's virtual address space, so reads and writes become memory accesses and the OS handles paging between memory and disk. Kafka uses this technique for its index files.
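
As an illustration of memory‑mapped access, the sketch below writes a small index‑like file and reads one entry through mmap without an explicit read call. The entry layout (two 4‑byte big‑endian integers: a relative offset and a file position) and the helper names are assumptions made for the example.

```python
import mmap
import os
import struct
import tempfile

ENTRY = struct.Struct(">ii")  # assumed layout: relative offset, file position


def write_index(path, entries):
    with open(path, "wb") as f:
        for rel_offset, position in entries:
            f.write(ENTRY.pack(rel_offset, position))


def read_entry(path, n):
    # Map the file into memory and decode entry n directly from the mapping.
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        return ENTRY.unpack_from(m, n * ENTRY.size)


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "demo.index")
    write_index(path, [(0, 0), (35, 4096), (70, 8192)])
    second = read_entry(path, 1)
```

Because entries have a fixed size, entry n sits at byte n times the entry size, and the OS pages in only the part of the file that is actually touched.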

6.3 Zero‑Copy

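Zero‑copy transfers data from disk to the network without passing it through user space: Direct Memory Access (DMA) moves it between the disk, kernel buffers, and the network interface, which removes the CPU copies and extra context switches of the traditional read‑then‑write path.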
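
The same idea is available from Python through socket.sendfile, which uses the operating system's sendfile call where it is supported and falls back to ordinary sends otherwise. The sketch below sends a temporary file over a local socket pair; the helper names and the payload are made up for the example, and Kafka itself does this from the JVM rather than from Python.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path, sock):
    # The file's bytes are handed to the socket by the kernel where possible,
    # instead of being read into a user-space buffer first.
    with open(path, "rb") as f:
        return sock.sendfile(f)


def recv_exactly(sock, n):
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            break
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


payload = b"kafka-record " * 64  # small enough to fit in the socket buffer
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "segment.log")
    with open(path, "wb") as f:
        f.write(payload)
    left, right = socket.socketpair()
    with left, right:
        sent = send_file_zero_copy(path, left)
        received = recv_exactly(right, sent)
```

The application code never touches the file contents on the sending side, which is the property that lets a broker serve consumers straight from the page cache.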

6.4 Batch Delivery

Kafka moves messages in batches, both from producers to brokers and from brokers to consumers, which reduces network overhead and increases throughput. Workloads that need true real‑time processing may still rely on downstream stream processors such as Flink.
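
Batching can be sketched as a size‑bounded accumulator that hands a whole batch to the network layer at once. The class BatchAccumulator and its send callback are hypothetical; a real producer also flushes on a timer, which is left out here.

```python
class BatchAccumulator:
    """Hypothetical sketch: group records into batches of at most max_bytes."""

    def __init__(self, max_bytes, send):
        self.max_bytes = max_bytes
        self._send = send  # callback that receives one list of records per batch
        self._batch = []
        self._size = 0

    def add(self, record):
        # Flush first if this record would push the batch over the limit.
        if self._batch and self._size + len(record) > self.max_bytes:
            self.flush()
        self._batch.append(record)
        self._size += len(record)

    def flush(self):
        if self._batch:
            self._send(self._batch)
            self._batch, self._size = [], 0


sent_batches = []
accumulator = BatchAccumulator(max_bytes=10, send=sent_batches.append)
for i in range(5):
    accumulator.add(f"r{i:03d}".encode())  # each record is 4 bytes
accumulator.flush()
```

Five 4‑byte records with a 10‑byte limit leave in three sends instead of five; larger limits trade a little latency for fewer requests.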

