What Makes Kafka the Backbone of Real‑Time Big Data Processing?
This article provides a comprehensive overview of Apache Kafka, covering its distributed architecture, key advantages and drawbacks, the role of ZooKeeper, message delivery semantics, partitioning strategies, storage mechanisms, and performance optimizations such as zero‑copy and batch processing, all essential for high‑throughput real‑time data pipelines.
1 Kafka Introduction
1.1 Overview
Apache Kafka is a distributed publish/subscribe messaging system written in Scala and Java. Originally developed at LinkedIn and now maintained under the Apache Software Foundation, it is designed as a high-throughput, low-latency platform for real-time data processing.
1.2 Advantages
Supports multiple producers and consumers.
Horizontal scalability of brokers.
Replication ensures data redundancy and prevents loss.
Topic‑based data classification.
Batch compression reduces transmission overhead.
Persistent storage on disk.
Millisecond-level latency under large-scale workloads.
Consumers can subscribe to multiple topics.
Low CPU, memory, and network consumption.
Cross‑data‑center replication and mirroring.
1.3 Disadvantages
Batch sending adds latency, so delivery is near real-time rather than strictly real-time.
Only intra‑partition ordering is guaranteed.
Monitoring requires additional plugins.
Potential data loss under weak acknowledgment settings; transactions were not available before version 0.11.
Possible duplicate consumption and out‑of‑order messages.
1.4 Architecture
Broker : a Kafka server; a cluster consists of multiple brokers.
Producer : client that publishes messages to brokers.
Consumer : client that pulls messages from brokers.
Topic : logical queue that producers write to and consumers read from.
Partition : ordered sub‑log of a topic, enabling scalability.
Replication : each partition has a leader and one or more followers for fault tolerance.
Leader : the replica that handles reads and writes.
Follower : replicates data from the leader.
Consumer Group : a set of consumers sharing the consumption of a topic.
Offset : the position of a consumer within a partition.
1.5 ZooKeeper Role
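In ZooKeeper-based deployments, ZooKeeper manages cluster metadata for Kafka: broker registration, topic-partition mapping, and controller election. Older clients also relied on it for producer load balancing and consumer offset tracking; newer versions store consumer offsets in the internal __consumer_offsets topic, and recent releases can run without ZooKeeper by using KRaft mode.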
2 Kafka Production Process
2.1 Write Method
Producers use a push model, appending each message to the end of a partition's log. Because these are sequential disk writes, throughput can be orders of magnitude higher than with random writes.
2.2 Partition
2.2.1 Partition Overview
Each topic consists of one or more partitions, each an ordered log; every message receives an offset that is unique within its partition.
2.2.2 Partition Assignment Principles
If a partition is specified, the producer uses it directly.
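If no partition is specified but a key is provided, the hash of the key modulo the number of partitions determines the target.
If neither is provided, the producer picks partitions in round-robin fashion: it increments a counter and takes it modulo the partition count. (Clients from version 2.4 onward instead fill a batch for one partition before moving to the next.)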
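The three rules above can be sketched in a few lines of Python. This is an illustration, not Kafka's implementation: `choose_partition` is a hypothetical helper, and `zlib.crc32` is only a stand-in for the murmur2 hash that the Java client applies to the serialized key.

```python
import itertools
import zlib

_counter = itertools.count()  # shared round-robin counter (illustrative)

def choose_partition(num_partitions, partition=None, key=None):
    """Pick a target partition following the three rules above."""
    if partition is not None:
        # Rule 1: an explicit partition is used as-is.
        return partition
    if key is not None:
        # Rule 2: hash the key and take it modulo the partition count.
        # crc32 is a stand-in; the Java client uses murmur2.
        return zlib.crc32(key) % num_partitions
    # Rule 3: no partition and no key, so cycle through the partitions.
    return next(_counter) % num_partitions
```

Because rule 2 depends on the partition count, adding partitions to a topic changes where existing keys are routed.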
2.3 File Storage Mechanism
Each partition is stored as a directory of segment files. To avoid oversized log files, Kafka splits a partition's log into segments; each segment has its own .index and .log file, both named after the offset of the segment's first message.
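Because file names carry the offset of the first message, finding the file that holds a given offset is a binary search over the sorted starting offsets. A minimal sketch, assuming the starting offsets 0, 170410, and 239430; `segment_for_offset` is a hypothetical helper, not a Kafka API:

```python
import bisect

def segment_for_offset(base_offsets, target):
    """Return the 20-digit file stem of the segment containing `target`.

    `base_offsets` must be sorted ascending; each value is the offset of
    the first message in that segment.
    """
    i = bisect.bisect_right(base_offsets, target) - 1
    if i < 0:
        raise ValueError("offset precedes the first segment")
    return f"{base_offsets[i]:020d}"

bases = [0, 170410, 239430]
print(segment_for_offset(bases, 170409))  # 00000000000000000000
print(segment_for_offset(bases, 170410))  # 00000000000000170410
print(segment_for_offset(bases, 500000))  # 00000000000000239430
```

Appending .log or .index to the returned stem gives file names of the following form: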
00000000000000000000.index
00000000000000000000.log
00000000000000170410.index
00000000000000170410.log
00000000000000239430.index
00000000000000239430.log
2.4 Ensuring Message Order
Give all messages that must stay in order the same key, so that they are routed to the same partition.
Consume each partition from a single thread.
Do not rely on ordering across partitions; Kafka guarantees it only within one.
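A small simulation illustrates why keys matter for ordering. It is a sketch under assumptions: partitions are modeled as Python lists, `route` and `history` are hypothetical helpers, and `zlib.crc32` stands in for the client's real key hash.

```python
import zlib
from collections import defaultdict

def route(events, num_partitions):
    """Append (key, value) events to per-partition lists by key hash."""
    partitions = defaultdict(list)
    for key, value in events:
        p = zlib.crc32(key.encode()) % num_partitions
        partitions[p].append((key, value))
    return partitions

events = [("order-1", "created"), ("order-2", "created"),
          ("order-1", "paid"), ("order-2", "cancelled"),
          ("order-1", "shipped")]
partitions = route(events, num_partitions=4)

def history(key):
    """Collect one key's values in the order its partition stores them."""
    return [v for log in partitions.values() for k, v in log if k == key]

print(history("order-1"))  # ['created', 'paid', 'shipped']
```

Each key lives in exactly one partition, so its events keep their send order even though events for different keys may be read in any relative order.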
4 Data Reliability
4.1 Message Delivery Semantics
at most once : messages may be lost but never duplicated.
at least once : messages are never lost but may be duplicated.
exactly once : messages are delivered once without loss or duplication.
4.2 Producer‑to‑Broker Flow
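Producer finds the leader of the target partition from cluster metadata (modern clients request it from the brokers listed in bootstrap.servers; very old clients read it from ZooKeeper).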
Producer sends the message to the leader.
Leader appends the message to its local log.
Followers fetch the message from the leader, write it to their own logs, and acknowledge it.
Leader replies to the producer once the condition set by acks is met.
The acks configuration determines reliability:
acks=0 : fire-and-forget (lowest latency, possible loss).
acks=1 : waits for the leader's acknowledgment only; the default before Kafka 3.0.
acks=-1 or acks=all : waits for all in-sync replicas to acknowledge; the default since Kafka 3.0.
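The effect of each acks value on when a send counts as complete can be modeled as a small decision function. This is a simplified sketch, not broker code: `ready_to_ack` and its parameters are hypothetical, and it ignores `min.insync.replicas`, timeouts, and retries.

```python
def ready_to_ack(acks, leader_written, isr_followers, acked_followers):
    """Return True once the producer's request can be considered complete.

    acks            -- "0", "1", or "all" ("-1" is treated as "all")
    leader_written  -- True once the leader appended the record to its log
    isr_followers   -- set of follower ids currently in the in-sync set
    acked_followers -- set of follower ids that have replicated the record
    """
    if acks == "0":
        # Fire-and-forget: the producer does not wait for any response.
        return True
    if acks == "1":
        # Leader-only: done as soon as the leader has the record.
        return leader_written
    if acks in ("all", "-1"):
        # Every in-sync replica must have the record.
        return leader_written and isr_followers <= acked_followers
    raise ValueError(f"unknown acks value: {acks!r}")
```

For example, with in-sync followers {1, 2} and only follower 1 caught up, the function returns True for acks="1" but False for acks="all".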
4.2.1 Idempotence
Setting enable.idempotence=true assigns the producer a unique producer ID (PID) and attaches per-partition sequence numbers to what it sends, allowing the broker to discard retried duplicates within a partition.
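The deduplication idea can be sketched as follows. This is a simplified model, not the broker's implementation: `PartitionLog` is a hypothetical class that tracks only the last sequence number per producer ID, whereas the real broker works on record batches and also tracks producer epochs.

```python
class PartitionLog:
    """Toy single-partition log that deduplicates by (PID, sequence)."""

    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer id -> last appended sequence number

    def append(self, pid, seq, value):
        """Append unless (pid, seq) was already written.

        Returns "appended" or "duplicate"; raises on a sequence gap.
        """
        last = self.last_seq.get(pid, -1)
        if seq <= last:
            # A retry of something already stored: acknowledge, do not re-append.
            return "duplicate"
        if seq != last + 1:
            raise ValueError(f"sequence gap: expected {last + 1}, got {seq}")
        self.records.append(value)
        self.last_seq[pid] = seq
        return "appended"

log = PartitionLog()
log.append(pid=7, seq=0, value="a")
log.append(pid=7, seq=1, value="b")
log.append(pid=7, seq=1, value="b")  # retried send, ignored
print(log.records)  # ['a', 'b']
```

The state is kept per partition, which is why idempotence alone removes duplicates only within a partition.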
4.3 Broker Persistence Modes
sync : data is flushed to disk before acknowledging.
async : acknowledgment occurs once data reaches the OS page cache; an unflushed write can be lost if the machine crashes, which is why replication matters for durability.
4.4 Consumer Offset Management
Consumers track their progress by committing offsets. Committing before processing risks losing messages if the consumer fails mid-processing (at most once); committing after processing risks duplicate consumption if the consumer fails before the commit succeeds (at least once).
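The trade-off can be made concrete with a small simulation of a consumer that crashes once while handling a message. It is a sketch under assumptions: `consume` is a hypothetical helper, the committed offset is a plain integer, and a restart resumes from the last committed offset.

```python
def consume(messages, commit_first, crash_at):
    """Process `messages`, crashing once at index `crash_at`, then restart.

    commit_first=True  commits the offset before processing (at most once).
    commit_first=False commits after processing (at least once).
    The crash hits between the two steps. Returns the processed messages.
    """
    processed = []
    committed = 0          # next offset to read after a restart
    crashed = False
    while committed < len(messages):
        offset = committed
        if commit_first:
            committed = offset + 1
            if offset == crash_at and not crashed:
                crashed = True
                continue   # crash before processing: message is skipped
            processed.append(messages[offset])
        else:
            processed.append(messages[offset])
            if offset == crash_at and not crashed:
                crashed = True
                continue   # crash before commit: message is re-read
            committed = offset + 1
    return processed

msgs = ["m0", "m1", "m2"]
print(consume(msgs, commit_first=True, crash_at=1))   # ['m0', 'm2']
print(consume(msgs, commit_first=False, crash_at=1))  # ['m0', 'm1', 'm1', 'm2']
```

Committing first loses m1; committing afterwards processes it twice, so downstream handling needs to tolerate repeats.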
5 Partition Assignment Strategies
RangeAssignor : the long-standing default; for each topic, contiguous ranges of partitions are assigned to consumers in order.
RoundRobinAssignor : distributes partitions evenly across consumers by cycling through them.
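The difference between the two strategies is easiest to see on a small example. The functions below are simplified sketches for a single topic with every consumer subscribed to it; `range_assign` and `round_robin_assign` are hypothetical names, not the Kafka classes themselves.

```python
def range_assign(partitions, consumers):
    """Range-style: contiguous blocks of one topic's partitions."""
    consumers = sorted(consumers)
    per, extra = divmod(len(partitions), len(consumers))
    result, start = {}, 0
    for i, c in enumerate(consumers):
        size = per + (1 if i < extra else 0)  # first `extra` consumers get one more
        result[c] = partitions[start:start + size]
        start += size
    return result

def round_robin_assign(partitions, consumers):
    """Round-robin: deal partitions out one at a time."""
    consumers = sorted(consumers)
    result = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        result[consumers[i % len(consumers)]].append(p)
    return result

parts = list(range(7))
print(range_assign(parts, ["c1", "c2", "c3"]))
# {'c1': [0, 1, 2], 'c2': [3, 4], 'c3': [5, 6]}
print(round_robin_assign(parts, ["c1", "c2", "c3"]))
# {'c1': [0, 3, 6], 'c2': [1, 4], 'c3': [2, 5]}
```

Since range assignment is applied per topic, the first consumers receive the extra partition of every topic, which can skew load when a group subscribes to many topics.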
6 High‑Performance Read/Write
6.1 Sequential I/O
Sequential disk access minimizes seek time and rotational latency, offering orders‑of‑magnitude faster throughput than random I/O.
6.2 Memory‑Mapped Files
Memory mapping places file pages directly in the process address space, so the OS handles paging between memory and disk and the application avoids explicit read and write calls. Kafka uses memory-mapped files for its index files.
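Kafka's offset index files are a typical use of memory mapping: fixed-size entries that are binary-searched in place. The sketch below is a simplified model, assuming 8-byte entries made of a 4-byte relative offset and a 4-byte byte position; `write_index` and `lookup` are hypothetical helpers, and the sample entries are invented for the demo.

```python
import mmap
import os
import struct
import tempfile

ENTRY = struct.Struct(">ii")  # (relative offset, byte position), 8 bytes

def write_index(path, entries):
    """Write sorted (relative_offset, position) pairs as fixed-size entries."""
    with open(path, "wb") as f:
        for rel, pos in entries:
            f.write(ENTRY.pack(rel, pos))

def lookup(path, target_rel):
    """Binary-search the mapped file for the last entry <= target_rel."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            lo, hi, best = 0, len(m) // ENTRY.size - 1, None
            while lo <= hi:
                mid = (lo + hi) // 2
                rel, pos = ENTRY.unpack_from(m, mid * ENTRY.size)
                if rel <= target_rel:
                    best, lo = (rel, pos), mid + 1
                else:
                    hi = mid - 1
            return best

path = os.path.join(tempfile.mkdtemp(), "demo.index")
write_index(path, [(0, 0), (40, 4096), (85, 8192), (130, 12288)])
print(lookup(path, 100))  # (85, 8192)
```

The index is sparse, so the lookup returns the nearest earlier entry; a reader would then scan the log file forward from that byte position.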
6.3 Zero‑Copy
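Zero-copy lets the kernel move data from the page cache to the network interface without copying it through user space (on Linux via the sendfile system call, with DMA handling the transfers to and from devices). This removes the CPU copies into and out of an application buffer, and the associated context switches, that the traditional read-then-write path requires.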
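Kafka itself reaches this path through Java's `FileChannel.transferTo`. The same idea can be tried from Python with `socket.sendfile`, which uses `os.sendfile` where the platform supports it and otherwise falls back to ordinary sends. The sketch below is a demo under assumptions: `send_file_zero_copy` is a hypothetical helper, and a local socket pair plus a temporary file stand in for a network connection and a log segment.

```python
import os
import socket
import tempfile

def send_file_zero_copy(path, sock):
    """Send a whole file over `sock` via socket.sendfile().

    On platforms with os.sendfile the kernel moves the bytes from the page
    cache to the socket without copying them into this process; elsewhere
    Python falls back to ordinary send() calls.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)

# Demo over a local socket pair with a small temporary file.
payload = b"kafka-log-segment-bytes" * 100
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(payload)

sender, receiver = socket.socketpair()
sent = send_file_zero_copy(path, sender)
sender.close()

received = b""
while True:
    chunk = receiver.recv(4096)
    if not chunk:
        break
    received += chunk
receiver.close()
os.remove(path)
print(sent == len(payload) and received == payload)
```

The sending side never reads the file contents into a Python bytes object; it only hands the kernel a file descriptor and a socket.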
6.4 Batch Delivery
Kafka delivers messages to consumers in batches, reducing network overhead and increasing TPS, though true real‑time processing may still rely on downstream stream processors such as Flink.
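The saving from batching can be illustrated by counting requests. This is a minimal sketch, assuming a byte limit per batch loosely analogous to a setting such as the producer's `batch.size`; `make_batches` is a hypothetical helper, not a Kafka API.

```python
def make_batches(records, max_batch_bytes):
    """Group byte-string records into batches no larger than max_batch_bytes.

    A record larger than the limit is sent in a batch of its own.
    """
    batches, current, size = [], [], 0
    for r in records:
        if current and size + len(r) > max_batch_bytes:
            batches.append(current)
            current, size = [], 0
        current.append(r)
        size += len(r)
    if current:
        batches.append(current)
    return batches

records = [b"x" * 400] * 10          # ten 400-byte records
batches = make_batches(records, 1024)
print(len(records), "records ->", len(batches), "requests")
# 10 records -> 5 requests
```

Larger limits mean fewer round trips but a longer wait before the first record in a batch is delivered, which is the latency cost of batching.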
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
