Master Kafka: Core Concepts, Architecture, and Interview Essentials

This comprehensive guide explains why Kafka is the industry standard for real‑time data pipelines, compares it with RabbitMQ and RocketMQ, details its architecture, APIs, leader election, offset management, idempotence, rebalance handling, and provides practical interview questions and code examples for engineers.

Alibaba Cloud Developer

Feb 19, 2024

Why Kafka

Message queues provide asynchronous processing, peak‑shaving, and decoupling. For small‑to‑medium companies, RabbitMQ (open‑source, active community) is a good choice; large companies with strong infrastructure teams often use RocketMQ (Java‑centric). In big‑data scenarios such as real‑time analytics and log collection, Kafka is the de‑facto standard due to its active community and reliability.

RabbitMQ

Originally built for reliable telecom communication and one of the few products supporting the AMQP protocol.

Advantages

Lightweight, fast, easy to deploy.

Flexible routing configuration via exchange modules.

Client libraries for most programming languages supporting AMQP.

Disadvantages

Performance degrades sharply with large backlogs.

Not suitable for workloads requiring tens of thousands of messages per second.

Extension and secondary development are costly because it is written in Erlang.

RocketMQ

Adopts Kafka’s design with many improvements and offers almost all features a message queue should have.

Advantages

Used for ordered and transactional messaging, stream computing, message push, log processing, and binlog distribution.

Proven performance and stability through multiple Double‑11 events.

Java‑centric development makes source reading, extension, and secondary development convenient.

Optimized for low latency in e‑commerce scenarios.

Handles tens of thousands of messages per second with millisecond‑level response.

Performance is an order of magnitude higher than RabbitMQ.

Supports dead‑letter queues (DLQ) for handling messages that repeatedly fail consumption, which improves system reliability.

Disadvantages

Integration and compatibility with surrounding systems are not ideal.

Kafka

Advantages

Supported by almost all related open‑source software, suitable for most scenarios, especially big‑data and stream computing.

Efficient, scalable, persistent with partition, replication, and fault tolerance.

Designed for batch and asynchronous processing, delivering very high throughput.

Can process tens of thousands of asynchronous messages per second; with compression, up to 20 million messages per second.

Disadvantages

Higher latency because of its asynchronous, batch‑oriented design; less suitable for latency‑critical e‑commerce use cases.

What Kafka Provides

Producer API : Publish record streams to one or more topics.

Consumer API : Subscribe to topics and process the generated record streams.

Streams API : Act as stream processors, transforming input streams to output streams.

Message Basics

A Kafka message is analogous to a row in a database table.

Core Concepts

Topic : Logical grouping of messages, similar to a database table.

Partition : Topics are split into partitions distributed across the cluster for scalability; messages are ordered within a partition, but not across partitions.

Replica : Each partition has multiple replicas for fault tolerance.

Producer : Distributes messages across partitions (by key hash, explicit partition, or round‑robin).

Consumer : Uses offsets to track which messages it has read; old versions stored offsets in ZooKeeper, while versions since 0.9 store them in the internal Kafka topic __consumer_offsets.

Consumer Group : Ensures each partition is consumed by only one consumer within the group; partitions are reassigned (rebalanced) when group membership changes.

Broker (Node) : Connects producers and consumers; a single broker can handle thousands of partitions and millions of messages per second.

Leader Election : Each partition has a leader; followers replicate from the leader. ISR (in‑sync replica) set determines eligible leaders.

Offsets : The producer offset marks the latest written position in a partition; the consumer offset tracks the read position and is kept per consumer group, so each group reads the partition independently.

Log Segment : A partition consists of multiple log segments composed of .log, .index, and .timeindex files.
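The three partition-selection strategies listed under Producer can be sketched as follows. This is a minimal illustration, not Kafka's partitioner: the class name PartitionChooserSketch and its method are hypothetical, and the hash is a simple stand-in (Kafka's Java client hashes the key bytes with murmur2).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative partition chooser (hypothetical class, not part of Kafka). */
final class PartitionChooserSketch {
    private final int numPartitions;
    private final AtomicInteger roundRobin = new AtomicInteger();

    PartitionChooserSketch(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    /**
     * @param explicitPartition partition requested by the caller, or null
     * @param key               message key, or null
     * @return the partition index in [0, numPartitions)
     */
    int choose(Integer explicitPartition, String key) {
        if (explicitPartition != null) {
            if (explicitPartition < 0 || explicitPartition >= numPartitions) {
                throw new IllegalArgumentException("no such partition: " + explicitPartition);
            }
            return explicitPartition; // the caller pinned the partition
        }
        if (key != null) {
            // Stand-in hash (assumption): any stable hash keeps equal keys together.
            int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
            return Math.floorMod(hash, numPartitions);
        }
        // No key: spread messages evenly across partitions.
        return Math.floorMod(roundRobin.getAndIncrement(), numPartitions);
    }
}
```

Because equal keys always hash to the same partition, all messages for one key stay in order relative to each other.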

Idempotence and Exactly‑Once

Idempotence guarantees that re‑sent messages are not processed multiple times, ensuring final result consistency. It is achieved by adding a unique identifier (similar to a primary key) to each message.

ProducerID (PID): a unique ID assigned to each producer when it is initialized.
SequenceNumber: increases monotonically per topic partition for each PID; the broker discards a record whose sequence number it has already accepted.
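To make the PID and sequence-number idea concrete, here is a toy model of deduplication for a single partition. It is a sketch under simplifying assumptions, not broker code: the class IdempotentPartitionLogSketch is hypothetical, sequences are assumed to start at 0, and records are tracked one at a time rather than in batches.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of sequence-number deduplication for one partition (hypothetical). */
final class IdempotentPartitionLogSketch {
    // Last accepted sequence number per producer ID, for this partition only.
    private final Map<Long, Integer> lastSeqByProducer = new HashMap<>();
    private int appended = 0;

    /**
     * @return true if the record was appended, false if it was a duplicate retry
     * @throws IllegalStateException if the sequence number skips ahead
     */
    boolean append(long producerId, int sequence) {
        int last = lastSeqByProducer.getOrDefault(producerId, -1);
        if (sequence <= last) {
            return false; // already in the log: a re-send after a lost acknowledgment
        }
        if (sequence != last + 1) {
            throw new IllegalStateException(
                "out-of-order sequence " + sequence + ", expected " + (last + 1));
        }
        lastSeqByProducer.put(producerId, sequence);
        appended++;
        return true;
    }

    int appendedCount() {
        return appended;
    }
}
```

A retried send carries the same sequence number as the original, so the second append is recognized as a duplicate and the log holds the record once.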

Rebalance Issues

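Rebalancing can cause minutes‑long unavailability for large consumer groups. Causes include changes in group membership, subscribed topics, or partition counts. Mitigations involve tuning session.timeout.ms, heartbeat.interval.ms, and max.poll.interval.ms so that healthy consumers are not evicted by mistake: keep the heartbeat interval at no more than one third of the session timeout, and set max.poll.interval.ms above the worst‑case time needed to process one poll batch.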

session.timeout.ms=6000
heartbeat.interval.ms=2000
max.poll.interval.ms=60000
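The settings above can be carried in a java.util.Properties object and sanity-checked before a consumer is created. The sketch below uses only the standard library; the class RebalanceTuningSketch and its helper methods are hypothetical, and the one-third rule is the usual guideline for heartbeat.interval.ms relative to session.timeout.ms.

```java
import java.util.Properties;

/** Hypothetical helper that builds and sanity-checks rebalance-related settings. */
final class RebalanceTuningSketch {
    static Properties consumerProps() {
        Properties p = new Properties();
        p.setProperty("session.timeout.ms", "6000");
        p.setProperty("heartbeat.interval.ms", "2000");
        p.setProperty("max.poll.interval.ms", "60000");
        return p;
    }

    /** True if at least three heartbeats fit into one session timeout. */
    static boolean heartbeatFitsSession(Properties p) {
        long session = Long.parseLong(p.getProperty("session.timeout.ms"));
        long heartbeat = Long.parseLong(p.getProperty("heartbeat.interval.ms"));
        return heartbeat * 3 <= session;
    }

    /** True if the poll interval exceeds the slowest expected batch (an assumed measurement). */
    static boolean pollIntervalCovers(Properties p, long worstCaseBatchMs) {
        long maxPoll = Long.parseLong(p.getProperty("max.poll.interval.ms"));
        return maxPoll > worstCaseBatchMs;
    }
}
```

With these values a consumer can miss two heartbeats before the session expires, and a batch may take up to a minute before the consumer is considered stuck.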

ZooKeeper Role

Kafka has traditionally used ZooKeeper for metadata storage, member management, controller election, and other administrative tasks. KIP‑500 replaces ZooKeeper with a Raft‑based controller quorum (KRaft mode), which has been production‑ready for new clusters since Kafka 3.3.

Replica Mechanics

By default, only the leader replica serves read and write requests; followers pull data from the leader. Since Kafka 2.4 (KIP‑392), consumers can optionally fetch from follower replicas. The leader epoch mechanism improves consistency during leader changes.
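As a rough illustration of how the ISR set constrains leader election, the sketch below picks the first assigned replica that is both alive and in the ISR. It is a simplification under stated assumptions: the class IsrLeaderElectionSketch is hypothetical, broker IDs are plain integers, and unclean election (choosing a replica outside the ISR) is not modeled.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical, simplified leader election restricted to in-sync replicas. */
final class IsrLeaderElectionSketch {
    /**
     * @param assignedReplicas broker IDs hosting the partition, in preference order
     * @param isr              broker IDs currently in sync with the old leader
     * @param liveBrokers      broker IDs that are currently reachable
     * @return the new leader, or empty if no in-sync replica is alive
     */
    static Optional<Integer> electLeader(List<Integer> assignedReplicas,
                                         Set<Integer> isr,
                                         Set<Integer> liveBrokers) {
        return assignedReplicas.stream()
                .filter(isr::contains)         // only in-sync replicas are eligible
                .filter(liveBrokers::contains) // and they must be reachable
                .findFirst();
    }
}
```

If the result is empty, the partition stays offline until an in-sync replica returns, which trades availability for not losing acknowledged data.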

Preventing Duplicate Consumption

Commit offsets after processing.

Use unique key constraints in MySQL combined with Redis to track consumed IDs.

Employ Bloom filters for high‑throughput scenarios.
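The ID-tracking approach can be sketched with an in-memory set standing in for Redis or a MySQL unique key. This is an illustration only: the class DedupingHandlerSketch is hypothetical, message IDs are assumed to be unique strings supplied by the producer, and a real deployment would need a shared, persistent store.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Hypothetical consumer-side handler that skips message IDs it has already processed. */
final class DedupingHandlerSketch {
    // In-memory stand-in (assumption) for Redis or a table with a unique key.
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();
    private final Consumer<String> businessLogic;

    DedupingHandlerSketch(Consumer<String> businessLogic) {
        this.businessLogic = businessLogic;
    }

    /** @return true if the payload was processed, false if the ID was seen before */
    boolean handle(String messageId, String payload) {
        if (!processedIds.add(messageId)) {
            return false; // duplicate delivery: skip the side effects
        }
        try {
            businessLogic.accept(payload);
            return true;
        } catch (RuntimeException e) {
            processedIds.remove(messageId); // allow a retry when the message is redelivered
            throw e;
        }
    }
}
```

Claiming the ID before running the business logic prevents two threads from processing the same message concurrently; releasing it on failure keeps a failed message retryable.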

Ensuring No Data Loss

Producer: Set acks=all for full acknowledgment.

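Broker: Replicate each partition and require more than one in‑sync replica per write (for example, replication factor 3 with min.insync.replicas=2); producer retries cover transient failures.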

Consumer: Disable auto‑commit and commit offsets only after successful processing.
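The consumer-side rule, commit only after successful processing, can be modeled without a broker. In the sketch below an in-memory list stands in for one partition and an integer for the committed offset; the class CommitAfterProcessSketch and its methods are hypothetical and only illustrate the ordering of processing and commit.

```java
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical model of manual offset commits for a single partition. */
final class CommitAfterProcessSketch {
    private final List<String> partition; // stand-in for the partition's records
    private int committedOffset = 0;      // next offset to read after a restart

    CommitAfterProcessSketch(List<String> partition) {
        this.partition = partition;
    }

    /**
     * Processes up to maxRecords starting at the committed offset.
     * The offset advances only if every record in the batch succeeds.
     *
     * @return the number of records processed and committed
     */
    int pollAndProcess(int maxRecords, Consumer<String> handler) {
        int end = Math.min(partition.size(), committedOffset + maxRecords);
        for (int offset = committedOffset; offset < end; offset++) {
            handler.accept(partition.get(offset)); // an exception here skips the commit
        }
        int processed = end - committedOffset;
        committedOffset = end; // manual commit after successful processing
        return processed;
    }

    int committedOffset() {
        return committedOffset;
    }
}
```

If the handler fails partway through a batch, the next call starts again from the last committed offset. Nothing is lost, but records before the failure are delivered twice, which is why at-least-once consumption is usually paired with deduplication.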

Maintaining Order

Single topic, single partition, single consumer guarantees order (low throughput).

For per‑key ordering, assign each key to a dedicated in‑memory queue processed by a single thread.
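Per-key ordering with in-memory queues might look like the sketch below. It assumes a fixed pool of single-threaded executors, each with its own FIFO queue, and routes every key to one of them by hash; several keys may share a queue, but messages for the same key always run on the same thread in submission order. The class KeyOrderedDispatcherSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical dispatcher that preserves per-key order across worker threads. */
final class KeyOrderedDispatcherSketch {
    private final List<ExecutorService> workers = new ArrayList<>();

    KeyOrderedDispatcherSketch(int workerCount) {
        for (int i = 0; i < workerCount; i++) {
            // Each executor owns one thread and one FIFO task queue.
            workers.add(Executors.newSingleThreadExecutor());
        }
    }

    /** Queues the task on the worker that owns this key. */
    void dispatch(String key, Runnable task) {
        int slot = Math.floorMod(key.hashCode(), workers.size());
        workers.get(slot).execute(task);
    }

    /** Stops accepting tasks and waits for queued ones; returns false on timeout or interrupt. */
    boolean shutdownAndWait(long timeoutMs) {
        for (ExecutorService w : workers) {
            w.shutdown();
        }
        try {
            boolean allDone = true;
            for (ExecutorService w : workers) {
                allDone &= w.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
            }
            return allDone;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A single consumer thread can poll a partition and call dispatch for each record, so processing runs in parallel across keys while each key keeps its order.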

Handling Message Backlog

Increase consumer parallelism.

Batch consumption.

Reduce I/O interactions.

Prioritize critical messages.

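// Fragment from the body of a RocketMQ MessageListenerConcurrently callback;
// ConsumeConcurrentlyStatus is a RocketMQ type, and maxOffset/curOffset are assumed to be in scope.
if (maxOffset - curOffset > 100000) {
    // Backlog exceeds 100,000 messages: skip normal processing (discard or log the
    // message) and acknowledge it so the consumer can catch up.
    return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
}
// TODO: normal processing for the non-backlogged case
return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;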

Designing a Message Queue

Key requirements: horizontal scalability (broker + partition), consistency, availability, partition tolerance, and handling massive data volumes. Techniques include time wheels, zero‑copy, I/O multiplexing, sequential read/write, and batch compression.
