
Apache Kafka Overview, Architecture, and Sample Producer/Consumer Code

This article provides a comprehensive overview of Apache Kafka: a comparison with ActiveMQ, its distributed architecture, topics and partitions, the consumption model, high‑availability mechanisms, and exactly‑once semantics, along with Java producer and consumer code examples.


1. Overview

Apache Kafka, originally open‑sourced by LinkedIn and now an Apache project, is one of the most widely used distributed messaging systems, marketed as a "distributed streaming platform" since version 0.9.

Key differences from traditional message queues:

Kafka is a distributed system that scales easily.

It provides high throughput for both publishing and subscribing.

Supports multiple consumers with automatic load‑balancing on failures.

Messages are persisted.

Comparison with ActiveMQ:

| Feature | Kafka | ActiveMQ |
| --- | --- | --- |
| Background | High‑performance, distributed log‑based system for log collection, stream processing, and message distribution | JMS‑compliant enterprise messaging middleware |
| Development language | Java, Scala | Java |
| Protocol support | Custom binary protocol | JMS |
| Persistence | Supported | Supported |
| Transaction support | Supported since 0.11.0 | Supported |
| Producer fault tolerance | Configurable ack levels (0, 1, -1); retries may produce duplicate data | Retry on failure with ack model |
| Throughput | High, using batch processing, zero‑copy, and O(1) sequential disk operations | — |
| Load balancing | Coordinated via ZooKeeper; producers discover brokers and can target specific partitions | — |

2. Getting Started

2.1 Producer

Java example that creates a KafkaProducer with typical configuration properties and continuously sends numbered messages.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class UserKafkaProducer extends Thread {
    private final KafkaProducer<String, String> producer;
    private final String topic;
    private final Properties props = new Properties();

    public UserKafkaProducer(String topic) {
        // metadata.broker.list belongs to the legacy 0.8 producer API;
        // the new KafkaProducer only needs bootstrap.servers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");
        props.put("retries", 0);
        props.put("batch.size", 16384);
        props.put("linger.ms", 1);
        props.put("buffer.memory", 33554432);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producer = new KafkaProducer<>(props);
        this.topic = topic;
    }

    @Override
    public void run() {
        int messageNo = 1;
        while (true) {
            String messageStr = "Message_" + messageNo;
            System.out.println("Send:" + messageStr);
            producer.send(new ProducerRecord<>(topic, messageStr));
            messageNo++;
            try {
                Thread.sleep(3000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
    }
}

2.2 Consumer

Java example that configures a KafkaConsumer, subscribes to topics, polls for records, and prints each record's offset, key, and value.

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("session.timeout.ms", "30000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));

while (true) {
    // poll(long) is deprecated since 2.0; prefer the Duration overload
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("offset = %d, key = %s, value = %s%n",
                record.offset(), record.key(), record.value());
    }
}

3. Kafka Architecture Principles

Key questions addressed include how topics and partitions are stored, advantages of Kafka's consumption model, and how distributed storage and retrieval are achieved.

3.1 Architecture Diagram

3.2 Terminology

| Name | Explanation |
| --- | --- |
| Broker | Message‑handling node; a Kafka cluster consists of one or more brokers. |
| Topic | Logical channel for categorising messages; each message must specify a topic. |
| Producer | Client that publishes messages to brokers. |
| Consumer | Client that reads messages from brokers. |
| ConsumerGroup | Set of consumers sharing the same group id; each message is processed by only one consumer within a group. |
| Partition | Physical subdivision of a topic; each partition is an ordered, append‑only log. |

3.3 Topic and Partition

Each message belongs to a topic, which can have multiple partitions. Partitions store logs in append‑only files; offsets provide a unique, monotonically increasing identifier within a partition, guaranteeing order only at the partition level.

Message routing to partitions:

No key → round‑robin distribution.

With key → hash(key) % number_of_partitions, ensuring the same key always lands in the same partition.
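The key‑based rule can be sketched in plain Java. Note that Kafka's built‑in DefaultPartitioner actually applies murmur2 to the serialized key bytes; String.hashCode is used here only to illustrate hash(key) % number_of_partitions:

```java
public class PartitionRouter {
    // Illustrative only: Kafka's DefaultPartitioner uses murmur2 on the
    // serialized key bytes, not String.hashCode.
    static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit so the result is always a valid partition index.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition.
        System.out.println(partitionFor("user-42", 3));
        System.out.println(partitionFor("user-42", 3));
    }
}
```

Because the modulus depends on the partition count, adding partitions to a topic changes where keyed messages land, which is why per‑key ordering is only guaranteed while the partition count stays fixed.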

3.4 Consumption Model

Kafka uses a pull‑based model: consumers control their own read rate and can seek to arbitrary offsets to re‑consume, unlike push‑based systems, which cannot adapt to slow consumers and may lose messages when a consumer fails.

3.5 Network Model

3.5.1 KafkaClient – Single‑Threaded Selector

Suitable for low‑concurrency scenarios.

3.5.2 Kafka Server – Multi‑Threaded Selector

The server uses an acceptor thread plus thread pools for read/write, allowing high concurrency.

3.6 High‑Reliability Distributed Storage

Kafka achieves durability through log segmentation and sparse indexing, storing each partition as a series of .log and .index files. Reads use binary search on the index to locate the correct segment, then scan the log.

3.6.1 Log Segments

Each segment holds a bounded slice of the log (illustrated here as 100 messages per segment; in a real broker the limit is byte‑based, via log.segment.bytes) and is named after the offset of its first message. Sparse indexes keep memory usage low while still enabling fast lookups.
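The segment lookup described above can be sketched in a few lines, assuming segments are named by the base offset of their first message (offsets and segment boundaries below are made‑up values). Binary search finds the greatest base offset not exceeding the target:

```java
import java.util.Arrays;

public class SegmentLookup {
    // baseOffsets are the first offsets of each segment file, sorted
    // ascending (e.g. ...000000.log, ...000368.log, ...000737.log).
    // Assumes targetOffset >= baseOffsets[0] (the first base is 0).
    static long segmentFor(long[] baseOffsets, long targetOffset) {
        int pos = Arrays.binarySearch(baseOffsets, targetOffset);
        if (pos >= 0) {
            return baseOffsets[pos];       // target is a segment's first offset
        }
        int insertion = -pos - 1;          // index of first base offset > target
        return baseOffsets[insertion - 1]; // greatest base offset <= target
    }

    public static void main(String[] args) {
        long[] bases = {0L, 368L, 737L};
        System.out.println(segmentFor(bases, 400)); // lives in the 368 segment
    }
}
```

Once the segment is found, the broker consults that segment's sparse .index file to locate a nearby physical position, then scans the .log file forward to the exact message.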

3.6.2 Replication Mechanism

Each partition has a leader replica and zero or more follower replicas. The ISR (In‑Sync Replicas) set contains replicas that are up‑to‑date and connected to ZooKeeper; only these can become the new leader on failure.

4. High‑Availability Model and Idempotence

Kafka supports three delivery semantics:

at‑least‑once: may deliver duplicates; requires idempotent processing on the consumer side.

at‑most‑once: may lose messages if the producer does not retry.

exactly‑once: guarantees a single delivery using transactions and sequence numbers (available from 0.11.0).
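Under at‑least‑once delivery, idempotent processing is what makes redeliveries harmless. A minimal sketch of the idea — in production the seen‑set would be a persistent store keyed by topic, partition, and offset, not an in‑memory set:

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotentHandler {
    private final Set<Long> seenOffsets = new HashSet<>();
    private int processedCount = 0;

    // Returns true if the record was processed, false if it was a
    // duplicate redelivery and was therefore skipped.
    boolean handle(long offset) {
        if (!seenOffsets.add(offset)) {
            return false;     // already seen this offset; skip the side effect
        }
        processedCount++;     // the real side effect (DB write, etc.) goes here
        return true;
    }

    int processedCount() {
        return processedCount;
    }
}
```

With this guard in place, a redelivered batch after a consumer crash re‑runs the loop but produces no duplicate side effects.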

4.1 Implementing Exactly‑Once

4.1.1 Single Producer, Single Topic

Each producer gets a unique PID; messages carry a monotonically increasing sequence number. The broker validates the sequence, rejecting out‑of‑order or duplicate messages.

4.1.2 Transactions

Producers provide a Transaction ID; the Transaction Coordinator persists transaction state in an internal topic. Transactions ensure atomicity across multiple partitions and enable recovery after failures.
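A sketch of the producer configuration behind these guarantees; the broker address and transactional.id below are placeholders. With kafka-clients 0.11.0+, the transactional API on KafkaProducer is driven by initTransactions(), beginTransaction(), and commitTransaction()/abortTransaction():

```java
import java.util.Properties;

public class ExactlyOnceConfig {
    static Properties transactionalProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("enable.idempotence", "true"); // PID + per-partition sequence numbers
        props.put("acks", "all");                // required when idempotence is enabled
        props.put("transactional.id", "demo-tx"); // placeholder; must be stable per producer
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }
}
```

With these properties, the send path is wrapped as: producer.initTransactions() once at startup, then beginTransaction(), one or more send() calls (possibly across partitions), and commitTransaction(), calling abortTransaction() on error so consumers reading with isolation.level=read_committed never see partial writes.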

5. Interview Questions Summary

Typical interview questions derived from the article include why use a message queue, Kafka's storage model, consumption advantages, partitioning rationale, log segmentation, high‑availability mechanisms, and how to design a custom message queue.

Source: https://mp.weixin.qq.com/s/vhwUCdimvpBt5Z38pRX5xw
Tags: big data, Kafka, Consumer, producer, Distributed Messaging, exactly-once
Written by Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
