Big Data 19 min read

Kafka Overview: Background, Core Concepts, Producer/Consumer Configuration, Core Principles, Operations, and Stream Processing

This article provides a comprehensive beginner-friendly guide to Apache Kafka, covering its background, core concepts, producer and consumer settings with code examples, underlying architecture, operational monitoring, integration with Spark and Flink, and an introduction to Kafka Streams.

Big Data Technology & Architecture

Aug 21, 2021

Kafka Overview: Background, Core Concepts, Producer/Consumer Configuration, Core Principles, Operations, and Stream Processing

You can refer to the previous articles:

"What Are We Actually Learning When Studying Flink?"

"What Are We Actually Learning When Studying Spark?"

Kafka Source Reading Tips

This article explains a learning path for Kafka from a beginner's perspective, covering background, core concepts, core principles, source code reading, and practical applications. It is intended as a roadmap; readers should dive deeper into the parts most relevant to their own needs.

Kafka Background

Kafka was developed and open‑sourced by LinkedIn as a distributed, high‑performance messaging engine. It has become the de‑facto data‑pipeline technology in the big‑data era, offering high reliability, high throughput, high availability, and scalability, backed by an active community.

Key roles of Kafka in the big‑data ecosystem:

Message system: decouples services, provides durable storage, traffic shaping, buffering, asynchronous communication, scalability, and recoverability.

Storage system: its persistence and replication allow Kafka to serve as a long‑term data store.

Streaming platform: offers a complete streaming library with windows, joins, transformations, and aggregations, functioning as a distributed stream‑processing platform.

Kafka Getting Started

Before using Kafka you should understand basic messaging concepts, Kafka terminology, role definitions, and version evolution.

Core concepts include:

Message (Record): the primary data unit processed by Kafka.

Topic: a logical container for messages, often used to separate business domains.

Partition: an ordered, immutable sequence of messages; a topic can have multiple partitions.

Offset: a monotonically increasing position identifier for each message within a partition.

Replica: copies of a partition’s log for redundancy; includes leader and follower replicas.

Producer: an application that publishes messages to a topic.

Consumer: an application that subscribes to a topic to read messages.

Consumer Offset: tracks each consumer’s progress within a partition.

Consumer Group: a set of consumer instances that jointly consume partitions to achieve high throughput.

Rebalance: automatic redistribution of partitions among consumers when a member leaves or joins the group.

ISR (In‑Sync Replica): the set of replicas that are currently in sync with the leader.

HW (High Watermark): the offset up to which all ISR replicas have received data; messages beyond HW are not visible to consumers.

LEO (Log End Offset): the offset of the last message in a replica’s log.

Kafka Producer and Consumer

The producer is responsible for sending messages to Kafka. Important configuration parameters include:

acks

max.request.size

retries and retry.backoff.ms

Key required properties (described in Chinese in the original source):

必选属性有3个：

bootstrap.servers：该属性指定broker的地址清单，地址的格式为host：port。清单里不需要包含所有的broker地址，生产者会从给定的broker里查询其他broker的信息。不过最少提供2个broker的信息，一旦其中一个宕机，生产者仍能连接到集群上。
key.serializer:生产者接口允许使用参数化类型，可以把Java对象作为键和值传broker，但是broker希望收到的消息的键和值都是字节数组，所以，必须提供将对象序列化成字节数组的序列化器。key.serializer必须设置为实现org.apache.kafka.common.serialization.Serializer的接口类，默认为

org.apache.kafka.common.serialization.StringSerializer，也可以实现自定义的序列化器。
value.serializer:同上。

可选参数：

acks：指定了必须要有多少个分区副本收到消息，生产者才会认为写入消息是成功的，这个参数对消息丢失的可能性有重大影响。

acks=0：生产者在写入消息之前不会等待任何来自服务器的响应，容易丢消息，但是吞吐量高。

acks=1：只要集群的首领节点收到消息，生产者会收到来自服务器的成功响应。如果消息无法到达首领节点（比如首领节点崩溃，新首领没有选举出来），生产者会收到一个错误响应，为了避免数据丢失，生产者会重发消息。不过，如果一个没有收到消息的节点成为新首领，消息还是会丢失。默认使用这个配置。

acks=all：只有当所有参与复制的节点都收到消息，生产者才会收到一个来自服务器的成功响应。延迟高。

buffer.memory:设置生产者内存缓冲区的大小，生产者用它缓冲要发送到服务器的消息。

max.block.ms:指定了在调用send()方法或者使用partitionsFor()方法获取元数据时生产者的阻塞时间。当生产者的发送缓冲区已满，或者没有可用的元数据时，这些方法就会阻塞。在阻塞时间达到max.block.ms时，生产者会抛出超时异常。

batch.size:当多个消息被发送同一个分区时，生产者会把它们放在同一个批次里。该参数指定了一个批次可以使用的内存大小，按照字节数计算。当批次内存被填满后，批次里的所有消息会被发送出去。

retries:指定生产者可以重发消息的次数。

receive.buffer.bytes和send.buffer.bytes:指定TCP socket接受和发送数据包的缓存区大小。如果它们被设置为-1，则使用操作系统的默认值。如果生产者或消费者处在不同的数据中心，那么可以适当增大这些值，因为跨数据中心的网络一般都有比较高的延迟和比较低的带宽。

linger.ms:指定了生产者在发送批次前等待更多消息加入批次的时间。

A typical producer example:

public class KafkaProducer {
    public static final String brokerList = "localhost:9092";
    public static final String topic = "topic-demo";

    public static Properties initConfig(){
        Properties props = new Properties();
        props.put("bootstrap.servers", brokerList);
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("client.id", "producer.client.id.demo");
        return props;
    }

    public static void main(String[] args) {
        Properties props = initConfig();
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        ProducerRecord<String, String> record =
                new ProducerRecord<>(topic, "Hello, Kafka!");
        try {
            producer.send(record);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The consumer counterpart uses KafkaConsumer to subscribe to topics and poll messages.

A typical consumer example:

public class KafkaConsumer {
    public static final String brokerList = "localhost:9092";
    public static final String topic = "topic-demo";
    public static final String groupId = "group.demo";
    public static final AtomicBoolean isRunning = new AtomicBoolean(true);

    public static Properties initConfig(){
        Properties props = new Properties();
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("bootstrap.servers", brokerList);
        props.put("group.id", groupId);
        props.put("client.id", "consumer.client.id.demo");
        return props;
    }

    public static void main(String[] args) {
        Properties props = initConfig();
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList(topic));

        try {
            while (isRunning.get()) {
                ConsumerRecords<String, String> records = 
                    consumer.poll(Duration.ofMillis(1000));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("topic = " + record.topic() 
                            + ", partition = "+ record.partition() 
                            + ", offset = " + record.offset());
                    System.out.println("key = " + record.key()
                            + ", value = " + record.value());
                    //do something to process record.
                }
            }
        } catch (Exception e) {
            log.error("occur exception ", e);
        } finally {
            consumer.close();
        }
    }
}

Core Principles of Kafka

The most important design principles include:

Storage mechanism

Replication and replica strategy

Log design

Controller component

Rebalance algorithm

Reliability design

Latency handling, dead‑letter queues, and retry mechanisms

Operations and Monitoring

If you are responsible for running a Kafka cluster, you should be familiar with the following tools and practices:

Topic management

Replica and message management

Permission management

Common operational scripts and utilities

Cross‑cluster backup

Kafka Source Code Reading

Refer to the article "Kafka Source Reading Tips" for guidance on navigating the code base.

Kafka Applications

Kafka is often used together with Spark or Flink for real‑time processing.

For Spark integration, the Maven dependency is:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
</dependency>

For Flink integration, the Maven dependency is:

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka-0.8_2.10</artifactId>
</dependency>

Kafka's Ambitions – Kafka Streams

Kafka Streams is a lightweight library that turns Kafka into a full‑featured stream‑processing engine. Its characteristics include:

Simple, lightweight library that can be embedded in any Java application without external dependencies.

Leverages Kafka’s partitioning for horizontal scalability and ordering guarantees.

Provides fault‑tolerant state stores for efficient stateful operations such as windowed joins and aggregations.

Supports exactly‑once processing semantics.

Offers record‑level processing for millisecond‑level latency.

Enables event‑time windows and handling of late‑arriving records.

Provides low‑level Processor API (similar to Storm’s spout/bolt) and high‑level DSL (similar to Spark’s map/group/reduce).

Kafka Streams gives developers direct control over application execution, making debugging and deployment straightforward.

As one of the most mature frameworks in the big‑data ecosystem, Kafka continues to evolve rapidly and remains essential for every big‑data developer.

Hello, I am Wang Zhiwu, a hardcore original author in the big‑data field. I have worked on backend architecture, data middleware, data platforms & architecture, and algorithm engineering. I focus on real‑time dynamics, technical improvement, personal growth, and career advancement in the big‑data domain. Feel free to follow.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Java Streaming consumer producer

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.