
Master Kafka: A Complete Learning Roadmap from Basics to Advanced Projects

This guide presents a step‑by‑step Kafka learning roadmap covering core concepts, architecture, configuration, monitoring tools, practical project ideas, advanced components like Streams and KSQL, plus code samples and resource recommendations to help beginners become proficient in real‑time data streaming.

Big Data Tech Team

Core Concepts

Apache Kafka is a distributed streaming platform that provides high‑throughput, low‑latency, fault‑tolerant messaging. It is used to build real‑time data pipelines and streaming applications.

Producer – client that publishes records to a Kafka topic.

Consumer – client that reads records, optionally as part of a consumer group for parallel consumption.

Broker – server that stores topic partitions and serves read/write requests.

Topic – logical name for a stream of records; each topic is split into partitions for scalability (see the topic‑creation sketch after this list).

Partition – ordered, immutable sequence of records; each record has an offset.
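
To make topics and partitions concrete, here is a minimal sketch that creates a topic with the Java AdminClient; the topic name, partition count, and replication factor are illustrative assumptions, not part of the original roadmap.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

Properties adminProps = new Properties();
adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

// Create a "logs" topic with 3 partitions and replication factor 1 (single-broker setup assumed)
try (AdminClient admin = AdminClient.create(adminProps)) {
    NewTopic topic = new NewTopic("logs", 3, (short) 1);
    admin.createTopics(Collections.singletonList(topic)).all().get();
}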

Architecture Overview

Cluster consists of one or more brokers; replication factor determines how many brokers store copies of each partition.

Producers can choose a partition by key hashing or round‑robin (see the sketch after this overview).

Consumers in the same group share the partitions of a topic, ensuring each partition is consumed by only one member.

ZooKeeper (or KRaft in newer versions) coordinates broker metadata and controller election.
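
As a small illustration of the partition‑selection rules above (topic, keys, and values are hypothetical): a keyed record is routed by hashing its key, an un‑keyed record is spread across partitions, and a partition can also be chosen explicitly.

import org.apache.kafka.clients.producer.ProducerRecord;

// Same key -> same partition (key hashing), which preserves per-key ordering
ProducerRecord<String, String> keyed = new ProducerRecord<>("user-actions", "user-42", "login");

// No key -> records are distributed across partitions (round-robin / sticky batching)
ProducerRecord<String, String> unkeyed = new ProducerRecord<>("user-actions", "anonymous page view");

// Explicit partition (0) overrides key-based routing
ProducerRecord<String, String> pinned = new ProducerRecord<>("user-actions", 0, "user-42", "login");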

Key Configuration Parameters

Broker

log.retention.hours – how long logs are kept.

message.max.bytes – maximum size of a single record.

num.partitions – default number of partitions for newly created topics.

Producer

bootstrap.servers – list of broker addresses.

acks – durability level (0, 1, all).

batch.size and linger.ms – control batching.

compression.type – gzip, snappy, lz4, or zstd.
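
A minimal sketch of these producer settings (the values are illustrative, not tuned recommendations):

import org.apache.kafka.clients.producer.ProducerConfig;
import java.util.Properties;

Properties producerProps = new Properties();
producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
producerProps.put(ProducerConfig.ACKS_CONFIG, "all");             // wait for all in-sync replicas
producerProps.put(ProducerConfig.BATCH_SIZE_CONFIG, 32768);       // batch up to 32 KB per partition
producerProps.put(ProducerConfig.LINGER_MS_CONFIG, 10);           // wait up to 10 ms to fill a batch
producerProps.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // gzip, snappy, lz4, or zstd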

Consumer

group.id – identifier of the consumer group.

enable.auto.commit – whether offsets are committed automatically.

auto.offset.reset – earliest or latest when no committed offset exists.
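
When enable.auto.commit is set to false, offsets are committed explicitly after processing; a minimal sketch (group and topic names are illustrative):

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

Properties consumerProps = new Properties();
consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "manual-commit-group");
consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
consumerProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
consumer.subscribe(Collections.singletonList("logs"));
ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
// ... process the batch ...
consumer.commitSync(); // commit only after processing, so a crash re-delivers rather than loses records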

Message Guarantees

At‑least‑once – the default; retries after failures can produce duplicates.

At‑most‑once – disable retries and commit offsets before processing; records may be lost but are never duplicated.

Exactly‑once – enable the idempotent producer and the transactional APIs.
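
A minimal sketch of the exactly‑once producer path (the transactional.id and topic name are illustrative assumptions):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

Properties txProps = new Properties();
txProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
txProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
txProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
txProps.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
txProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "logs-tx-1"); // must be unique per producer instance

KafkaProducer<String, String> txProducer = new KafkaProducer<>(txProps);
txProducer.initTransactions();
try {
    txProducer.beginTransaction();
    txProducer.send(new ProducerRecord<>("logs", "key-1", "transactional message"));
    txProducer.commitTransaction();
} catch (Exception e) {
    txProducer.abortTransaction(); // on fatal errors (e.g. a fenced producer) close the producer instead
} finally {
    txProducer.close();
}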

Monitoring & Operations

Important metrics: records-in-per-sec, bytes-out-per-sec, request-latency-avg, under-replicated-partitions.

Tools: Kafka Manager, Confluent Control Center, Prometheus + Grafana (via kafka-exporter), Kafka Monitor.

Typical troubleshooting steps: check broker logs, verify ZooKeeper/KRaft health, examine consumer lag, inspect partition ISR status.
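
Consumer lag can be checked with the kafka-consumer-groups.sh CLI or programmatically; a rough sketch with the Java AdminClient (the group id is illustrative) might look like:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

Properties adminProps = new Properties();
adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

try (AdminClient admin = AdminClient.create(adminProps)) {
    // Offsets the group has committed so far
    Map<TopicPartition, OffsetAndMetadata> committed =
        admin.listConsumerGroupOffsets("log-consumer-group").partitionsToOffsetAndMetadata().get();

    // Latest end offsets for the same partitions
    Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
        admin.listOffsets(committed.keySet().stream()
            .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()))).all().get();

    // Lag = end offset - committed offset, per partition
    committed.forEach((tp, meta) ->
        System.out.printf("%s lag=%d%n", tp, endOffsets.get(tp).offset() - meta.offset()));
}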

Code Examples

Java producer:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("acks", "all");
props.put("compression.type", "gzip");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
// Send 100 log messages to the "logs" topic
for (int i = 0; i < 100; i++) {
    ProducerRecord<String, String> record = new ProducerRecord<>("logs", "Log message " + i);
    producer.send(record);
}
producer.close(); // close() flushes any records still buffered

Java consumer:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "log-consumer-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("enable.auto.commit", "true");
props.put("auto.offset.reset", "earliest");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("logs"));
// Poll in a loop; offsets are committed automatically (enable.auto.commit=true)
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("offset=%d, key=%s, value=%s%n", record.offset(), record.key(), record.value());
    }
}

KSQL stream for user actions (Avro format):

CREATE STREAM user_actions (
    user_id VARCHAR,
    action VARCHAR,
    timestamp BIGINT
) WITH (KAFKA_TOPIC='user-actions', VALUE_FORMAT='AVRO');

SELECT user_id, COUNT(*) AS action_count
FROM user_actions
WINDOW TUMBLING (SIZE 1 MINUTE)
GROUP BY user_id
HAVING COUNT(*) > 100;

Kafka Connect MySQL source configuration (JSON; hostnames, credentials, and the logical server name are placeholders):

{
  "name": "mysql-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "localhost",
    "database.port": "3306",
    "database.user": "root",
    "database.password": "password",
    "database.server.id": "184054",
    "database.include.list": "mydb",
    "table.include.list": "mydb.users",
    "database.history.kafka.bootstrap.servers": "localhost:9092",
    "database.history.kafka.topic": "schema-changes.mydb"
  }
}
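
This JSON is typically submitted to the Kafka Connect REST API (POST /connectors, port 8083 by default); a rough sketch using the JDK HTTP client, assuming the definition above is saved as mysql-connector.json, might look like:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Read the connector definition shown above and POST it to the Connect worker
String connectorJson = Files.readString(Path.of("mysql-connector.json"));

HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8083/connectors"))
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
    .build();

HttpResponse<String> response = HttpClient.newHttpClient()
    .send(request, HttpResponse.BodyHandlers.ofString());
System.out.println(response.statusCode() + " " + response.body());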

Kafka Streams data processor (Java):

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.common.serialization.Serdes;
import java.util.Properties;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "data-processor");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream("user-actions");
KStream<String, String> processed = source.mapValues(String::toUpperCase);
processed.to("processed-user-actions");

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));

Advanced Ecosystem Components

Kafka Streams – library for building stateful stream processing applications; supports windowing, joins, and exactly‑once semantics (see the windowed‑count sketch after this list).

KSQL – SQL‑like interface for interactive queries over streams and tables.

Schema Registry – central service for managing Avro/JSON schemas and ensuring compatibility.

Kafka Connect – framework for scalable, fault‑tolerant ingestion and egress; connectors exist for databases, Elasticsearch, S3, etc.

Confluent Platform – commercial distribution that adds security, multi‑region replication, and advanced monitoring tools.
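
As a minimal illustration of the windowing support mentioned in the Kafka Streams entry above (topic names and the one‑minute window are illustrative assumptions), a topology that counts actions per user in tumbling windows might look like:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;
import java.time.Duration;

StreamsBuilder builder = new StreamsBuilder();

// Key = user id, value = action name
KStream<String, String> actions = builder.stream("user-actions");

// Count actions per user in 1-minute tumbling windows
KTable<Windowed<String>, Long> counts = actions
    .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
    .count();

counts.toStream().foreach((windowedUser, count) ->
    System.out.printf("user=%s window=%s count=%d%n",
        windowedUser.key(), windowedUser.window(), count));

// Wire this builder into KafkaStreams with the same StreamsConfig properties as the earlier example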

Practical Project Outline

Set up a single‑node Kafka cluster (download from https://kafka.apache.org/downloads, extract, then start ZooKeeper and the Kafka broker).

Expand to a multi‑node cluster by configuring broker.id, listeners, and zookeeper.connect on each host.

Implement a log‑collection pipeline: producers write log lines to a logs topic, a consumer stores them in a file or database.

Build a real‑time analytics job using Kafka Streams or KSQL to aggregate sensor data or social‑media events.

Enable compression, idempotent producer, and transactions to achieve exactly‑once delivery.

Deploy Prometheus with kafka-exporter and Grafana dashboards to monitor latency, throughput, and under‑replicated partitions.

References

Official Apache Kafka documentation: https://kafka.apache.org/documentation/

Kafka Streams documentation: https://kafka.apache.org/documentation/streams/

KSQL documentation: https://ksql.io/

Schema Registry project: https://github.com/confluentinc/schema-registry

Kafka Connect documentation: https://kafka.apache.org/documentation/#connect
