
Master Kafka: From Core Concepts to Real-World Deployment

This comprehensive guide explains Kafka’s architecture, core APIs, topics and partitions, deployment steps, multi‑broker clustering, and practical use cases such as messaging, log aggregation, stream processing, and data import/export with Kafka Connect, providing a hands‑on tutorial for developers and engineers.

MaGe Linux Operations

1. Understanding Kafka

1.1 Kafka Overview

Kafka is a distributed streaming platform. Official site: http://kafka.apache.org/.

The platform provides three key functions:

Publish and subscribe to record streams, similar to a message queue.

Store record streams fault‑tolerantly.

Process records as they occur.

Kafka is typically used for two major application types:

Building reliable real‑time data pipelines between systems or applications.

Building real‑time stream processing applications that transform or react to data streams.

Key concepts include:

Kafka runs as a cluster on one or more servers, which can span multiple data centers.

The cluster stores record streams in categories called topics.

Each record contains a key, a value, and a timestamp.

Kafka exposes four core APIs:

Producer API – allows applications to publish records to one or more topics.

Consumer API – allows applications to subscribe to topics and process the resulting record streams.

Streams API – lets applications act as stream processors, consuming from input topics, transforming data, and producing to output topics.

Connector API – enables building and running connectors that link Kafka topics to external systems such as databases.

Communication between clients and servers uses a simple, high‑performance, language‑agnostic TCP protocol that is versioned and backward compatible. Kafka provides a Java client and clients for many other languages.

1.2 Topics and Partitions

A topic is a category to which records are published; each topic is divided into one or more partitions, each of which is an append‑only log stored on disk.

Each partition is an ordered, immutable sequence of records, each assigned a sequential offset that uniquely identifies it within the partition.
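The offset model described above can be illustrated with a small in-memory sketch. This is a toy stand-in, not Kafka's implementation: the names Record and PartitionLog are hypothetical, and a real partition lives on disk in segment files.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    """One record: a key, a value, and a timestamp."""
    key: Optional[str]
    value: str
    timestamp: float


@dataclass
class PartitionLog:
    """Hypothetical in-memory model of a single partition."""
    records: List[Record] = field(default_factory=list)

    def append(self, key, value, timestamp=None):
        """Append a record to the end of the log and return its offset."""
        ts = time.time() if timestamp is None else timestamp
        self.records.append(Record(key, value, ts))
        return len(self.records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self.records[offset:offset + max_records]
```

Records are only ever appended, so an offset, once assigned, always refers to the same record within that partition.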

Kafka retains all published records, whether or not they have been consumed, for a configurable retention period (e.g., two days); once the period expires, they are discarded to free space.
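As a rough illustration of time-based retention, the sketch below filters a list of (offset, timestamp) pairs. The function name apply_retention is an assumption made for this example, and the model is simplified: Kafka itself removes whole log segments rather than individual records.

```python
TWO_DAYS = 2 * 24 * 60 * 60  # retention period in seconds


def apply_retention(records, now, retention_seconds=TWO_DAYS):
    """Keep only (offset, timestamp) pairs that are still within retention.

    Surviving records keep their original offsets; nothing is renumbered.
    """
    return [(offset, ts) for offset, ts in records
            if now - ts <= retention_seconds]
```

Note that the surviving records keep their offsets, so a consumer's stored position remains meaningful after older data has been discarded.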

Consumers read records in the order stored in the log, and each consumer group provides load‑balanced consumption across its instances. Offsets are controlled by the consumer, allowing replay or skipping to the latest record.
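Because the position is owned by the consumer, replaying or skipping ahead amounts to changing a single number. The sketch below models that idea with a plain Python list as the log; ToyConsumer and its method names are hypothetical and only loosely echo the real client's poll and seek operations.

```python
class ToyConsumer:
    """Hypothetical consumer that tracks its own position in a list-backed log."""

    def __init__(self, log):
        self.log = log
        self.position = 0  # offset of the next record to read

    def poll(self, max_records=1):
        """Return the next records and advance the position past them."""
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch

    def seek(self, offset):
        """Move to an arbitrary offset, e.g. 0 to replay from the start."""
        if not 0 <= offset <= len(self.log):
            raise ValueError(f"offset {offset} out of range")
        self.position = offset

    def seek_to_end(self):
        """Skip everything currently in the log and wait for new records."""
        self.position = len(self.log)
```

Reading does not remove anything from the log, which is why several consumers can hold different positions in the same partition.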

Partitions enable scalability and parallelism: each partition can be placed on a different server, and the number of partitions caps how many consumers in a group can read the topic in parallel.

1.3 Distribution

Partitions of a topic are distributed across multiple servers in the Kafka cluster. Each server (broker) handles reads and writes for the partitions assigned to it. Replication creates multiple copies of each partition on different brokers for fault tolerance. For each partition, one broker acts as the leader and handles all reads and writes while the followers replicate it; if the leader fails, one of the followers takes over as the new leader.
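The leader and follower roles can be sketched as follows. This is a deliberately simplified model with hypothetical names: it promotes the first surviving replica in the list, whereas Kafka elects the new leader from the set of in-sync replicas.

```python
class ReplicatedPartition:
    """Hypothetical model of one partition replicated across several brokers."""

    def __init__(self, replicas):
        self.replicas = list(replicas)  # broker ids, in preference order
        self.alive = set(self.replicas)
        self.leader = self.replicas[0]

    def fail(self, broker_id):
        """Mark a broker as failed and promote a follower if it was the leader."""
        self.alive.discard(broker_id)
        if broker_id == self.leader:
            survivors = [b for b in self.replicas if b in self.alive]
            # No survivors means the partition is unavailable.
            self.leader = survivors[0] if survivors else None
```

With three replicas, the partition in this model stays available after two broker failures and becomes unavailable only when the last replica is lost.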

1.4 Producers and Consumers

1.4.1 Producers

Producers publish data to specified topics and can choose the target partition using strategies such as round‑robin or custom algorithms.
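A minimal sketch of two such strategies is shown below: round-robin for records without a key, and a hash of the key otherwise, so that records with the same key always land in the same partition. ToyPartitioner is a hypothetical name, and zlib.crc32 is used here only because it is deterministic and in the standard library; it is not the hash function Kafka's own partitioner uses.

```python
import itertools
import zlib


class ToyPartitioner:
    """Hypothetical partition chooser: round-robin without a key, hash with one."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = itertools.count()

    def partition(self, key=None):
        """Return the partition index for a record with the given key."""
        if key is None:
            # Spread keyless records evenly over all partitions.
            return next(self._counter) % self.num_partitions
        # Same key -> same partition, which preserves per-key ordering.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions
```

Keying records by, say, a user id is a common way to keep all events for one user in order, since ordering is guaranteed only within a partition.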

1.4.2 Consumers

Each consumer belongs to a consumer group; within a group, each partition is read by exactly one consumer.

If all instances share the same group, records are load‑balanced across them.

If instances belong to different groups, each record is broadcast to all groups.

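As an example, consider a two‑broker cluster hosting four partitions and serving two consumer groups. If one group has two consumers, each of them is assigned two partitions; if the other group has four consumers, each is assigned one. Both groups still receive every record.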
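The assignment logic can be sketched for four partitions read by a group of two consumers and by a group of four. The function below hands out contiguous ranges of partitions; it is an illustrative assumption, not a reproduction of any specific Kafka assignor, and the consumer names are made up.

```python
def assign(partitions, consumers):
    """Split partitions into contiguous ranges, one range per consumer.

    Each partition goes to exactly one consumer in the group; consumers
    beyond the number of partitions receive an empty list and stay idle.
    """
    consumers = sorted(consumers)
    base, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for index, consumer in enumerate(consumers):
        size = base + (1 if index < extra else 0)
        assignment[consumer] = partitions[start:start + size]
        start += size
    return assignment


group_a = assign([0, 1, 2, 3], ["a1", "a2"])
group_b = assign([0, 1, 2, 3], ["b1", "b2", "b3", "b4"])
print(group_a)  # {'a1': [0, 1], 'a2': [2, 3]}
print(group_b)  # {'b1': [0], 'b2': [1], 'b3': [2], 'b4': [3]}
```

Each group is assigned independently, which is what gives load balancing inside a group and broadcast across groups.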

1.5 Kafka as a Messaging System

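Traditional messaging systems offer either a queuing or a publish‑subscribe model. Kafka's consumer groups combine both: within a group, records are divided among the consumers as in a queue, while separate groups each receive every record as in publish‑subscribe. Every topic can therefore have multiple subscribers, and Kafka adds per‑partition ordering guarantees, high throughput, built‑in partitioning, replication, and fault tolerance.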

1.6 Kafka as a Storage System

Kafka acts as a durable log that stores published records on disk with replication.

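Writes are persisted to disk and replicated; a producer can wait for acknowledgement so that a write is not considered complete until it is fully replicated.

The on‑disk structures scale well, so Kafka behaves the same whether a server holds a small or a very large amount of data.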

Kafka can be viewed as a high‑performance, low‑latency distributed commit log.

1.7 Kafka for Stream Processing

Beyond simple read/write, Kafka supports real‑time stream processing.

The Streams API enables applications to consume from input topics, perform stateful transformations, and produce to output topics.

It helps with hard problems such as handling out‑of‑order data, reprocessing input after code changes, and performing stateful computations.
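To make the consume, transform, produce pattern concrete, here is a toy stateful word count. It uses plain Python iterables in place of topics and does not use the Streams API itself; the function name is hypothetical.

```python
from collections import Counter


def word_count_stream(input_topic):
    """Consume lines, keep a running count per word, and emit each update.

    input_topic is any iterable of text lines; every yielded (word, count)
    pair stands for a record written to an output topic.
    """
    counts = Counter()  # local state that survives across input records
    for line in input_topic:
        for word in line.lower().split():
            counts[word] += 1
            yield (word, counts[word])
```

Because the function is a generator, it emits an updated count as each input record arrives instead of waiting for the input to end, which is the essential difference between stream and batch processing.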

2. Kafka Use Cases

2.1 Messaging

Kafka can replace traditional message brokers, offering higher throughput, built‑in partitioning, replication, and durability, making it suitable for large‑scale messaging applications.

2.2 Website Activity Tracking

Kafka can ingest user activity streams (page views, searches) into separate topics for real‑time processing, monitoring, or offline analysis in Hadoop or data warehouses.

2.3 Metrics

Kafka is often used to aggregate operational metrics from distributed applications into a centralized feed.

2.4 Log Aggregation

Kafka abstracts log files as message streams, enabling low‑latency processing and easier consumption from multiple sources compared to systems like Scribe or Flume.

2.5 Stream Processing

Complex pipelines can read from Kafka topics, transform data, and write to new topics for downstream consumption, e.g., news recommendation pipelines using Kafka Streams, Apache Storm, or Apache Samza.

2.6 Event Sourcing

Kafka’s support for massive append‑only logs makes it an excellent backend for event‑sourced applications.

2.7 Commit Log

Kafka can serve as an external commit log for distributed systems, aiding data replication and node recovery.

3. Installing Kafka

3.1 Download and Extract

[root@along ~]# wget http://mirrors.shu.edu.cn/apache/kafka/2.1.0/kafka_2.11-2.1.0.tgz
[root@along ~]# tar -C /data/ -xvf kafka_2.11-2.1.0.tgz
[root@along ~]# cd /data/kafka_2.11-2.1.0/

3.2 Configure and Start Zookeeper

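This Kafka release requires Zookeeper, and both run on the JVM. Install Java, then start the Zookeeper instance bundled with Kafka. The commands in this guide call the scripts by name, which works only if Kafka's bin directory (/data/kafka_2.11-2.1.0/bin) is on the PATH: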

[root@along ~]# yum -y install java-1.8.0
[root@along ~]# nohup zookeeper-server-start.sh /data/kafka_2.11-2.1.0/config/zookeeper.properties &

3.3 Configure Kafka

broker.id=0
listeners=PLAINTEXT://localhost:9092
log.dirs=/tmp/kafka-logs
num.partitions=1
zookeeper.connect=localhost:2181
# other properties omitted for brevity

3.4 Start Kafka Service

[root@along ~]# service kafka start
Starting kafka (via systemctl): [  OK  ]
[root@along ~]# ss -nutl
tcp LISTEN 0 50 :::9092 :::*
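The same check can be scripted. The following sketch simply attempts a TCP connection to the listener; the function name port_is_open is an assumption for this example, and a successful connection shows only that something is listening on the port, not that the broker is healthy.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

For the configuration above, port_is_open("localhost", 9092) should return True once the broker has started.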

4. Simple Kafka Quick‑Start

4.1 Create a Topic

[root@along ~]# kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic along
Created topic "along".

4.2 Send Messages

[root@along ~]# kafka-console-producer.sh --broker-list localhost:9092 --topic along
>This is a message
>This is another message

4.3 Start a Consumer

[root@along ~]# kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic along

5. Setting Up a Multi‑Broker Cluster

5.1 Prepare Configuration Files

# Copy config/server.properties once per additional broker, then override:

# config/server-1.properties
broker.id=1
listeners=PLAINTEXT://:9093
log.dirs=/tmp/kafka-logs-1

# config/server-2.properties
broker.id=2
listeners=PLAINTEXT://:9094
log.dirs=/tmp/kafka-logs-2
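For more than a couple of brokers, the per-broker files can be generated rather than edited by hand. The sketch below follows the pattern used above (port 9092 plus the broker id, and a log directory suffixed with the id); the helper names are hypothetical, and reading or writing the actual files is left out.

```python
def broker_overrides(broker_id, base_port=9092):
    """Return the settings that must differ for each broker on one host."""
    return {
        "broker.id": str(broker_id),
        "listeners": f"PLAINTEXT://:{base_port + broker_id}",
        "log.dirs": f"/tmp/kafka-logs-{broker_id}",
    }


def override_properties(base_text, overrides):
    """Return base_text with the given keys replaced, or appended if absent."""
    remaining = dict(overrides)
    out = []
    for line in base_text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_setting = "=" in line and not line.lstrip().startswith("#")
        if is_setting and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)  # comments and untouched settings pass through
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"
```

Calling override_properties on the contents of server.properties with broker_overrides(1) yields text suitable for server-1.properties. Each broker needs a unique id, port, and log directory because all three run on the same machine.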

5.2 Start Additional Brokers

nohup kafka-server-start.sh /data/kafka_2.11-2.1.0/config/server-1.properties &
nohup kafka-server-start.sh /data/kafka_2.11-2.1.0/config/server-2.properties &

5.3 Create Replicated Topic

[root@along ~]# kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic
Created topic "my-replicated-topic".

5.4 Test Fault Tolerance

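Find the current leader of the partition (for example with kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic), kill that broker's process, and verify that one of the remaining replicas takes over as leader and that previously written messages can still be consumed.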
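When scripting this check, the leader and in-sync replica set can be extracted from the per-partition line that kafka-topics.sh --describe prints. The sketch below assumes that line contains "Leader:", "Replicas:", and "Isr:" fields followed by broker ids, as in the Kafka quickstart; the sample lines in the usage note are illustrative and were not captured from the cluster in this article.

```python
import re


def parse_partition_line(line):
    """Extract leader, replicas, and in-sync replicas from a describe line."""

    def field(name):
        match = re.search(name + r":\s*([-\d,]+)", line)
        if match is None:
            raise ValueError(f"no {name} field in: {line!r}")
        return match.group(1)

    return {
        "leader": int(field("Leader")),
        "replicas": [int(b) for b in field("Replicas").split(",")],
        "isr": [int(b) for b in field("Isr").split(",")],
    }
```

For an illustrative line such as "Topic: my-replicated-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0", the function returns leader 1 with all three brokers in sync. After the leader is killed, one would expect a different leader and the failed broker to be missing from the Isr list.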

6. Using Kafka Connect for Import/Export

6.1 Prepare Sample Data

# echo -e "foo\nbar" > test.txt

6.2 Start Standalone Connectors

connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties

The source connector reads lines from test.txt and writes them to the connect-test topic; the sink connector reads from that topic and writes to test.sink.txt.
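The data flow can be mimicked without Kafka to show what each connector is responsible for. In this sketch a Python list stands in for the connect-test topic, and file_source and file_sink are hypothetical helpers; real connectors run continuously and keep their positions in Connect's offset storage instead of returning them.

```python
def file_source(path, topic, lines_read=0):
    """Publish lines of the file not seen before; return the new line count."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    topic.extend(lines[lines_read:])
    return len(lines)


def file_sink(topic, path, records_written=0):
    """Append records not yet written to the sink file; return the new count."""
    with open(path, "a", encoding="utf-8") as f:
        for value in topic[records_written:]:
            f.write(value + "\n")
    return len(topic)
```

Calling both functions again with the returned positions after appending to the source file copies only the new lines, which mirrors how the running connectors pick up later additions.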

6.3 Verify Data Flow

# cat test.sink.txt
foo
bar

Consume the topic directly:

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic connect-test --from-beginning

Append more lines to the source file and observe them appear in both the sink file and the topic.
