Master Kafka: From Core Concepts to Real-World Deployment
This comprehensive guide explains Kafka’s architecture, core APIs, topics and partitions, deployment steps, multi‑broker clustering, and practical use cases such as messaging, log aggregation, stream processing, and data import/export with Kafka Connect, providing a hands‑on tutorial for developers and engineers.
1. Understanding Kafka
1.1 Kafka Overview
Kafka is a distributed streaming platform. Official site: http://kafka.apache.org/.
The platform provides three key functions:
Publish and subscribe to record streams, similar to a message queue.
Store record streams fault‑tolerantly.
Process records as they occur.
Kafka is typically used for two major application types:
Building reliable real‑time data pipelines between systems or applications.
Building real‑time stream processing applications that transform or react to data streams.
Key concepts include:
Kafka runs as a cluster on one or more servers, which can span multiple data centers.
The cluster stores record streams in categories called topics.
Each record contains a key, a value, and a timestamp.
Kafka exposes four core APIs:
Producer API – allows applications to publish records to one or more topics.
Consumer API – allows applications to subscribe to topics and process the resulting record streams.
Streams API – lets applications act as stream processors, consuming from input topics, transforming data, and producing to output topics.
Connector API – enables building and running connectors that link Kafka topics to external systems such as databases.
Communication between clients and servers uses a simple, high‑performance, language‑agnostic TCP protocol that is versioned and backward compatible. Kafka provides a Java client and clients for many other languages.
1.2 Topics and Partitions
A topic is a category of messages; each topic is divided into one or more partitions, and each partition is stored on disk as an append‑only log.
Each partition is an ordered, immutable sequence of records, each assigned a sequential offset that uniquely identifies it within the partition.
Kafka retains all published records, whether or not they have been consumed, for a configurable retention period (e.g., two days), after which they are discarded to free space.
Consumers read records in the order in which they are stored within a partition, and each consumer group provides load‑balanced consumption across its instances. Offsets are controlled by the consumer, which can rewind to an older offset to replay records or skip ahead to the latest record.
Partitions enable scalability and parallelism: each partition can be placed on a different server, and the number of partitions sets the maximum number of consumers in a group that can read in parallel.
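The offset, retention, and replay behaviour described above can be sketched with a toy in-memory model. This is plain Python, not Kafka client code; the names `Record`, `PartitionLog`, `expire`, and `read_from` are hypothetical and exist only for this illustration, and timestamps are passed in by hand so the result is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    offset: int        # position within the partition, assigned on append
    timestamp: float   # seconds; supplied by the caller in this sketch
    key: str
    value: str


@dataclass
class PartitionLog:
    """Toy model of one partition: an append-only list of records."""
    records: list = field(default_factory=list)
    next_offset: int = 0

    def append(self, key, value, timestamp):
        rec = Record(self.next_offset, timestamp, key, value)
        self.records.append(rec)
        self.next_offset += 1
        return rec.offset

    def expire(self, now, retention_seconds):
        # Drop records older than the retention period, consumed or not.
        # Surviving records keep their original offsets.
        self.records = [r for r in self.records
                        if now - r.timestamp <= retention_seconds]

    def read_from(self, offset):
        # The consumer, not the log, decides where reading starts.
        return [r for r in self.records if r.offset >= offset]


log = PartitionLog()
for i, word in enumerate(["a", "b", "c", "d"]):
    log.append(key="k", value=word, timestamp=float(i))

first_pass = [r.value for r in log.read_from(2)]   # start at offset 2
replay = [r.value for r in log.read_from(0)]       # rewind and replay everything

log.expire(now=4.0, retention_seconds=2.0)         # keeps timestamps 2.0 and 3.0
remaining_offsets = [r.offset for r in log.records]
```

Because the log only stores records and the reader supplies the starting offset, replaying is just a second call to `read_from` with a smaller offset. After `expire`, the surviving records still carry offsets 2 and 3, which mirrors the idea that an offset identifies a record for as long as it is retained.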
1.3 Distribution
Partitions of a topic are distributed across multiple servers in the Kafka cluster. Each server (broker) is responsible for the read/write of its assigned partitions. Replication creates multiple copies of each partition on different brokers for fault tolerance. One broker acts as the leader for a partition, handling all reads and writes; if the leader fails, a follower takes over.
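As a rough illustration of replication and leader failover, the sketch below places each partition's replicas on consecutive brokers and promotes the next surviving replica when a broker fails. It is a simplified model under stated assumptions: the function names are hypothetical, the first replica is treated as leader, and Kafka's actual placement and election logic is more involved than this.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on consecutive brokers (round-robin).

    The first replica in each list is treated as the leader.
    """
    if replication_factor > len(brokers):
        raise ValueError("replication factor exceeds broker count")
    assignment = {}
    for p in range(num_partitions):
        assignment[p] = [brokers[(p + i) % len(brokers)]
                         for i in range(replication_factor)]
    return assignment


def leaders_after_failure(assignment, failed_broker):
    """Return the leader of each partition once failed_broker is gone."""
    leaders = {}
    for p, replicas in assignment.items():
        alive = [b for b in replicas if b != failed_broker]
        leaders[p] = alive[0] if alive else None  # None: partition offline
    return leaders


assignment = assign_replicas(num_partitions=3, brokers=[0, 1, 2],
                             replication_factor=2)
before = {p: replicas[0] for p, replicas in assignment.items()}
after = leaders_after_failure(assignment, failed_broker=0)
```

With three brokers and a replication factor of two, losing broker 0 moves leadership of partition 0 to broker 1, while partitions 1 and 2 keep their leaders.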
1.4 Producers and Consumers
1.4.1 Producers
Producers publish data to the topics of their choice and decide which partition each record goes to, using strategies such as round‑robin for load balancing or a custom partitioning function, for example one based on the record key.
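The two partitioning strategies can be sketched as follows. This is an illustrative model rather than the Kafka client's partitioner: the function names are hypothetical, and `zlib.crc32` is a stand-in hash chosen for this sketch (the Java client hashes keys with murmur2).

```python
import itertools
import zlib


def make_round_robin_partitioner(num_partitions):
    """Return a function that cycles through partitions 0..n-1."""
    counter = itertools.cycle(range(num_partitions))
    return lambda key=None: next(counter)


def key_hash_partitioner(key, num_partitions):
    """Map a key to a partition so equal keys always land together.

    zlib.crc32 is a stand-in hash chosen for this sketch.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


round_robin = make_round_robin_partitioner(3)
rr_sequence = [round_robin() for _ in range(5)]   # spreads load evenly

p1 = key_hash_partitioner("user-42", 3)
p2 = key_hash_partitioner("user-42", 3)           # same key, same partition
```

Round-robin balances load but scatters related records, whereas hashing the key keeps all records for one key in one partition, which is what preserves their relative order.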
1.4.2 Consumers
Each consumer belongs to a consumer group; only one consumer in the group reads from a given partition.
If all instances share the same group, records are load‑balanced across them.
If instances belong to different groups, each record is broadcast to all groups.
For example, consider a two‑broker cluster that hosts four partitions and serves two consumer groups: within each group the four partitions are divided among that group's consumers, and both groups receive every record.
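A minimal sketch of how four partitions might be divided within two consumer groups is shown below. The group sizes (two and four consumers), the consumer names, and the round-robin strategy are assumptions made for illustration; Kafka's own assignors may distribute partitions differently.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over one group's consumers, round-robin.

    Every partition gets exactly one consumer within the group.
    """
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]

# Each group is assigned independently, so both groups see all four partitions.
group_a = assign_partitions(partitions, ["C1", "C2"])
group_b = assign_partitions(partitions, ["C3", "C4", "C5", "C6"])
```

In group A each consumer handles two partitions, and in group B each handles one. Adding a fifth consumer to group B would leave it idle, because a partition is never shared by two consumers of the same group.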
1.5 Kafka as a Messaging System
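Traditional messaging systems offer either a queuing or a publish‑subscribe model. Kafka's consumer groups combine both: within a group, records are divided among consumers as in a queue, while every subscribing group receives every record as in publish‑subscribe. On top of this, Kafka provides multi‑subscriber topics with per‑partition ordering guarantees, high throughput, built‑in partitioning, replication, and fault tolerance.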
1.6 Kafka as a Storage System
Kafka acts as a durable log that stores published records on disk with replication.
Records are written to disk and replicated, and a producer can choose to wait for an acknowledgement that is sent only once the write is fully replicated.
The disk layout scales from small to very large data volumes.
Kafka can be viewed as a high‑performance, low‑latency distributed commit log.
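The idea of acknowledging a write only after full replication can be modelled in a few lines. This is a deliberately simplified sketch: the class and its `reachable_followers` parameter are hypothetical, and it requires every follower to hold the record, whereas a real cluster tracks a set of in-sync replicas (the producer setting involved is `acks=all`).

```python
class ReplicatedPartition:
    """Toy model: one leader log plus follower logs, all in memory."""

    def __init__(self, num_followers):
        self.leader = []
        self.followers = [[] for _ in range(num_followers)]

    def write(self, record, reachable_followers):
        """Append to the leader, copy to the reachable followers, and
        acknowledge only if every follower log now matches the leader."""
        self.leader.append(record)
        for idx in reachable_followers:
            self.followers[idx].append(record)
        return all(f == self.leader for f in self.followers)


partition = ReplicatedPartition(num_followers=2)
ack_full = partition.write("r0", reachable_followers=[0, 1])
ack_partial = partition.write("r1", reachable_followers=[0])  # follower 1 missed it
```

The second write is stored on the leader and one follower but is not acknowledged, so a producer waiting for full replication would know it cannot yet treat that record as safely stored.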
1.7 Kafka for Stream Processing
Beyond simple read/write, Kafka supports real‑time stream processing.
The Streams API enables applications to consume from input topics, perform stateful transformations, and produce to output topics.
It solves challenges such as out‑of‑order data, reprocessing after code changes, and stateful calculations.
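To make "stateful transformation" concrete, here is a word-count processor written in plain Python. It is not the Streams API (which is a Java library); lists stand in for the input and output topics, and the function name is hypothetical.

```python
from collections import defaultdict


def word_count_processor(input_topic):
    """Consume lines, keep a running count per word, and emit an
    updated (word, count) record for every word seen."""
    counts = defaultdict(int)   # the processor's state
    output_topic = []
    for line in input_topic:
        for word in line.lower().split():
            counts[word] += 1
            output_topic.append((word, counts[word]))
    return output_topic, dict(counts)


input_topic = ["kafka streams", "hello kafka"]
output_topic, state = word_count_processor(input_topic)
```

The output for the second occurrence of "kafka" depends on state built up from the first line, which is what distinguishes a stateful transformation from a per-record map. Reprocessing after a code change amounts to running the function again over the retained input.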
2. Kafka Use Cases
2.1 Messaging
Kafka can replace traditional message brokers, offering higher throughput, built‑in partitioning, replication, and durability, making it suitable for large‑scale messaging applications.
2.2 Website Activity Tracking
Kafka can ingest user activity streams (page views, searches) into separate topics for real‑time processing, monitoring, or offline analysis in Hadoop or data warehouses.
2.3 Metrics
Kafka is often used to aggregate operational metrics from distributed applications into a centralized feed.
2.4 Log Aggregation
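Kafka abstracts away the details of log files and presents log or event data as a stream of messages, which allows lower‑latency processing and easier support for multiple data sources and distributed consumption. Compared with log‑centric systems such as Scribe or Flume, Kafka offers stronger durability guarantees through replication.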
2.5 Stream Processing
Complex pipelines can read from Kafka topics, transform data, and write to new topics for downstream consumption, e.g., news recommendation pipelines using Kafka Streams, Apache Storm, or Apache Samza.
2.6 Event Sourcing
Kafka’s support for massive append‑only logs makes it an excellent backend for event‑sourced applications.
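Event sourcing treats the log of events, rather than the current state, as the source of truth. The sketch below rebuilds an account balance by replaying an append-only list of events; the event types and function names are invented for this illustration, and a Python list stands in for a Kafka topic.

```python
def apply_event(balance, event):
    """Apply one (kind, amount) event to the balance."""
    kind, amount = event
    if kind == "deposit":
        return balance + amount
    if kind == "withdraw":
        return balance - amount
    raise ValueError(f"unknown event type: {kind}")


def rebuild_state(event_log, upto=None):
    """Fold events, in log order, into a balance.

    upto limits the replay to the first `upto` events; None replays all.
    """
    balance = 0
    for event in event_log[:upto]:
        balance = apply_event(balance, event)
    return balance


event_log = [("deposit", 100), ("withdraw", 30), ("deposit", 5)]
current = rebuild_state(event_log)            # replay the whole log
after_two = rebuild_state(event_log, upto=2)  # state as of the second event
```

Because the events are never modified, any earlier state can be recovered by replaying a prefix of the log, which is why a large, durable append-only log suits this style of application.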
2.7 Commit Log
Kafka can serve as an external commit log for distributed systems, aiding data replication and node recovery.
3. Installing Kafka
3.1 Download and Extract
[root@along ~]# wget http://mirrors.shu.edu.cn/apache/kafka/2.1.0/kafka_2.11-2.1.0.tgz
[root@along ~]# tar -C /data/ -xvf kafka_2.11-2.1.0.tgz
[root@along ~]# cd /data/kafka_2.11-2.1.0/
3.2 Configure and Start ZooKeeper
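This Kafka release requires ZooKeeper. Install Java, then start ZooKeeper. The commands in this guide call the Kafka scripts by bare name, so Kafka's bin directory (here /data/kafka_2.11-2.1.0/bin) must be on the PATH: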
[root@along ~]# yum -y install java-1.8.0
[root@along ~]# nohup zookeeper-server-start.sh /data/kafka_2.11-2.1.0/config/zookeeper.properties &
3.3 Configure Kafka
Edit config/server.properties. The key settings are:
broker.id=0
listeners=PLAINTEXT://localhost:9092
log.dirs=/tmp/kafka-logs
num.partitions=1
zookeeper.connect=localhost:2181
# other properties omitted for brevity
3.4 Start Kafka Service
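The service command below assumes that a kafka service script has already been installed on the host; that script is not shown here. Without one, the broker can be started the same way as the additional brokers in section 5.2: nohup kafka-server-start.sh /data/kafka_2.11-2.1.0/config/server.properties &
[root@along ~]# service kafka start
Starting kafka (via systemctl): [ OK ]
[root@along ~]# ss -nutl
tcp LISTEN 0 50 :::9092 :::*
4. Simple Kafka Quick‑Start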
4.1 Create a Topic
[root@along ~]# kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic along
Created topic "along".
4.2 Send Messages
[root@along ~]# kafka-console-producer.sh --broker-list localhost:9092 --topic along
>This is a message
>This is another message
4.3 Start a Consumer
[root@along ~]# kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic along
5. Setting Up a Multi‑Broker Cluster
5.1 Prepare Configuration Files
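# Copy config/server.properties once per additional broker, then change these settings in each copy.
# config/server-1.properties
broker.id=1
listeners=PLAINTEXT://:9093
log.dirs=/tmp/kafka-logs-1
# config/server-2.properties
broker.id=2
listeners=PLAINTEXT://:9094
log.dirs=/tmp/kafka-logs-2
5.2 Start Additional Brokers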
nohup kafka-server-start.sh /data/kafka_2.11-2.1.0/config/server-1.properties &
nohup kafka-server-start.sh /data/kafka_2.11-2.1.0/config/server-2.properties &
5.3 Create Replicated Topic
[root@along ~]# kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic
Created topic "my-replicated-topic".
5.4 Test Fault Tolerance
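To test fault tolerance, find the current leader of the partition with kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic, kill that broker's process, and run the describe command again. A different replica should now be listed as leader, and the remaining brokers should continue to serve reads and writes.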
6. Using Kafka Connect for Import/Export
6.1 Prepare Sample Data
# echo -e "foo\nbar" > test.txt
6.2 Start Standalone Connectors
connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties
The source connector reads lines from test.txt and writes them to the connect-test topic; the sink connector reads from that topic and writes to test.sink.txt.
6.3 Verify Data Flow
# cat test.sink.txt
foo
bar
Consume the topic directly:
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic connect-test --from-beginning
Append more lines to the source file and observe them appear in both the sink file and the topic.
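The source-to-topic-to-sink flow can be mimicked in plain Python to show what each connector is responsible for. This sketch does not use Kafka Connect: a list stands in for the connect-test topic, the helper functions are hypothetical, files are created in a temporary directory, and records are kept as bare strings (a real Connect worker may wrap each line according to its configured converter).

```python
import os
import tempfile


def file_source(path, topic, lines_already_read):
    """Append any new lines in the file to the topic.

    Returns the total number of lines read so far.
    """
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    topic.extend(lines[lines_already_read:])
    return len(lines)


def file_sink(path, topic, records_already_written):
    """Append any new topic records to the sink file.

    Returns the total number of records written so far.
    """
    with open(path, "a", encoding="utf-8") as f:
        for record in topic[records_already_written:]:
            f.write(record + "\n")
    return len(topic)


workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "test.txt")
sink_path = os.path.join(workdir, "test.sink.txt")
topic = []  # stands in for the connect-test topic

# Initial data, as in section 6.1.
with open(source_path, "w", encoding="utf-8") as f:
    f.write("foo\nbar\n")
read_pos = file_source(source_path, topic, 0)
written_pos = file_sink(sink_path, topic, 0)

# Appending to the source file flows through on the next pass.
with open(source_path, "a", encoding="utf-8") as f:
    f.write("Another line\n")
read_pos = file_source(source_path, topic, read_pos)
written_pos = file_sink(sink_path, topic, written_pos)

with open(sink_path, encoding="utf-8") as f:
    sink_lines = f.read().splitlines()
```

Each side remembers only how far it has progressed (`read_pos`, `written_pos`), so the source and sink never talk to each other directly; the topic in the middle decouples them, which is the role Kafka plays between the two connectors.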
