Deep Dive into Kafka Architecture: Topics, Partitions, and Reliable Data Pipelines
This article explains Kafka’s core concepts—including topics, partitions, log segmentation, indexing, and acknowledgment mechanisms—then provides a step‑by‑step guide to deploy a Zookeeper‑Kafka cluster integrated with Filebeat, Logstash, and the ELK stack for reliable log collection and analysis.
Kafka Architecture Deep Dive
In Kafka, messages are categorized by topic . Producers publish to a topic and consumers read from a topic.
Topic is a logical concept, while partition is a physical concept. Each partition corresponds to a log file that stores the data produced by the producer. Data is continuously appended to the end of the log file, and each record has its own offset. Consumer groups track the offset they have consumed to enable fault‑tolerant recovery.
To avoid oversized log files, Kafka splits each partition into multiple segments . Every segment consists of two files: a .index file and a .log file. The files are stored in a directory named topicName‑partitionNumber. For example, a topic test with three partitions creates directories test‑0, test‑1, and test‑2.
Index and log files are named after the offset of the first message in the current segment.
The .index file holds extensive index information, while the .log file stores the actual message data. Index metadata points to the physical offset of each message in the log file.
Data Reliability Guarantees
When a producer sends data to a partition, the partition acknowledges receipt (ACK). The producer proceeds only after receiving the ACK; otherwise it retries.
Data Consistency Concepts
LEO (Log End Offset) is the highest offset of any replica. HW (High Watermark) is the largest offset visible to consumers, i.e., the smallest LEO among all replicas.
Follower Failure
If a follower fails, it is temporarily removed from the ISR (In‑Sync Replicas). After recovery, the follower reads its last HW from local disk, truncates any log beyond HW, and synchronizes from the leader until its LEO reaches the partition’s HW, then rejoins the ISR.
Leader Failure
When the leader fails, a new leader is elected from the ISR. Remaining followers truncate any log beyond the new leader’s HW and then synchronize data from the new leader. This ensures replica consistency but does not guarantee zero data loss or duplication.
Acknowledgment Levels
Kafka offers three reliability levels controlled by request.required.acks:
0 : Producer does not wait for any broker acknowledgment. Highest throughput, lowest reliability.
1 (default): Producer waits for acknowledgment from the leader only. If the leader fails before followers replicate, data may be lost.
-1 or all : Producer waits for acknowledgment from all in‑sync replicas. Highest reliability; if the leader fails after followers have replicated but before sending ACK, duplicates may occur.
Since Kafka 0.11, idempotent producers ensure that duplicate sends are persisted only once on the broker.
Deploying Filebeat + Kafka + ELK
Environment Preparation
node1: 192.168.67.11 elasticsearch kibana
node2: 192.168.67.12 elasticsearch
apache: 192.168.67.10 logstash apache/nginx/mysql
Filebeat node: 192.168.67.13 Filebeat
zk‑kfk01: 192.168.67.21 zookeeper, kafka
zk‑kfk02: 192.168.67.22 zookeeper, kafka
zk‑kfk03: 192.168.67.23 zookeeper, kafka
systemctl stop firewalld
systemctl enable firewalld
setenforce 01. Deploy Zookeeper + Kafka Cluster
Restart services as needed, e.g.:
systemctl restart elasticsearch.service
netstat -antp | grep 9200
cd /usr/local/src/elasticsearch-head/
npm run start &2. Deploy Filebeat
cd /etc/filebeat
vim filebeat.yml
# Example snippet
filebeat.prospectors:
- type: log
enabled: true
paths:
- /var/log/httpd/access_log
tags: ["access"]
- type: log
enabled: true
paths:
- /var/log/httpd/error_log
tags: ["error"]
output.kafka:
enabled: true
hosts: ["192.168.67.21:9092","192.168.67.22:9092","192.168.67.23:9092"]
topic: "httpd"After editing, restart Filebeat:
systemctl restart filebeat.service
systemctl status filebeat.service
./filebeat -e -c filebeat.yml3. Deploy ELK (Logstash Configuration)
cd /etc/logstash/conf.d/
vim kafka.conf
input {
kafka {
bootstrap_servers => "192.168.67.21:9092,192.168.67.22:9092,192.168.67.23:9092"
topics => "httpd"
type => "httpd_kafka"
codec => "json"
auto_offset_reset => "latest"
decorate_events => true
}
}
output {
if "access" in [tags] {
elasticsearch { hosts => ["192.168.67.11:9200"] index => "httpd_access-%{+YYYY.MM.dd}" }
}
if "error" in [tags] {
elasticsearch { hosts => ["192.168.67.11:9200"] index => "httpd_error-%{+YYYY.MM.dd}" }
}
stdout { codec => rubydebug }
}Start Logstash (adjust data path if needed):
logstash -f kafka.conf --path.data=/opt4. Verify and Browse
Check Elasticsearch indices: curl -X GET "192.168.67.11:9200/_cat/indices?v" Access Kibana at http://192.168.67.11:5601/, create an index pattern httpd_access-*, and explore the collected logs.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
