Big Data 11 min read

Deep Dive into Kafka Architecture: Topics, Partitions, and Reliable Data Pipelines

This article explains Kafka’s core concepts—including topics, partitions, log segmentation, indexing, and acknowledgment mechanisms—then provides a step‑by‑step guide to deploy a Zookeeper‑Kafka cluster integrated with Filebeat, Logstash, and the ELK stack for reliable log collection and analysis.

Raymond Ops
Raymond Ops
Raymond Ops
Deep Dive into Kafka Architecture: Topics, Partitions, and Reliable Data Pipelines

Kafka Architecture Deep Dive

In Kafka, messages are categorized by topic . Producers publish to a topic and consumers read from a topic.

Topic is a logical concept, while partition is a physical concept. Each partition corresponds to a log file that stores the data produced by the producer. Data is continuously appended to the end of the log file, and each record has its own offset. Consumer groups track the offset they have consumed to enable fault‑tolerant recovery.

To avoid oversized log files, Kafka splits each partition into multiple segments . Every segment consists of two files: a .index file and a .log file. The files are stored in a directory named topicName‑partitionNumber. For example, a topic test with three partitions creates directories test‑0, test‑1, and test‑2.

Index and log files are named after the offset of the first message in the current segment.

The .index file holds extensive index information, while the .log file stores the actual message data. Index metadata points to the physical offset of each message in the log file.

Data Reliability Guarantees

When a producer sends data to a partition, the partition acknowledges receipt (ACK). The producer proceeds only after receiving the ACK; otherwise it retries.

Data Consistency Concepts

LEO (Log End Offset) is the highest offset of any replica. HW (High Watermark) is the largest offset visible to consumers, i.e., the smallest LEO among all replicas.

Follower Failure

If a follower fails, it is temporarily removed from the ISR (In‑Sync Replicas). After recovery, the follower reads its last HW from local disk, truncates any log beyond HW, and synchronizes from the leader until its LEO reaches the partition’s HW, then rejoins the ISR.

Leader Failure

When the leader fails, a new leader is elected from the ISR. Remaining followers truncate any log beyond the new leader’s HW and then synchronize data from the new leader. This ensures replica consistency but does not guarantee zero data loss or duplication.

Acknowledgment Levels

Kafka offers three reliability levels controlled by request.required.acks:

0 : Producer does not wait for any broker acknowledgment. Highest throughput, lowest reliability.

1 (default): Producer waits for acknowledgment from the leader only. If the leader fails before followers replicate, data may be lost.

-1 or all : Producer waits for acknowledgment from all in‑sync replicas. Highest reliability; if the leader fails after followers have replicated but before sending ACK, duplicates may occur.

Since Kafka 0.11, idempotent producers ensure that duplicate sends are persisted only once on the broker.

Deploying Filebeat + Kafka + ELK

Environment Preparation

node1: 192.168.67.11    elasticsearch  kibana
node2: 192.168.67.12    elasticsearch
apache: 192.168.67.10   logstash  apache/nginx/mysql
Filebeat node: 192.168.67.13   Filebeat
zk‑kfk01: 192.168.67.21   zookeeper, kafka
zk‑kfk02: 192.168.67.22   zookeeper, kafka
zk‑kfk03: 192.168.67.23   zookeeper, kafka

systemctl stop firewalld
systemctl enable firewalld
setenforce 0

1. Deploy Zookeeper + Kafka Cluster

Restart services as needed, e.g.:

systemctl restart elasticsearch.service
netstat -antp | grep 9200
cd /usr/local/src/elasticsearch-head/
npm run start &

2. Deploy Filebeat

cd /etc/filebeat
vim filebeat.yml
# Example snippet
filebeat.prospectors:
- type: log
  enabled: true
  paths:
    - /var/log/httpd/access_log
  tags: ["access"]
- type: log
  enabled: true
  paths:
    - /var/log/httpd/error_log
  tags: ["error"]

output.kafka:
  enabled: true
  hosts: ["192.168.67.21:9092","192.168.67.22:9092","192.168.67.23:9092"]
  topic: "httpd"

After editing, restart Filebeat:

systemctl restart filebeat.service
systemctl status filebeat.service
./filebeat -e -c filebeat.yml

3. Deploy ELK (Logstash Configuration)

cd /etc/logstash/conf.d/
vim kafka.conf
input {
  kafka {
    bootstrap_servers => "192.168.67.21:9092,192.168.67.22:9092,192.168.67.23:9092"
    topics => "httpd"
    type => "httpd_kafka"
    codec => "json"
    auto_offset_reset => "latest"
    decorate_events => true
  }
}
output {
  if "access" in [tags] {
    elasticsearch { hosts => ["192.168.67.11:9200"] index => "httpd_access-%{+YYYY.MM.dd}" }
  }
  if "error" in [tags] {
    elasticsearch { hosts => ["192.168.67.11:9200"] index => "httpd_error-%{+YYYY.MM.dd}" }
  }
  stdout { codec => rubydebug }
}

Start Logstash (adjust data path if needed):

logstash -f kafka.conf --path.data=/opt

4. Verify and Browse

Check Elasticsearch indices: curl -X GET "192.168.67.11:9200/_cat/indices?v" Access Kibana at http://192.168.67.11:5601/, create an index pattern httpd_access-*, and explore the collected logs.

Kafka architecture diagram
Kafka architecture diagram
Filebeat configuration
Filebeat configuration
Logstash pipeline
Logstash pipeline
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataStreamingZooKeeperKafkaELKLog ManagementFilebeat
Raymond Ops
Written by

Raymond Ops

Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.