
An Introduction to Kafka: Architecture, Core Components, Service Governance, Performance Optimizations, and Installation Guide

Kafka is a high‑throughput distributed publish‑subscribe system built from brokers, topics, partitions, offsets, producers, and consumers, with Zookeeper handling metadata and leader election. This article covers its sequential disk writes, page‑cache and zero‑copy transfers, and ISR‑based replication, then walks through installing the JDK, Zookeeper, and Kafka step by step.

Tencent Cloud Developer

May 27, 2021


Kafka is a high‑throughput distributed messaging system originally developed by LinkedIn. It follows a publish‑subscribe model and is widely used for building real‑time data pipelines and streaming applications.

Application Scenarios

Asynchronous decoupling of upstream and downstream services.

System buffering to handle mismatched throughput among services.

Peak‑shaving for short‑term traffic spikes.

Real‑time data stream processing (e.g., integration with Spark).

Kafka Topology (Replication)

Each partition has multiple replicas; the cluster is managed by Zookeeper, which stores metadata such as brokers, topics, and partitions.

Core Components

Broker : A Kafka server node that stores and forwards messages. A broker can host multiple topics.

Topic : Logical category of messages.

Partition : A topic is split into partitions, enabling parallel processing. Each partition consists of several segment files that are read and written sequentially.

Offset : The sequential position of a message within a partition. It identifies the message uniquely within that partition, not across the whole topic.

Producer : Client that publishes messages to a broker.

Consumer : Client that reads messages from brokers.

Consumer Group : A set of consumers sharing the same group ID; each partition is consumed by only one consumer within the group.

Zookeeper : Manages cluster metadata, leader election, fault detection, and load balancing.
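How these components fit together can be sketched in a few lines of Python. This is a toy in-memory model, not Kafka client code: the names choose_partition, PartitionLog, and assign_partitions are hypothetical, CRC32 stands in for the murmur2 hash used by Kafka's default partitioner, and the assignment only mimics a range-style strategy.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash, so equal keys share a partition.

    Assumption: CRC32 stands in for the murmur2 hash used by Kafka's default partitioner.
    """
    return zlib.crc32(key) % num_partitions


class PartitionLog:
    """An append-only log; a message's offset is its position in the log."""

    def __init__(self):
        self.messages = []

    def append(self, value) -> int:
        self.messages.append(value)
        return len(self.messages) - 1  # offset of the message just written


def assign_partitions(partitions, consumers):
    """Split partitions into contiguous ranges, one range per consumer in the group."""
    members = sorted(consumers)
    per_member, extra = divmod(len(partitions), len(members))
    assignment, start = {}, 0
    for i, member in enumerate(members):
        # The first `extra` members take one additional partition each.
        count = per_member + (1 if i < extra else 0)
        assignment[member] = partitions[start:start + count]
        start += count
    return assignment
```

With five partitions and two consumers, one consumer receives three partitions and the other two, and no partition is shared inside the group. If the group has more consumers than partitions, the surplus consumers receive an empty list and sit idle.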

Service Governance

Kafka ensures data reliability through leader‑follower replication. Producers write to the leader; followers replicate the data. When the producer asks for full acknowledgment (acks=all), the acknowledgment (ACK) is sent only after the data has been replicated to every replica in the in‑sync replica (ISR) set.

Data Synchronization

Each partition has one leader and multiple followers. The leader handles writes, and followers pull the data from it. The leader treats a write as successful only once every follower currently in the ISR has replicated it.

ISR (In‑Sync Replica)

Kafka does not require all followers to be synchronized; it only waits for the replicas in the ISR. Followers that fall too far behind are removed from the ISR.
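The ISR rule can be illustrated with a small simulation. This is a sketch under simplifying assumptions: the Partition class and its methods are hypothetical, and lag is counted in messages to keep the example short, whereas real brokers judge lag by time (the replica.lag.time.max.ms setting).

```python
class Partition:
    """Toy model of one partition: a leader log plus follower replication progress."""

    def __init__(self, followers, max_lag):
        self.leader_log = []
        self.fetched = {f: 0 for f in followers}  # follower -> messages replicated so far
        self.isr = set(followers)
        self.max_lag = max_lag  # assumption: lag measured in messages, not time

    def append(self, message):
        self.leader_log.append(message)

    def follower_fetch(self, follower, upto=None):
        """A follower pulls from the leader, optionally stopping early at `upto`."""
        end = len(self.leader_log)
        self.fetched[follower] = end if upto is None else min(upto, end)

    def refresh_isr(self):
        """Drop followers that lag too far behind; re-admit those that caught up."""
        end = len(self.leader_log)
        for follower, position in self.fetched.items():
            if end - position > self.max_lag:
                self.isr.discard(follower)
            else:
                self.isr.add(follower)

    def committed(self):
        """Count of messages replicated by the leader and every follower in the ISR."""
        end = len(self.leader_log)
        return min([end] + [self.fetched[f] for f in self.isr])
```

In this model a slow follower holds back the committed count only while it is still in the ISR. Once refresh_isr() removes it, the leader waits only for the remaining in-sync followers.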

Fault Recovery & Leader Election

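When a leader fails, its Zookeeper session expires and the Kafka controller, a broker that is itself elected through Zookeeper, promotes a follower from the ISR to leader. Zab is the protocol Zookeeper uses internally for its own consistency; partition leaders are not chosen by a Zab vote. Producers then reconnect to the new leader.

A typical failure sequence:

Producer sends message to leader → leader stores data → ACK is lost due to failure.

A new leader is chosen from the ISR → producer retries with the new leader.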
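The failure sequence above can be sketched as follows. The helper names elect_leader and send_with_retry are hypothetical, broker IDs are plain integers, and the selection rule (first assigned replica that is both alive and in the ISR) is a simplified assumption about how a new leader is picked, not a copy of Kafka's code.

```python
def elect_leader(replicas, isr, live):
    """Return the first assigned replica that is alive and in the ISR, else None."""
    for broker in replicas:
        if broker in isr and broker in live:
            return broker
    return None


def send_with_retry(message, lookup_leader, send, max_attempts=3):
    """Send a message, looking up the current leader again before every attempt.

    Returns (leader, attempts_used) on success.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        leader = lookup_leader()
        try:
            send(leader, message)
            return leader, attempt
        except ConnectionError as err:
            last_error = err  # ACK never arrived; try again with a fresh leader
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


# Scenario: broker 1 leads, stores the message, then dies before its ACK arrives.
replicas = [1, 2, 3]
isr = {1, 2}
live = {1, 2, 3}


def lookup_leader():
    return elect_leader(replicas, isr, live)


def send(leader, message):
    if leader == 1:
        live.discard(1)  # simulate the leader failing after it stored the data
        raise ConnectionError("ACK lost")


result = send_with_retry("order-42", lookup_leader, send)
```

Here the first attempt goes to broker 1 and fails, and the second succeeds against broker 2. Note that in this scenario the message may already have been stored before the ACK was lost, so a plain retry can write it twice.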

Why Kafka Is Fast

Sequential Disk Writes : Messages are appended sequentially, avoiding random‑seek overhead.

Page Cache : Kafka relies on the OS page cache instead of JVM buffers, reducing GC pauses and enabling zero‑copy transfers.

Zero‑Copy : Uses system calls like sendfile() to transfer data directly from the page cache to the network socket, skipping copies through user space and reducing context switches.

Partition Segmentation : Each partition is stored in multiple segment files, allowing binary search on offsets for fast lookups.

Compression : Supports codecs such as Gzip, Snappy, and LZ4 to reduce bandwidth and storage usage.
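The segment lookup mentioned under Partition Segmentation can be sketched with Python's bisect module. The function names are hypothetical and the index is simplified to a sorted list of (relative offset, byte position) pairs; the 20-digit, zero-padded file name follows Kafka's convention of naming each segment after its first offset.

```python
from bisect import bisect_right


def find_segment(base_offsets, target):
    """Binary-search sorted segment base offsets for the segment holding `target`."""
    i = bisect_right(base_offsets, target) - 1
    if i < 0:
        raise KeyError(f"offset {target} is older than the first segment")
    return base_offsets[i]


def segment_filename(base_offset):
    """Segment files are named after their first offset, zero-padded to 20 digits."""
    return f"{base_offset:020d}.log"


def find_position(index_entries, base_offset, target):
    """Return the byte position of the closest index entry at or before `target`.

    `index_entries` is a sparse, sorted list of (relative_offset, byte_position).
    A reader would scan forward from the returned position to reach the exact message.
    """
    relative = target - base_offset
    i = bisect_right(index_entries, (relative, float("inf"))) - 1
    return index_entries[i][1] if i >= 0 else 0
```

Because both searches are logarithmic, locating an offset costs roughly the same whether the partition holds a handful of segments or thousands.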

Installation Guide

1. Install JDK

Check available Java packages:

yum -y list Java*

Install JDK 1.8:

yum install java-1.8.0-openjdk-devel.x86_64

Verify installation:

java -version

2. Install Zookeeper

Download and extract the package:

tar -zxvf zookeeper-3.4.9.tar.gz

In the conf directory of the extracted package, copy the sample configuration and edit it:

cp zoo_sample.cfg zoo.cfg

vim zoo.cfg

Key configuration parameters:

# tickTime in ms

tickTime=2000

# Ticks a follower may take to connect to and sync with the leader at startup

initLimit=10

# Ticks allowed between a request and its acknowledgment before a follower is dropped

syncLimit=5

# Data directory

dataDir=/tmp/zookeeper

# Client port

clientPort=2181
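Because initLimit and syncLimit are counted in ticks, the real timeouts depend on tickTime. The following sketch parses the settings shown above and does the arithmetic; parse_zoo_cfg is a hypothetical helper written for this example, not part of Zookeeper.

```python
def parse_zoo_cfg(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


SAMPLE = """
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/tmp/zookeeper
clientPort=2181
"""

cfg = parse_zoo_cfg(SAMPLE)
tick_ms = int(cfg["tickTime"])
init_timeout_ms = tick_ms * int(cfg["initLimit"])  # 2000 * 10 = 20000 ms
sync_timeout_ms = tick_ms * int(cfg["syncLimit"])  # 2000 * 5 = 10000 ms
```

With these values a follower has 20 seconds to complete its initial sync and 10 seconds to acknowledge a request. Doubling tickTime doubles both limits.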

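Add Zookeeper to PATH. Set ZK to the directory you actually extracted; the path below is for a 3.7.0 binary release, whereas the tar command above unpacks 3.4.9:

vim ~/.bash_profile

export ZK=/usr/local/src/apache-zookeeper-3.7.0-bin

export PATH=$PATH:$ZK/bin

Reload the profile and start Zookeeper:

source ~/.bash_profile

zkServer.sh start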

3. Install Kafka

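Download Kafka. The link below is the 2.8.0 source package, which has to be built before its scripts will run; the extract command that follows assumes a prebuilt binary archive (kafka_2.12-2.0.0.tgz) instead, so use whichever archive you actually downloaded:

https://www.apache.org/dyn/closer.cgi?path=/kafka/2.8.0/kafka-2.8.0-src.tgz

Extract the archive:

tar -xzvf kafka_2.12-2.0.0.tgz

Set environment variables: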

export KAFKA=/usr/local/src/kafka

export PATH=$PATH:$KAFKA/bin

Start Kafka server:

nohup kafka-server-start.sh /path/to/server.properties &

After these steps, Kafka is ready for use.
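A quick way to confirm that both services are listening is to try a TCP connection to each port. This is a minimal sketch: port_open and check_services are hypothetical helpers, port 2181 comes from the clientPort setting above, and 9092 is an assumption based on the default listener in Kafka's sample server.properties, so adjust it if your configuration differs.

```python
import socket


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host="localhost", services=None):
    """Map each service name to whether its port accepts connections."""
    if services is None:
        # Assumption: default ports; 9092 is not stated in this article.
        services = {"Zookeeper": 2181, "Kafka": 9092}
    return {name: port_open(host, port) for name, port in services.items()}
```

Calling check_services() on the machine that runs both processes returns a dictionary such as {"Zookeeper": True, "Kafka": True} when both are up. An open port only shows that the process is listening, not that the broker has finished registering with Zookeeper.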
