Kafka Basics: Terminology, Architecture, and Configuration Overview
This article provides a comprehensive introduction to Apache Kafka, covering its core concepts, terminology, data flow, configuration options such as compression and partitioning, producer and consumer behavior, replication, consumer groups, offset management, and log retention strategies.
Kafka is a widely used open‑source distributed messaging system; this article introduces its essential concepts from a practical perspective.
Why use Kafka? It smooths traffic spikes (peak‑shaving) and decouples services, reducing direct dependencies and simplifying development.
Key terminology:
Broker – persists messages received from clients.
Topic – logical container for messages, often used to separate business domains.
Partition – ordered, immutable sequence of messages within a topic.
Message – the primary data unit processed by Kafka.
Offset – monotonically increasing position of a message in a partition.
Replica – a copy of a partition kept for redundancy; each partition has one leader replica and zero or more follower replicas.
Producer – publishes messages to topics.
Consumer – subscribes to topics to receive messages.
Consumer Offset – tracks each consumer’s progress, stored in an internal topic.
Consumer Group – a set of consumer instances that jointly consume partitions for high throughput.
Rebalance – automatic redistribution of partitions when group membership changes.
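The relationship between partition, message, offset, and consumer offset can be sketched with a toy in-memory model. This is an illustration only, not Kafka's storage code; the class name PartitionLogSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory model of a single partition (hypothetical, for illustration).
final class PartitionLogSketch {
    private final List<String> messages = new ArrayList<>();

    // Appends a message to the end of the log and returns its offset.
    long append(String message) {
        messages.add(message);
        return messages.size() - 1L;
    }

    // Returns up to maxMessages messages starting at fromOffset.
    List<String> read(long fromOffset, int maxMessages) {
        int start = (int) Math.min(fromOffset, messages.size());
        int end = Math.min(start + maxMessages, messages.size());
        return new ArrayList<>(messages.subList(start, end));
    }

    public static void main(String[] args) {
        PartitionLogSketch partition = new PartitionLogSketch();
        partition.append("order-created");
        partition.append("order-paid");

        long consumerOffset = 0; // next offset this consumer will read
        List<String> batch = partition.read(consumerOffset, 10);
        consumerOffset += batch.size(); // progress is tracked per consumer
        System.out.println(batch + ", next offset = " + consumerOffset);
    }
}
```

Messages are never modified after being appended, and each consumer only moves its own offset forward, which is why several consumers can read the same partition independently.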
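In ZooKeeper-based deployments, ZooKeeper stores cluster metadata such as the broker list, topic definitions, partition leaders, and replica assignments. Newer Kafka versions can instead run in KRaft mode, which keeps this metadata inside Kafka itself.
Kafka defines its own binary message format: messages have a defined structure, but they are stored and transmitted as raw byte sequences.
Messaging models include point-to-point (each message is consumed by one receiver) and publish/subscribe (each message is delivered to every subscriber). Kafka can serve both through consumer groups.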
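Message compression is configured on the producer side via the compression.type property, e.g., props.put("compression.type", "gzip"). By default the broker stores batches as the producer compressed them. If the broker's own compression.type names a different algorithm (e.g., snappy while the producer uses gzip), the broker must decompress and re-compress every batch, which costs CPU and lowers throughput.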
Message decompression occurs on the consumer side after the broker delivers the raw bytes.
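A minimal sketch of the compress-on-producer, decompress-on-consumer round trip, using only the JDK's gzip streams. It does not use the Kafka client, which handles this internally once compression.type is set; the class name and the sample payload are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for what happens to a batch of messages.
final class BatchCompressionSketch {
    // "Producer side": compress a whole batch at once.
    static byte[] compress(byte[] batch) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(batch);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // "Consumer side": restore the original bytes.
    static byte[] decompress(byte[] compressed) {
        try (GZIPInputStream gzip =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gzip.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] batch = "{\"event\":\"click\"}\n".repeat(100)
                .getBytes(StandardCharsets.UTF_8);
        byte[] compressed = compress(batch);
        System.out.println(batch.length + " bytes -> " + compressed.length + " bytes");
        System.out.println("round trip ok: "
                + Arrays.equals(batch, decompress(compressed)));
    }
}
```

Compressing a whole batch rather than single messages is what makes repetitive payloads shrink well, so larger batches tend to give better compression ratios.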
Partitioning strategies are implemented by providing a custom class that implements org.apache.kafka.clients.producer.Partitioner with partition() and close() methods (plus configure(), inherited from Configurable), then setting partitioner.class on the producer. Common strategies are round-robin (the default for keyless messages in older clients; clients since version 2.4 use sticky partitioning instead), random (legacy), key-based, and custom schemes such as geographic partitioning.
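The key-based and round-robin strategies can be sketched without the Kafka client as follows. This is a simplified illustration: the class name is hypothetical, and Arrays.hashCode stands in for the murmur2 hash that Kafka's built-in partitioner applies to the serialized key.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of partition selection; not Kafka's partitioner.
final class KeyPartitioningSketch {
    private final AtomicInteger counter = new AtomicInteger();

    int partition(String key, int numPartitions) {
        if (key == null) {
            // No key: rotate through the partitions in turn (round-robin).
            return Math.floorMod(counter.getAndIncrement(), numPartitions);
        }
        // Key present: the same key always maps to the same partition.
        // Assumption: Arrays.hashCode replaces Kafka's murmur2 hash here.
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        KeyPartitioningSketch partitioner = new KeyPartitioningSketch();
        System.out.println("user-42 -> " + partitioner.partition("user-42", 6));
        System.out.println("user-42 -> " + partitioner.partition("user-42", 6));
        System.out.println("no key  -> " + partitioner.partition(null, 6));
    }
}
```

Because the mapping depends on the partition count, adding partitions to a topic changes where existing keys land, which matters when per-key ordering is required.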
Producer TCP connection management: creating a KafkaProducer starts a background Sender thread, which opens connections to the brokers listed in bootstrap.servers; connections to other brokers are opened as needed. Metadata is refreshed periodically, at the interval set by metadata.max.age.ms (default 300000 ms).
Producer message sending uses the callback-based API producer.send(msg, callback). Setting acks=all makes the leader wait until all in-sync replicas have acknowledged the write; retries enables automatic resend on transient failures.
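As a rough sketch, the producer settings mentioned so far can be collected in a java.util.Properties object. The property names are Kafka's; the broker address, the retry count, and the class name are placeholder assumptions, and the result would still have to be passed to a KafkaProducer from the kafka-clients library, which is not shown.

```java
import java.util.Properties;

// Hypothetical helper that only builds configuration; it does not connect.
final class ReliableProducerConfigSketch {
    static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");                    // wait for all in-sync replicas
        props.put("retries", "3");                   // placeholder value
        props.put("metadata.max.age.ms", "300000");  // the default, shown explicitly
        return props;
    }

    public static void main(String[] args) {
        // "localhost:9092" is an assumed address for a local test broker.
        build("localhost:9092").forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```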
The idempotent producer is enabled with props.put("enable.idempotence", true). It lets the broker discard duplicates caused by retries, but only within a single partition and a single producer session.
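The deduplication idea can be sketched as a broker that remembers the last sequence number it accepted from each producer for each partition. This is a simplified model under stated assumptions: the class and method names are hypothetical, and a real broker also rejects sequence gaps, which this sketch does not check.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of broker-side deduplication for an idempotent producer.
final class IdempotentDedupSketch {
    // Last accepted sequence number per "producerId:partition".
    private final Map<String, Integer> lastSequence = new HashMap<>();

    // Returns true if the record is appended, false if it is a duplicate retry.
    boolean tryAppend(long producerId, int partition, int sequence) {
        String key = producerId + ":" + partition;
        int last = lastSequence.getOrDefault(key, -1);
        if (sequence <= last) {
            return false; // already written; the retry must not create a copy
        }
        lastSequence.put(key, sequence);
        return true;
    }

    public static void main(String[] args) {
        IdempotentDedupSketch broker = new IdempotentDedupSketch();
        System.out.println("first send: " + broker.tryAppend(7L, 0, 0));
        System.out.println("retry:      " + broker.tryAppend(7L, 0, 0));
    }
}
```

The state is keyed by producer and partition, which matches the limits stated above: a restarted producer gets a new identity, so duplicates across sessions or across partitions are not detected this way.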
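The transactional producer guarantees atomic writes across multiple partitions. It requires transactional.id to be set in the producer configuration and uses the following API:
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(record1);
    producer.send(record2);
    producer.commitTransaction();
} catch (KafkaException e) {
    // Roll back so that neither record becomes visible to read_committed consumers.
    // For fatal errors such as ProducerFencedException, close the producer instead.
    producer.abortTransaction();
}
Consumers control visibility with isolation.level: read_uncommitted (the default) also returns messages from aborted or still-open transactions, while read_committed returns only messages from committed transactions.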
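The effect of the two isolation levels can be sketched with a toy log in which every entry carries the state of its transaction. The names below are hypothetical and the model is simplified: a read_committed reader skips aborted entries and stops at the first entry whose transaction is still open.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of consumer-side visibility; not the Kafka client.
final class IsolationLevelSketch {
    enum TxnState { COMMITTED, ABORTED, OPEN }

    static final class LogEntry {
        final String value;
        final TxnState state;

        LogEntry(String value, TxnState state) {
            this.value = value;
            this.state = state;
        }
    }

    static List<String> visible(List<LogEntry> log, boolean readCommitted) {
        List<String> result = new ArrayList<>();
        for (LogEntry entry : log) {
            if (readCommitted && entry.state == TxnState.OPEN) {
                break;     // nothing past an undecided transaction is exposed
            }
            if (readCommitted && entry.state == TxnState.ABORTED) {
                continue;  // aborted writes are filtered out
            }
            result.add(entry.value);
        }
        return result;
    }

    public static void main(String[] args) {
        List<LogEntry> log = List.of(
                new LogEntry("a", TxnState.COMMITTED),
                new LogEntry("b", TxnState.ABORTED),
                new LogEntry("c", TxnState.COMMITTED));
        System.out.println("read_uncommitted: " + visible(log, false));
        System.out.println("read_committed:   " + visible(log, true));
    }
}
```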
Broker storage uses an append‑only log; log segments are rolled and old segments are deleted based on retention policies.
Replication copies each partition's data to multiple brokers; producers write to the leader replica, and follower replicas fetch from the leader asynchronously.
Consumer groups improve throughput by allowing multiple consumers to share the load; each partition is consumed by only one consumer within the group, though a consumer may handle multiple partitions.
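A minimal sketch of how partitions might be spread across the members of a group, and how the spread changes when a member joins. The round-robin scheme and all names here are assumptions for illustration; Kafka ships several assignment strategies and this is not a copy of any of them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical partition assignment for one topic within one consumer group.
final class GroupAssignmentSketch {
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new TreeMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        // Sorted member ids make the result deterministic.
        List<String> ordered = new ArrayList<>(assignment.keySet());
        if (ordered.isEmpty()) {
            return assignment;
        }
        for (int p = 0; p < numPartitions; p++) {
            // Each partition goes to exactly one consumer in the group.
            assignment.get(ordered.get(p % ordered.size())).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        System.out.println("2 members: " + assign(List.of("c1", "c2"), 5));
        // A third member joins, so partitions are redistributed (a rebalance).
        System.out.println("3 members: " + assign(List.of("c1", "c2", "c3"), 5));
    }
}
```

With more consumers than partitions, some members receive no partitions at all, so adding consumers beyond the partition count does not increase throughput.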
Consumers pull batches of messages and acknowledge them by committing offsets. Implementations may use one thread per partition (preserving order) or separate fetch and processing threads (higher concurrency but no ordering guarantee).
Offset management moved from ZooKeeper to the internal topic __consumer_offsets. The topic is log-compacted, so only the latest committed offset per group, topic, and partition is retained, which prevents unbounded growth.
Rebalance triggers include changes in group membership, changes in the set of subscribed topics, and changes in the partition count of a subscribed topic.
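The effect of compaction on committed offsets can be sketched as keeping only the newest value per key. This is a toy model with hypothetical names; the string key is a stand-in for the group, topic, and partition that identify a committed offset.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of compaction applied to a stream of offset commits.
final class OffsetCompactionSketch {
    static String key(String group, String topic, int partition) {
        return group + "/" + topic + "/" + partition;
    }

    // Later commits for the same key overwrite earlier ones.
    static Map<String, Long> compact(List<Map.Entry<String, Long>> commits) {
        Map<String, Long> latest = new LinkedHashMap<>();
        for (Map.Entry<String, Long> commit : commits) {
            latest.put(commit.getKey(), commit.getValue());
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> commits = List.of(
                Map.entry(key("billing", "orders", 0), 10L),
                Map.entry(key("billing", "orders", 1), 4L),
                Map.entry(key("billing", "orders", 0), 25L));
        System.out.println(compact(commits));
    }
}
```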
Message ordering is guaranteed only within a single partition. Global ordering requires a topic with a single partition; key-based partitioning guarantees ordering only among messages that share the same key.
Log retention strategies include time-based (log.retention.hours), size-based (log.retention.bytes, applied per partition), or a combination of both.
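How time-based and size-based retention combine can be sketched over a list of log segments. This is a simplified model, not broker code: the names are hypothetical, and the size rule (delete the oldest segment only while the remaining log would still hold at least the configured number of bytes) is an assumption about broker behavior made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of retention applied to one partition's segments.
final class RetentionSketch {
    static final class Segment {
        final long baseOffset;
        final long sizeBytes;
        final long lastModifiedMs;

        Segment(long baseOffset, long sizeBytes, long lastModifiedMs) {
            this.baseOffset = baseOffset;
            this.sizeBytes = sizeBytes;
            this.lastModifiedMs = lastModifiedMs;
        }
    }

    // Segments are ordered oldest first; the newest (active) one is always kept.
    static List<Segment> applyRetention(List<Segment> segments, long nowMs,
                                        long retentionMs, long retentionBytes) {
        List<Segment> kept = new ArrayList<>(segments);
        // Time-based: drop old segments whose last write exceeds retentionMs.
        while (kept.size() > 1 && nowMs - kept.get(0).lastModifiedMs > retentionMs) {
            kept.remove(0);
        }
        // Size-based: a negative retentionBytes means "no size limit".
        long excess = -retentionBytes;
        for (Segment s : kept) {
            excess += s.sizeBytes;
        }
        while (retentionBytes >= 0 && kept.size() > 1
                && excess >= kept.get(0).sizeBytes) {
            excess -= kept.remove(0).sizeBytes;
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Segment> segments = List.of(
                new Segment(0, 100, 1_000),
                new Segment(100, 100, 5_000),
                new Segment(200, 100, 9_000));
        List<Segment> kept = applyRetention(segments, 10_000, 6_000, -1);
        System.out.println("oldest kept base offset: " + kept.get(0).baseOffset);
    }
}
```

Retention works on whole segments rather than individual messages, so data can outlive the configured limits until the segment that holds it becomes eligible for deletion.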
Full-Stack Internet Architecture
Introducing full-stack Internet architecture technologies centered on Java