Understanding Apache Kafka: Core Concepts, Architecture, and Use Cases

This article explains Apache Kafka as a distributed streaming platform, detailing its key features, core APIs, topic and log architecture, partitioning, consumer groups, guarantees, and how it serves both messaging and storage roles for real‑time and batch processing in big‑data environments.

Big Data Technology & Architecture

May 13, 2019

Apache Kafka® is a distributed streaming platform that enables publishing and subscribing to streams of records, provides durable storage with fault tolerance, and allows processing of records as they are generated.

Kafka’s three main characteristics are: stream publishing/subscribing similar to a message queue, durable storage with strong fault tolerance, and real‑time processing of incoming records.

Typical Kafka use cases include building real‑time data pipelines (reliable data transfer between systems, akin to a message queue) and creating real‑time stream processing applications that transform or act on the data (using Kafka Streams to move data between topics).

Key concepts: a Kafka cluster runs on one or more servers (brokers); data is organized into topics, each divided into partitions, and each partition is an ordered, immutable sequence of records; each record consists of a key, a value, and a timestamp. An offset is a sequential ID that uniquely identifies a record within its partition.
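To make the record and offset vocabulary concrete, here is a minimal in-memory sketch of a single partition. It is a toy model, not Kafka's API: the names `Record`, `PartitionLog`, `append`, and `read` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import time


@dataclass(frozen=True)
class Record:
    """An immutable record: key, value, and timestamp."""
    key: Optional[str]
    value: str
    timestamp: float


@dataclass
class PartitionLog:
    """Toy model of one partition: an append-only sequence of records."""
    records: List[Record] = field(default_factory=list)

    def append(self, key, value, timestamp=None):
        """Append a record and return its offset (its position in the log)."""
        ts = time.time() if timestamp is None else timestamp
        self.records.append(Record(key, value, ts))
        return len(self.records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self.records[offset:offset + max_records]


log = PartitionLog()
first = log.append("user-1", "login", timestamp=1.0)
second = log.append("user-1", "click", timestamp=2.0)
values = [r.value for r in log.read(0)]
```

In this sketch, offsets are simply list positions, so the first record gets offset 0 and later records are always appended after it.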

Kafka provides four core APIs: the Producer API for publishing records to topics, the Consumer API for subscribing to topics, the Streams API for building stream processing applications that consume input topics and produce output topics, and the Connector API for integrating Kafka with external systems such as relational databases.

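A topic is a named category or feed to which records are published; a topic can have zero, one, or many subscribers. For each topic, the cluster maintains a partitioned log: records are appended to a partition in order and identified by their offsets. Kafka retains all published records, whether or not they have been consumed, for a configurable retention period, so records can be read long after they were produced and are discarded once the period expires.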
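The retention behavior can be sketched with a small in-memory model. This is an illustration under simplifying assumptions, not Kafka's implementation: the class `RetainedLog` and the parameter `retention_seconds` are hypothetical, and records are dropped one by one here, whereas a real broker removes whole log segments. The point it shows is that expiring old records does not renumber the offsets of the records that remain.

```python
from collections import deque


class RetainedLog:
    """Toy partition log with time-based retention (hypothetical model)."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.base_offset = 0      # offset of the oldest retained record
        self.records = deque()    # (timestamp, value) pairs, oldest first

    def append(self, value, now):
        """Append a record written at time `now` and return its offset."""
        self.records.append((now, value))
        return self.base_offset + len(self.records) - 1

    def enforce_retention(self, now):
        """Drop records older than the retention period, consumed or not."""
        while self.records and now - self.records[0][0] > self.retention_seconds:
            self.records.popleft()
            self.base_offset += 1

    def read_from(self, offset):
        """Read from `offset`, or from the oldest retained record if later."""
        start = max(offset, self.base_offset) - self.base_offset
        return [value for _, value in list(self.records)[start:]]


log = RetainedLog(retention_seconds=60)
offset_a = log.append("a", now=0)
offset_b = log.append("b", now=50)
log.enforce_retention(now=70)      # "a" is 70s old and expires; "b" stays
remaining = log.read_from(0)
offset_c = log.append("c", now=80)
```

A consumer that asks for an expired offset in this sketch simply starts at the oldest record still retained; how a real consumer reacts in that situation depends on its configuration.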

Each consumer labels itself with a consumer group name. Within a group, each record is delivered to one member, so the members share the load; across groups, every subscribing group receives every record. Each consumer controls its own offset, so it can rewind to reprocess old records or skip ahead to the most recent ones.
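The two ideas in this paragraph, load balancing inside a group and independent positions per group, can be sketched as follows. This is a simplified model with hypothetical names (`assign_partitions`, `GroupOffsets`); round-robin is used as one possible assignment strategy, and a plain Python list stands in for a partition log.

```python
def assign_partitions(partitions, members):
    """Spread partitions over the members of one group, round-robin."""
    assignment = {member: [] for member in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


class GroupOffsets:
    """Tracks the next offset to read per (group, partition)."""

    def __init__(self):
        self.position = {}

    def poll(self, group, partition, log, max_records=2):
        """Return the next batch for this group and advance its position."""
        start = self.position.get((group, partition), 0)
        batch = log[start:start + max_records]
        self.position[(group, partition)] = start + len(batch)
        return batch

    def seek(self, group, partition, offset):
        """Move a group's position, for example to rewind and reprocess."""
        self.position[(group, partition)] = offset


# Load balancing: four partitions shared by two members of one group.
assignment = assign_partitions([0, 1, 2, 3], ["c1", "c2"])

# Broadcast: two groups read the same partition independently.
partition_log = ["r0", "r1", "r2"]
offsets = GroupOffsets()
analytics_batch = offsets.poll("analytics", 0, partition_log)
billing_batch = offsets.poll("billing", 0, partition_log)

# Rewind: the "analytics" group re-reads from the beginning.
offsets.seek("analytics", 0, 0)
replayed_batch = offsets.poll("analytics", 0, partition_log)
```

Because each partition is assigned to one member, a group with more members than partitions leaves some members idle in this model.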

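Kafka guarantees that records sent by a producer to a given topic partition are appended in the order they were sent, that a consumer sees records in the order they are stored in the log, and that a topic with replication factor N can tolerate up to N‑1 broker failures without losing any records committed to the log. Ordering holds within a partition, not across the partitions of a topic.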
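Since ordering is only guaranteed within a partition, records that must stay in order relative to each other need to land in the same partition, which is commonly achieved by giving them the same key. The sketch below illustrates that idea with a toy partitioner. It is an assumption-laden model: `choose_partition` is a hypothetical helper, and `zlib.crc32` is a stand-in hash, not the hash function Kafka's own partitioner uses.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (stand-in, not Kafka's)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Each partition is modeled as an append-only list.
partitions = defaultdict(list)

events = [
    ("alice", "a1"), ("bob", "b1"), ("alice", "a2"),
    ("bob", "b2"), ("alice", "a3"),
]
for key, value in events:
    partitions[choose_partition(key, NUM_PARTITIONS)].append((key, value))


def values_for(key):
    """Read one key's values back from its partition, in log order."""
    partition = choose_partition(key, NUM_PARTITIONS)
    return [v for k, v in partitions[partition] if k == key]


alice_values = values_for("alice")
bob_values = values_for("bob")
```

Every record for a given key goes to the same partition, so a consumer of that partition sees the key's records in the order they were sent, regardless of how records for other keys are interleaved.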

Compared with traditional messaging systems, Kafka combines the benefits of queues (scalable processing) and publish‑subscribe (multiple subscribers) while providing stronger ordering guarantees and fault‑tolerant storage.

As a storage system, Kafka writes data to disk and replicates it for durability, handles very large data volumes, and offers low‑latency, high‑throughput access, so it effectively acts as a special‑purpose distributed file system for commit logs.

For stream processing, Kafka enables continuous data flow from input topics through processing logic (simple producer/consumer APIs or the more powerful Streams API) to output topics, supporting real‑time transformations, aggregations, joins, and stateful computations.
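As an illustration of a stateful computation that reads from an input topic and writes to an output topic, here is a small word-count sketch. It does not use the Streams API: plain Python lists stand in for topics, a `Counter` stands in for the state store, and `word_count_stream` is a hypothetical name.

```python
from collections import Counter


def word_count_stream(input_records):
    """Consume lines one at a time and emit an updated count per word.

    The Counter is the processor's state; each yielded pair is a record
    for the output topic, keyed by word.
    """
    counts = Counter()
    for line in input_records:
        for word in line.lower().split():
            counts[word] += 1
            yield (word, counts[word])


input_topic = ["kafka streams", "kafka connect"]
output_topic = list(word_count_stream(input_topic))
```

Each input record produces one output record per word, so the output topic carries a running sequence of updates instead of a single final result.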

Kafka also supports batch‑style processing: because records are stored durably and can be consumed with low latency, an application can process historical data and newly arriving data in the same way, within a unified platform.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
