Understanding Apache Kafka: Architecture, Core Principles, and Use Cases
This article introduces Apache Kafka as a fast, scalable distributed publish‑subscribe system, explains its core components, ZooKeeper coordination, startup workflow, key features, and common scenarios such as log collection, activity tracking, and stream processing.
1. Introduction
Apache Kafka is a distributed publish‑subscribe messaging system originally developed at LinkedIn, open-sourced in early 2011, and donated to the Apache Software Foundation, where it became a top-level project in 2012. It provides a fast, scalable, partitioned, and replicated commit log service.
2. Basic Architecture
The main components are:
Topic – a category or feed name to which messages are published.
Producer – any entity that publishes messages to a topic.
Broker – a server that stores published messages; a Kafka cluster consists of multiple brokers.
Consumer – an entity that subscribes to one or more topics and pulls data from brokers.
In the overall data flow, producers send data to brokers, each broker hosts multiple topics, and consumers pull data from brokers.
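The relationship between these components can be sketched with a toy in-memory model. This is not the Kafka client API: the `Broker` class, its `append` method, and the `page-views` topic are hypothetical names used only for illustration, and `zlib.crc32` stands in for the hash that Kafka's own partitioner applies to record keys.

```python
import zlib
from collections import defaultdict


class Broker:
    """Toy broker: topic -> list of partitions -> append-only list of records."""

    def __init__(self, partitions_per_topic=3):
        self.partitions_per_topic = partitions_per_topic
        # A topic is created lazily the first time something is published to it.
        self.topics = defaultdict(
            lambda: [[] for _ in range(self.partitions_per_topic)]
        )

    def append(self, topic, key, value):
        """Append a record and return its (partition, offset)."""
        partitions = self.topics[topic]
        # The same key always maps to the same partition, so per-key order is kept.
        partition = zlib.crc32(key.encode("utf-8")) % len(partitions)
        partitions[partition].append((key, value))
        return partition, len(partitions[partition]) - 1


# A producer is simply any code that calls append().
broker = Broker()
first = broker.append("page-views", "user-42", "/home")
second = broker.append("page-views", "user-42", "/cart")
```

Both records share a key, so they land in the same partition with consecutive offsets. The offset is what a consumer later uses to say where it wants to read from.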
3. Core Principles
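Producers publish data to brokers, which store it; consumers pull data from brokers at their own pace for processing. The system is distributed: producers, brokers, and consumers can run on separate machines. In the architecture described here, cluster coordination goes through ZooKeeper; newer Kafka releases can instead run in KRaft mode, which does not need ZooKeeper.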
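The pull model can be illustrated with a minimal sketch in which each consumer tracks its own read position. The `PartitionLog` and `PullConsumer` classes are hypothetical and held entirely in memory; a real broker persists the log to disk and a real consumer fetches over the network.

```python
class PartitionLog:
    """Toy append-only log for a single partition."""

    def __init__(self):
        self._records = []

    def append(self, value):
        self._records.append(value)
        return len(self._records) - 1  # offset of the new record

    def fetch(self, offset, max_records=100):
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


class PullConsumer:
    """Keeps its own position and asks the log for more data when ready."""

    def __init__(self, log):
        self.log = log
        self.position = 0

    def poll(self, max_records=2):
        batch = self.log.fetch(self.position, max_records)
        self.position += len(batch)  # advance only past what was received
        return batch


log = PartitionLog()
for line in ["a", "b", "c"]:
    log.append(line)

# Two independent consumers read the same log at different speeds.
fast, slow = PullConsumer(log), PullConsumer(log)
fast_first = fast.poll()
fast_second = fast.poll()
slow_first = slow.poll(max_records=1)
```

Because reading does not remove records from the log, a slow consumer does not hold back a fast one, and either can re-read by resetting its position.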
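4. Role of ZooKeeper
ZooKeeper stores metadata for the Kafka cluster, such as broker membership and topic configuration. In the older versions this article describes, it also coordinates producers, consumers, and brokers, enabling high availability, subscription management, and load balancing. Current clients talk to the brokers directly instead.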
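Load balancing here largely comes down to dividing a topic's partitions among the consumers that share a subscription. The sketch below shows a simplified round-robin assignment; the function name `assign_round_robin` and the consumer IDs are assumptions made for illustration, and Kafka's actual assignment strategies handle more cases (multiple topics, sticky assignment).

```python
def assign_round_robin(partitions, consumers):
    """Spread partitions over consumers as evenly as possible."""
    members = sorted(consumers)
    if not members:
        raise ValueError("at least one consumer is required")
    assignment = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        assignment[members[index % len(members)]].append(partition)
    return assignment


# Six partitions shared by three consumers.
before = assign_round_robin(range(6), ["c1", "c2", "c3"])

# Consumer c2 leaves; its partitions are redistributed to the survivors.
after = assign_round_robin(range(6), ["c1", "c3"])
```

Recomputing the assignment whenever membership changes is the essence of a rebalance: every partition keeps exactly one owner in the group, so no record is processed twice by the same group.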
5. Execution Process
Typical startup sequence:
Start ZooKeeper servers.
Start Kafka broker servers.
Producers discover brokers and send messages. Early clients looked brokers up in ZooKeeper; current clients are given a list of bootstrap brokers instead.
Consumers discover brokers the same way and pull messages.
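A script that automates this sequence needs to know when each service is reachable before moving on. The helper below is a minimal sketch: `wait_for_port` and `cluster_ready` are hypothetical names, and the ports assume the conventional defaults (2181 for ZooKeeper, 9092 for a Kafka broker). An open port only shows that the process is listening, not that it has finished initialising.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection succeeds, False if timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


def cluster_ready(host="localhost", zookeeper_port=2181, broker_port=9092,
                  timeout=30.0):
    """Check ZooKeeper first, then the broker, mirroring the startup order."""
    return (wait_for_port(host, zookeeper_port, timeout)
            and wait_for_port(host, broker_port, timeout))
```

Calling `cluster_ready()` before launching producers and consumers avoids connection errors caused purely by starting clients too early.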
6. Kafka Features
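High throughput and low latency – a cluster can handle hundreds of thousands of messages per second with latency in the millisecond range, depending on hardware and configuration.
Scalability – brokers can be added to a running cluster without downtime.
Durability and reliability – messages are persisted to disk and replicated across brokers.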
Fault tolerance – can tolerate node failures.
High concurrency – thousands of clients can read/write simultaneously.
Supports both real‑time stream processing (e.g., Storm, Spark Streaming) and batch processing (e.g., Hadoop).
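The durability and fault-tolerance points can be illustrated with a toy replicated partition. This is a deliberate simplification with hypothetical names (`ReplicatedPartition`, `broker-1` and so on): every write is copied to all live replicas before it returns, whereas a real cluster tracks which followers are in sync and elects a new leader through its controller.

```python
class ReplicatedPartition:
    """Toy partition with one leader and follower replicas, all in memory."""

    def __init__(self, replica_ids):
        self.logs = {replica_id: [] for replica_id in replica_ids}
        self.leader = replica_ids[0]

    def append(self, value):
        # Simplification: copy the write to every live replica before
        # acknowledging, so all replicas always hold identical logs.
        for replica_log in self.logs.values():
            replica_log.append(value)
        return len(self.logs[self.leader]) - 1

    def fail(self, replica_id):
        """Drop a failed replica and promote a survivor if it was the leader."""
        del self.logs[replica_id]
        if not self.logs:
            raise RuntimeError("all replicas lost")
        if replica_id == self.leader:
            self.leader = sorted(self.logs)[0]

    def read(self, offset):
        return self.logs[self.leader][offset]


partition = ReplicatedPartition(["broker-1", "broker-2", "broker-3"])
partition.append("order-created")
partition.append("order-paid")
partition.fail("broker-1")  # the current leader goes down
```

After the failure a surviving replica serves the same offsets, which is why acknowledged messages survive the loss of a single node.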
7. Typical Use Cases
Log collection and centralisation.
Message decoupling between producers and consumers.
User activity tracking for web or app interactions.
Operational metrics aggregation.
Stream processing pipelines (Spark Streaming, Storm).
Event sourcing.
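Event sourcing, the last item above, relies on the log being an ordered, replayable record of changes. As a rough sketch, current state can be rebuilt by folding over the events from the beginning; the `replay` function and the account events below are invented for illustration and use a plain list in place of a topic.

```python
def replay(events, until=None):
    """Fold an ordered event log into account balances.

    If until is given, only events with an offset below it are applied,
    which reconstructs the state as of an earlier point in the log.
    """
    balances = {}
    for offset, (account, kind, amount) in enumerate(events):
        if until is not None and offset >= until:
            break
        delta = amount if kind == "deposit" else -amount
        balances[account] = balances.get(account, 0) + delta
    return balances


event_log = [
    ("alice", "deposit", 100),
    ("bob", "deposit", 50),
    ("alice", "withdraw", 30),
]

current = replay(event_log)
earlier = replay(event_log, until=2)
```

Since state is derived rather than stored, a new consumer can start from offset 0 and arrive at the same balances as one that has been running all along.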