Why Build a Kafka System? Core Use Cases and Design Principles
This article explains why Kafka is essential for activity and operational data pipelines, outlines key use cases such as news feeds, relevance ranking, security, monitoring, and reporting, and details its deployment topology, design decisions, and message persistence strategies.
Why Build This System
Kafka is a messaging system originally developed at LinkedIn for activity streams and operational data pipelines, now used by many companies as a versatile data pipeline and messaging platform.
Activity stream data is the most common data used for site usage reporting, including page views, content views, and search activity. Operational data includes server performance metrics such as CPU, I/O usage, request latency, and service logs.
In recent years, processing activity and operational data has become a critical feature of web software products, requiring more complex infrastructure.
Use Cases for Activity Streams and Operational Data
1. "News feed" functionality that broadcasts friends' activities.
2. Relevance and ranking using counts, ratings, votes, or click‑through rates.
3. Security: blocking malicious crawlers, rate‑limiting API usage, detecting spam, and supporting behavior detection and prevention.
4. Operational monitoring: real‑time, adaptive monitoring with alerting on issues.
5. Reporting and batch processing: loading data into data warehouses or Hadoop for offline analysis and business reporting.
Characteristics of Activity Stream Data
This high‑throughput, immutable data stream can be 10‑100 times larger than the next biggest data source on a site, posing a real challenge for compute capacity.
Traditional log file analysis works well for offline batch processing but has too much latency for real‑time needs. Existing messaging systems handle real‑time workloads but cope poorly with large backlogs of unconsumed messages and often treat persistence as an afterthought. Kafka aims to be a single queuing platform that supports both offline and online use cases.
Kafka offers very generic messaging semantics and, while this article focuses on activity processing, it is not limited to that purpose.
Deployment
The diagram below shows the topology of systems after deployment at LinkedIn.
A single Kafka cluster handles all activity data from various sources, providing a unified data pipeline for both online and offline consumers. Data is replicated to another data center for offline processing.
Kafka can support multi‑data‑center topologies via mirroring or synchronization. A mirror cluster acts as a consumer of the source cluster, allowing data from multiple data centers to be aggregated.
The upper part of the diagram shows two clusters without direct communication, possibly of different sizes. The lower part shows a single cluster that can mirror any number of source clusters.
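The idea that a mirror cluster is simply a consumer of its source clusters can be sketched with a toy model. Everything here is an assumption made for illustration: `MirrorCluster`, `sync`, and the use of plain lists as source logs are hypothetical and are not Kafka's mirroring tooling.

```python
class MirrorCluster:
    """Toy mirror that consumes from source logs and appends into one aggregate log."""

    def __init__(self, sources):
        self.sources = sources                          # name -> list of messages
        self.offsets = {name: 0 for name in sources}    # position reached in each source
        self.log = []                                   # aggregated copy of all sources

    def sync(self):
        """Copy any messages not yet mirrored; return how many were copied."""
        copied = 0
        for name, messages in self.sources.items():
            new = messages[self.offsets[name]:]
            self.log.extend((name, m) for m in new)
            self.offsets[name] += len(new)
            copied += len(new)
        return copied


dc_east = ["login", "page_view"]
dc_west = ["search"]
mirror = MirrorCluster({"east": dc_east, "west": dc_west})
first = mirror.sync()      # copies the three existing messages
dc_west.append("click")
second = mirror.sync()     # copies only the newly appended message
```

The source logs never learn that a mirror exists in this model; the mirror only reads from them, which is why one aggregate cluster can follow any number of sources.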
Main Design Elements
Kafka differs from most other messaging systems because of a small number of key design decisions:
1. Persistence of messages was treated as the normal case from the start.
2. The primary design constraint is throughput, not features.
3. State about which data has been consumed is kept by the consumer, not on the server.
4. Kafka is explicitly distributed: producers, brokers, and consumers are all assumed to run on multiple machines.
These decisions are explained in detail later.
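The decision to keep consumption state with the consumer can be illustrated with a toy model. This is a minimal sketch, not Kafka's API: `Log`, `Consumer`, `read`, and `poll` are hypothetical names. The server-side log only appends and serves reads by offset, while each consumer remembers its own position.

```python
class Log:
    """Toy append-only log standing in for a broker (hypothetical, not Kafka's API)."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message and return the offset it was stored at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at offset; keeps no per-consumer state."""
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Toy consumer that tracks its own position in the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0  # consumption state lives here, not on the server

    def poll(self, max_messages=10):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = Log()
for event in ["page_view", "search", "click"]:
    log.append(event)

fast, slow = Consumer(log), Consumer(log)
fast_batch = fast.poll()                  # reads all three messages
slow_batch = slow.poll(max_messages=1)    # reads only the first message
slow.offset = 0                           # rewinding is a purely local operation
replayed = slow.poll()                    # reads all three messages again
```

Because the log holds no per-consumer bookkeeping in this model, adding a consumer or rewinding one costs the server nothing.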
Fundamentals
Key terminology:
A message is the basic unit of communication. Producers publish messages to a topic, and each published message is sent to a server acting as a broker. Consumers subscribe to topics and receive the messages published to them.
Kafka is explicitly distributed: producers, consumers, and brokers can each run as a logical group across a cluster of machines. Each consumer belongs to a consumer group, and each message is delivered to exactly one process within every subscribing group. Consumer groups therefore support both queue semantics (all consumers share one group, so each message goes to just one of them) and topic semantics (each consumer is in its own group, so every consumer receives all messages). However many consumers a topic has, a message is stored only once, which matters for large data volumes.
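The delivery rule for consumer groups can be made concrete with a small simulation. This is a sketch under stated assumptions: `Topic`, `subscribe`, and `publish` are hypothetical names, and the round-robin choice of which group member receives a message is an assumption of the sketch, not a description of how Kafka assigns work.

```python
from collections import defaultdict
from itertools import count


class Topic:
    """Toy topic that delivers each message once per consumer group (hypothetical model)."""

    def __init__(self):
        self.stored = []                   # each message is stored a single time
        self.groups = defaultdict(list)    # group name -> list of member inboxes
        self._next = defaultdict(count)    # group name -> round-robin counter

    def subscribe(self, group):
        """Register a new consumer in the group and return its inbox."""
        inbox = []
        self.groups[group].append(inbox)
        return inbox

    def publish(self, message):
        self.stored.append(message)
        for group, members in self.groups.items():
            # Exactly one member of each group receives the message.
            # Round-robin selection is an assumption made for this sketch.
            index = next(self._next[group]) % len(members)
            members[index].append(message)


topic = Topic()
worker_a = topic.subscribe("workers")   # two consumers sharing one group: queue semantics
worker_b = topic.subscribe("workers")
audit = topic.subscribe("audit")        # a consumer in its own group: topic semantics
for i in range(4):
    topic.publish(f"m{i}")
```

In this run the two workers split the four messages between them, the audit consumer sees all four, and `topic.stored` still holds only four entries.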
Message Persistence and Caching
Do not fear the file system!
Kafka relies heavily on the file system for storage and caching. While disks are often considered slow, a well‑designed disk structure can be as fast as the network. The key is the access pattern: modern disks have high sequential write throughput (e.g., ~300 MB/s), but their random write performance is orders of magnitude lower.
Operating systems use read‑ahead and write‑behind caching to smooth over this difference, and they aggressively use free memory as page cache. As a result, if a process also keeps its own in‑process cache, the same data is likely to be held twice, once in the process and once in the OS page cache, which wastes memory rather than adding cache capacity.
Running on the JVM adds two further costs:
1. Java objects carry a large memory overhead, often doubling the size of the stored data or worse.
2. Garbage collection becomes slower and less predictable as the heap grows.
Therefore, relying on the file system and OS page cache is more efficient than maintaining an in‑process cache. Kafka writes all data straight to a persistent log on the file system without necessarily calling flush, so the OS handles caching and eventually writes the data to disk. Kafka also provides configurable flush policies (e.g., flush after N messages or M seconds) to bound how much data is at risk if the machine fails.
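A flush policy of the "N messages or M seconds" kind can be sketched as follows. This is an illustrative model only: `FlushingLog` and its parameters are hypothetical names, the length-prefixed record layout is an assumption, and the time limit is checked only when a message is appended, which is a simplification.

```python
import os
import tempfile
import time


class FlushingLog:
    """Toy append-only log file with a count/time flush policy (hypothetical names)."""

    def __init__(self, path, flush_messages=3, flush_seconds=5.0):
        self.file = open(path, "ab")
        self.flush_messages = flush_messages   # N: flush after this many appends
        self.flush_seconds = flush_seconds     # M: or after this many seconds
        self.unflushed = 0
        self.last_flush = time.monotonic()
        self.flush_count = 0

    def append(self, payload: bytes):
        # Assumed record layout: 4-byte big-endian length, then the payload.
        self.file.write(len(payload).to_bytes(4, "big") + payload)
        self.unflushed += 1
        elapsed = time.monotonic() - self.last_flush
        if self.unflushed >= self.flush_messages or elapsed >= self.flush_seconds:
            self.flush()

    def flush(self):
        self.file.flush()              # hand Python's buffer to the OS page cache
        os.fsync(self.file.fileno())   # ask the OS to write those pages to disk
        self.unflushed = 0
        self.last_flush = time.monotonic()
        self.flush_count += 1

    def close(self):
        self.flush()
        self.file.close()


path = os.path.join(tempfile.mkdtemp(), "topic.log")
log = FlushingLog(path, flush_messages=3, flush_seconds=60.0)
for i in range(7):
    log.append(f"event-{i}".encode())
flushes_before_close = log.flush_count   # flushes triggered by the message-count limit
log.close()
```

Between flushes the data sits in buffers and the page cache, so in this model at most `flush_messages - 1` appended records are unflushed after any `append` returns. Lower limits shrink that window at the cost of more `fsync` calls.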
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
