Big Data 13 min read

How LinkedIn Scales Kafka to Trillions of Messages: Lessons in Reliability, Cost, and Security

LinkedIn’s engineering team details how they have scaled Apache Kafka from billions to over a trillion daily messages, focusing on quotas, a new ZooKeeper‑free consumer, reliability enhancements, security features, monitoring frameworks, and ecosystem integrations to improve cost, availability, and performance.

21CTO

May 15, 2016

How LinkedIn Scales Kafka to Trillions of Messages: Lessons in Reliability, Cost, and Security

LinkedIn's data infrastructure relies heavily on Apache Kafka. Engineers have published a series of articles on its current state, scaling, open‑source strategy, and integration with the tech stack. Senior Engineering Manager Kartik Paramasivam recently shared their experience optimizing Kafka.

Quotas

Because many applications share the same Kafka cluster, abusive usage can impact performance and SLA for other apps. Even legitimate workloads, such as re‑processing an entire database, can saturate network and disks. LinkedIn introduced a feature that throttles producers and consumers when per‑second byte rates exceed a threshold, with a whitelist for high‑bandwidth users.

Developing a New Consumer

The existing Kafka consumer depends on ZooKeeper, which introduces security concerns and split‑brain scenarios. LinkedIn, together with Confluent and the open‑source community, built a new consumer that relies only on the Kafka broker, eliminating ZooKeeper dependencies. Two consumer types exist: a low‑level consumer for full partition control and a high‑level consumer that automatically balances partitions among instances. LinkedIn’s new consumer reconciles both models.

Reliability and Availability Improvements

At LinkedIn’s scale, any critical defect in a new Kafka release can severely affect reliability. The team focuses on defect detection and fixes, implementing several enhancements:

MirrorMaker lossless transfer: Modified design to confirm messages reach the target topic before marking them consumed.

Replica lag monitoring: Switched from byte‑based lag thresholds to time‑based evaluation to better detect unhealthy replicas.

New Producer pipeline: Introduced a pipelined producer for higher throughput, currently being refined.

Safe Topic deletion: Fixed numerous bugs so future Kafka versions can delete topics without instability.

Security

Security is a major focus; LinkedIn plans to enable encryption in 2015 and additional security features in 2016, adding encryption, authentication, and authorization to Kafka.

Kafka Monitoring Framework

LinkedIn is building a standardized monitoring framework that runs test applications publishing and consuming Kafka topics to verify ordering, delivery guarantees, data integrity, and end‑to‑end latency. The framework also validates that new Kafka versions are production‑ready.

Failure Testing

When a new open‑source Kafka version is released, LinkedIn runs failure tests using a framework called Simoorg, which injects low‑level faults such as disk write failures, shutdowns, and process kills.

Application Latency Monitoring

Consumers use a tool named Burrow to monitor consumption latency, providing insight into application health.

Keeping the Kafka Cluster Balanced

LinkedIn maintains cluster balance across several dimensions:

Rack awareness: Primary and replica partitions are placed in different racks to avoid simultaneous rack failures.

Even partition distribution: After applying quotas, partitions are spread evenly across brokers to maximize bandwidth.

Preventing disk and network exhaustion: Ensuring no single broker becomes a hotspot for many large partitions.

Site Reliability Engineers periodically rebalance partitions to keep the cluster healthy, implementing prototype designs for smarter placement.

Kafka in Other Data Systems

LinkedIn uses Espresso as a NoSQL store and employs Kafka as a backup mechanism, placing it on latency‑sensitive paths while optimizing for low latency and high reliability. Kafka also feeds asynchronous uploads to Venice and serves as the event source for Apache Samza, which uses Kafka for state persistence and has benefited from bug fixes and enhanced log compression.

LinkedIn’s Kafka Ecosystem

Beyond the core Kafka broker, client, and MirrorMaker, LinkedIn has built internal services for common messaging needs:

Non‑Java clients: A redesigned REST service ensures reliable delivery for non‑Java applications.

Schema registry: A mature schema service registers message schemas, enabling automatic deserialization on the consumer side.

Cost accounting: An audit topic records per‑application Kafka usage for cost analysis.

Audit system: Offline jobs report hourly and daily event counts, with REST APIs exposing usage metrics for downstream services.

Large message support: Messages larger than 1 MB are split into fragments and reassembled automatically by the client SDK.

LinkedIn’s experience with Kafka is valuable for other companies such as Yahoo!, Twitter, Netflix, and Uber, which use Kafka for data analysis and stream processing.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

scalability Streaming kafka Reliability LinkedIn

Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.