How LinkedIn Scales Kafka to Over 1 Trillion Messages Daily
LinkedIn’s engineering team details how they scaled Kafka from a few billion to over a trillion daily messages, covering quotas, a new ZooKeeper‑free consumer, reliability upgrades, security roadmaps, monitoring frameworks, failure testing, cluster balancing, and ecosystem integrations.
Background and Scale
Kafka is a core pillar of LinkedIn’s data infrastructure. When LinkedIn adopted it in July 2011, Kafka handled roughly 1 billion messages per day; it now carries more than 1 trillion daily, a 1,200-fold increase over four years, peaking at over 4.5 million messages per second and moving 1.34 PB of data weekly. Each message is consumed by an average of four applications.
Key Focus Areas
As this scale grew, LinkedIn focused on fundamentals such as reliability, cost, security, and availability, which drove deep work across many areas of Kafka.
Quotas
Because many applications share a single Kafka cluster, one misbehaving client can degrade performance for everyone else. LinkedIn therefore introduced a quota system that throttles producers and consumers once their per-second byte rate exceeds a configured threshold, while a whitelist mechanism lets trusted applications run at higher bandwidth without destabilizing the brokers.
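Open-source Kafka (0.9 and later) exposes the same idea through per-client byte-rate quotas. As a rough sketch using the stock tooling, with the client name and limits purely illustrative:

```sh
# Broker-wide defaults (server.properties, Kafka 0.9-era names):
#   quota.producer.default=10485760     # 10 MB/s per producer client id
#   quota.consumer.default=20971520     # 20 MB/s per consumer client id

# Whitelist-style override granting a trusted client id more bandwidth:
bin/kafka-configs.sh --zookeeper zk1:2181 --alter \
  --entity-type clients --entity-name trusted-mirror-maker \
  --add-config 'producer_byte_rate=104857600,consumer_byte_rate=104857600'
```

Clients that exceed their quota are not disconnected; the broker simply delays responses, which keeps the cluster stable without breaking applications.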
New Consumer Development
The traditional Kafka consumer relied on ZooKeeper, leading to security concerns and split‑brain scenarios. LinkedIn, together with Confluent and the open‑source community, built a new consumer that interacts directly with the broker, eliminating ZooKeeper dependencies. The new design reconciles low‑level and high‑level consumer semantics, allowing fine‑grained partition control while preserving automatic partition assignment.
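A minimal sketch of that broker-coordinated consumer, using the current Java client API (poll(Duration) requires kafka-clients 2.0 or newer); broker address, topic, and group names are illustrative:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NewConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // The new consumer talks only to brokers; no zookeeper.connect needed.
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "page-view-counter");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // subscribe() gives broker-coordinated automatic partition assignment
            // (the old high-level semantics); assign() would instead pin specific
            // partitions for low-level control. One API covers both styles.
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```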
Reliability and Availability Enhancements
Because critical defects occasionally surfaced in new Kafka releases, LinkedIn drove several reliability improvements:
Mirror Maker lossless transfer: Redesigned so that a message's delivery to the destination cluster is confirmed before its consumption from the source is marked complete (see the configuration sketch after this list).
Replica lag monitoring: Shifted from byte-based to time-based lag thresholds, which better detect unhealthy replicas under high load.
New producer pipeline: Introduced a pipelined producer for higher throughput, still being refined.
Safe topic deletion: Extensively tested and fixed deletion bugs, enabling reliable topic removal in upcoming Kafka versions.
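As an illustration of the lossless Mirror Maker idea, the "no data loss" settings from that era looked roughly like this (Kafka 0.8/0.9-era property names; values illustrative, not LinkedIn's exact configuration):

```properties
# producer.properties: do not lose messages in flight
acks=all
retries=2147483647
max.in.flight.requests.per.connection=1
block.on.buffer.full=true

# consumer.properties: no automatic offset commits; offsets advance
# only after the destination cluster has accepted the message
auto.commit.enable=false

# MirrorMaker is also run with --abort.on.send.failure true, so a
# failed send halts mirroring instead of silently dropping data
```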
Security Roadmap
LinkedIn plans to add encryption, authentication, and authorization features to Kafka, targeting encryption rollout in 2015 and broader security capabilities by 2016.
Kafka Monitoring Framework
A standardized monitoring suite runs test applications that publish and consume topics to verify ordering, delivery guarantees, data integrity, and end‑to‑end latency. The framework also validates that new Kafka releases are production‑ready without breaking existing clients.
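A drastically simplified version of such a probe might publish sequenced, timestamped messages and verify them on the way out. Everything here (the single-partition topic kafka-probe, broker address, message count) is a hypothetical illustration of the technique, not LinkedIn's framework:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EndToEndProbe {
    public static void main(String[] args) {
        Properties prod = new Properties();
        prod.put("bootstrap.servers", "broker1:9092");
        prod.put("acks", "all");
        prod.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        prod.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Properties cons = new Properties();
        cons.put("bootstrap.servers", "broker1:9092");
        cons.put("group.id", "probe-" + System.currentTimeMillis()); // fresh group each run
        cons.put("auto.offset.reset", "earliest");
        cons.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cons.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        int count = 100;
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(prod);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cons)) {
            // Kafka only guarantees ordering within a partition, so the probe
            // checks sequence numbers against a single-partition topic.
            for (int seq = 0; seq < count; seq++) {
                producer.send(new ProducerRecord<>("kafka-probe",
                        seq + ":" + System.currentTimeMillis()));
            }
            producer.flush();

            consumer.subscribe(Collections.singletonList("kafka-probe"));
            long expected = 0;
            while (expected < count) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    String[] parts = r.value().split(":");
                    long seq = Long.parseLong(parts[0]);
                    long latencyMs = System.currentTimeMillis() - Long.parseLong(parts[1]);
                    if (seq != expected) {
                        System.out.printf("ordering violation: expected %d, got %d%n",
                                expected, seq);
                    }
                    System.out.printf("seq=%d end-to-end latency=%d ms%n", seq, latencyMs);
                    expected = seq + 1;
                }
            }
        }
    }
}
```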
Failure Testing (Simoorg)
LinkedIn’s Simoorg framework injects low-level faults, such as disk write failures, unexpected shutdowns, and process kills, to assess Kafka’s resilience under adverse conditions.
Application Latency Monitoring (Burrow)
The Burrow tool monitors consumer lag, providing insight into application health and processing delays.
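Burrow is open source (github.com/linkedin/Burrow) and reports lag over an HTTP API. Exact paths vary by Burrow version; a request against a recent release looks roughly like this, with the host, cluster, and group names all illustrative:

```sh
curl http://burrow.example.com:8000/v3/kafka/local/consumer/page-view-counter/lag
```

The response is JSON describing per-partition offset lag plus an overall status evaluation, so alerting can key off consumer health rather than raw offsets.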
Cluster Balancing Strategies
LinkedIn maintains cluster equilibrium through:
Rack awareness: Placing the leader and follower replicas of each partition in different racks, so the failure of a single rack cannot take out every copy.
Fair partition distribution: Spreading topic partitions across brokers to maximize bandwidth.
Resource capacity checks: Preventing disk and network saturation on overloaded nodes.
SRE‑driven rebalancing: Periodic partition migrations to sustain balance (see the reassignment example after this list).
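In open-source Kafka, such migrations are typically done with the stock reassignment tool; the topic name and broker ids below are illustrative:

```sh
# reassignment.json, moving partition 0 of a hypothetical topic onto brokers 3 and 4:
#   {"version":1,"partitions":[{"topic":"page-views","partition":0,"replicas":[3,4]}]}

bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
  --reassignment-json-file reassignment.json --execute
bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
  --reassignment-json-file reassignment.json --verify
```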
Kafka in Other Data Systems
Kafka backs up LinkedIn’s Espresso NoSQL store, supports asynchronous uploads to Venice, and serves as the event source for Apache Samza’s real‑time stream processing, with enhancements to log compaction.
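Log compaction retains only the most recent value for each key, which is what lets a Kafka topic serve as a changelog or backup for a keyed store. A compacted topic is created with standard tooling; the topic name here is a hypothetical example:

```sh
bin/kafka-topics.sh --zookeeper zk1:2181 --create \
  --topic espresso-changelog --partitions 8 --replication-factor 3 \
  --config cleanup.policy=compact
```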
LinkedIn’s Kafka Ecosystem
Beyond the core broker and client libraries, LinkedIn provides internal services such as:
Non‑Java client support via a redesigned REST interface.
A schema registry that automatically registers and serves message schemas for consumers.
Cost accounting through an audit topic that records per‑application usage.
An audit system exposing usage metrics via REST for downstream analytics.
Large‑message handling: Automatic splitting and reassembly of messages that exceed Kafka’s default maximum message size of roughly 1 MB (see the sketch after this list).
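LinkedIn later open-sourced a client with large-message support (li-apache-kafka-clients). The sketch below is not that library, just a minimal illustration of the segmentation idea: each chunk carries a (messageId, index, total) header, and all segments of one message must be sent to the same partition so they can be collected back in order:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class LargeMessageSegmenter {
    static final int MAX_SEGMENT_BYTES = 900 * 1024; // stay under the ~1 MB default
    static final int HEADER_BYTES = 16 + 4 + 4;      // 128-bit id + index + total

    static List<byte[]> split(byte[] payload) {
        int chunk = MAX_SEGMENT_BYTES - HEADER_BYTES;
        int total = (payload.length + chunk - 1) / chunk;
        UUID id = UUID.randomUUID();
        List<byte[]> segments = new ArrayList<>(total);
        for (int i = 0; i < total; i++) {
            int from = i * chunk;
            int len = Math.min(chunk, payload.length - from);
            ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES + len);
            buf.putLong(id.getMostSignificantBits());  // message id, high bits
            buf.putLong(id.getLeastSignificantBits()); // message id, low bits
            buf.putInt(i);                             // segment index
            buf.putInt(total);                         // segment count
            buf.put(payload, from, len);
            segments.add(buf.array());
        }
        return segments;
    }

    static byte[] reassemble(List<byte[]> segments) {
        // Assumes a complete set of segments already grouped by message id;
        // a real implementation buffers partial messages per id until complete.
        ByteBuffer out = ByteBuffer.allocate(
                segments.stream().mapToInt(s -> s.length - HEADER_BYTES).sum());
        segments.stream()
                .sorted((a, b) -> Integer.compare(
                        ByteBuffer.wrap(a).getInt(16), ByteBuffer.wrap(b).getInt(16)))
                .forEach(s -> out.put(s, HEADER_BYTES, s.length - HEADER_BYTES));
        return out.array();
    }
}
```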
These practices have inspired many other companies, including Yahoo!, Twitter, Netflix, and Uber, to adopt similar Kafka strategies for data analytics and stream processing.