How LinkedIn Scaled Kafka to Process Over 1 Trillion Messages Daily
Since 2011, LinkedIn has expanded its Kafka deployment from handling billions to over a trillion messages per day, focusing on quotas, a new ZooKeeper‑free consumer, reliability enhancements, security, monitoring frameworks, fault‑injection testing, cluster balancing, and ecosystem integrations, offering valuable lessons for large‑scale streaming systems.
LinkedIn began large‑scale use of Apache Kafka in July 2011, processing about 1 billion messages per day. By 2012 the volume grew to 200 billion daily, reaching 2 trillion per day in July 2013, and later surpassing 1 trillion messages per day with peak rates of over 4.5 million messages per second and weekly throughput of 1.34 PB. Over four years the system grew 1,200×.
Key Focus Areas on Kafka
Quotas
Because many applications share the same Kafka cluster, abusive usage can impact performance and SLA for others. Even legitimate workloads, such as re‑processing an entire database, can saturate network and disks. LinkedIn introduced a feature that throttles producers and consumers when their byte‑rate exceeds a threshold, with a whitelist allowing higher bandwidth for selected users.
New Consumer Development
The original Kafka consumer depended on ZooKeeper, causing security and split‑brain issues. In collaboration with Confluent and the open‑source community, LinkedIn helped develop a new consumer that relies only on the Kafka broker, eliminating the ZooKeeper dependency. The new client reconciles low‑level and high‑level consumer semantics, simplifying error handling, retries, and other responsibilities.
Reliability and Availability Enhancements
LinkedIn contributed several reliability improvements:
Loss‑less MirrorMaker : Modified MirrorMaker to acknowledge messages only after they are successfully written to the target topic, preventing data loss during upgrades or restarts.
Replica‑Lag Monitoring : Switched replica health checks from byte‑lag thresholds to time‑based thresholds, reducing false‑positive unhealthy states for large or growing messages.
New Producer : Implemented a pipelined producer to boost throughput; the feature is still being refined.
Safe Topic Deletion : Fixed numerous bugs so that topics can be deleted safely in upcoming Kafka releases.
Security
LinkedIn participated in adding encryption, authentication, and authorization features to Kafka, aiming to enable encryption in 2015 and additional security capabilities in 2016.
Kafka Monitoring Framework
LinkedIn built a standardized monitoring framework that runs test applications publishing and consuming topics to verify basic functionality (ordering, delivery guarantees, data integrity) and end‑to‑end latency. The framework also validates new Kafka releases for production readiness.
Fault Injection Testing
After acquiring a new Kafka version, LinkedIn runs fault‑injection tests using the "Simoorg" framework, which injects low‑level failures such as disk write errors, shutdowns, and process kills to assess version robustness.
Application Latency Monitoring
The "Burrow" tool monitors consumer lag, providing insight into application health.
Maintaining Cluster Balance
LinkedIn ensures cluster balance across several dimensions:
Rack Awareness : Partitions and their replicas are placed in different racks to avoid simultaneous loss during rack failures.
Even Partition Distribution : Topics are spread evenly across brokers to maximize bandwidth.
Disk and Network Capacity : Prevents hot‑spotting of partitions that could exhaust disk or network resources.
Site Reliability Engineers regularly rebalance partitions and have prototyped smarter placement algorithms.
Kafka Ecosystem at LinkedIn
Beyond the core broker, client, and MirrorMaker, LinkedIn runs internal services:
Non‑Java Clients : A redesigned REST proxy ensures reliable delivery for non‑Java applications.
Schema Registry : A mature schema service registers message schemas, enabling automatic deserialization on the consumer side.
Cost Accounting : An audit topic records per‑application Kafka usage for downstream analysis.
Audit System : Offline jobs report hourly and daily event counts, and expose usage metrics via REST for Hadoop and other services.
Large Message Support : While the default size limit is 1 MB, LinkedIn provides APIs to split and reassemble larger payloads transparently.
Kafka also backs up the Espresso NoSQL store, feeds asynchronous uploads to Venice, and serves as the event source and persistent storage for Apache Samza, with LinkedIn contributing bug fixes and log‑compression enhancements.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
