
LinkedIn’s Kafka at Scale: Architecture, Optimizations, and Operational Practices

The article details how LinkedIn has scaled Kafka from handling billions to trillions of messages daily, describing quota enforcement, a ZooKeeper‑free consumer, reliability enhancements, security plans, monitoring frameworks, fault‑injection testing, cluster balancing, and integration with other internal data systems.


Kafka is a core pillar of LinkedIn’s data infrastructure. LinkedIn engineers have published a series of articles on Kafka’s current state, future direction, large‑scale operation, and alignment with the company’s open‑source strategy. Senior engineering manager Kartik Paramasivam recently shared the team’s usage and optimization experience.

LinkedIn began large‑scale Kafka usage in July 2011, processing about 1 billion messages per day. By 2012 this had grown to 20 billion, and by July 2013 to 200 billion daily messages. Today the platform handles over 1 trillion messages per day, peaking at 4.5 million messages per second and moving 1.34 PB of data per week, a roughly 1,200‑fold growth over four years.

With this growth, LinkedIn has focused on foundational concerns such as Kafka’s reliability, cost, security, and availability, investing in the areas described below.

Quotas – Because multiple applications share a Kafka cluster, abusive usage can impact other tenants. LinkedIn introduced a quota system that throttles producers and consumers when byte‑rate thresholds are exceeded, with a whitelist for higher‑bandwidth users, a feature slated for the next Kafka release.
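At its core, byte‑rate throttling of this kind can be modeled as a token bucket kept per client: requests draw tokens proportional to their size, and a client that overdraws is delayed until the deficit is repaid at the configured rate. A minimal sketch (the class and parameter names here are invented for illustration and are not Kafka’s actual broker implementation):

```python
import time

class ByteRateQuota:
    """Token-bucket sketch of a per-client byte-rate quota.

    Illustrative only; Kafka's real quota enforcement lives in the broker
    and is configured per client ID, not via a class like this.
    """

    def __init__(self, bytes_per_sec):
        self.rate = float(bytes_per_sec)
        self.tokens = self.rate          # allow up to one second of burst
        self.last = time.monotonic()

    def throttle_delay(self, nbytes):
        """Seconds to delay this request; 0.0 means within quota."""
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at one second of burst.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= nbytes
        if self.tokens >= 0:
            return 0.0
        # Over quota: delay long enough to pay the deficit back at the rate.
        return -self.tokens / self.rate

quota = ByteRateQuota(bytes_per_sec=1_000_000)   # 1 MB/s for this client
print(quota.throttle_delay(500_000))             # within quota, prints 0.0
```

Delaying rather than rejecting the request is what keeps an over‑quota tenant from impacting others while still letting its traffic through at a bounded rate.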

New Consumer Development – The traditional Kafka consumer depends on ZooKeeper, which introduces security and split‑brain issues. LinkedIn, together with Confluent and the open‑source community, built a new consumer that talks directly to Kafka brokers, eliminating the ZooKeeper dependency. The new consumer also unifies the capabilities of the existing low‑level and high‑level consumer APIs.

Reliability and Availability Enhancements – LinkedIn has improved several components:

MirrorMaker : Modified to guarantee lossless data transfer: a message is marked as consumed only after it has been successfully delivered to the target topic.

Replica Lag Monitoring : Switched from byte‑based lag thresholds to time‑based rules to better detect unhealthy replicas.

New Producer : Implemented a pipelined producer to boost performance, currently being refined.

Safe Topic Deletion : Fixed numerous bugs so that topics can be deleted safely in the next major Kafka version.
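The time‑based replica lag rule above is worth making concrete: instead of asking "how many bytes behind is this follower?", the broker asks "how long has it been since this follower was fully caught up?". A follower that is momentarily far behind during a traffic spike but catching up stays in sync; one that stalls gets evicted. A sketch of the rule (the function and parameter names are illustrative; the corresponding broker setting in Kafka is `replica.lag.time.max.ms`):

```python
# A follower is healthy if it has caught up to the leader's log end offset
# at some point within the last `replica_lag_time_max_s` seconds, regardless
# of how many bytes behind it happens to be at this instant.

def in_sync(last_caught_up_ts, now, replica_lag_time_max_s=10.0):
    return (now - last_caught_up_ts) <= replica_lag_time_max_s

now = 1000.0
print(in_sync(last_caught_up_ts=995.0, now=now))   # caught up 5 s ago: True
print(in_sync(last_caught_up_ts=980.0, now=now))   # stalled for 20 s: False
```

The byte‑based threshold this replaced had no single good value: too low and healthy replicas flapped out of the ISR under bursty traffic, too high and genuinely stuck replicas went undetected.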

Security – Encryption, authentication, and authorization features are being added, with encryption expected in 2015 and additional security capabilities in 2016.

Kafka Monitoring Framework – LinkedIn is building a standardized test suite that publishes and consumes topics to verify ordering, delivery guarantees, data integrity, and end‑to‑end latency, as well as to validate new Kafka releases for production readiness.
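A monitoring suite of this shape reduces to producing tagged probe messages and checking invariants on the consuming side. A minimal sketch, with an invented probe format and checks (this is not LinkedIn’s actual framework, and here the producer is short‑circuited to the consumer instead of round‑tripping through a real topic):

```python
import json
import time

def make_probe(seq):
    # Each probe carries a sequence number and its send timestamp.
    return json.dumps({"seq": seq, "sent_at": time.time()})

def check_stream(received, expected_count, max_latency_s):
    """Verify ordering, completeness, and end-to-end latency of probes."""
    seqs, latencies = [], []
    now = time.time()
    for raw in received:
        msg = json.loads(raw)
        seqs.append(msg["seq"])
        latencies.append(now - msg["sent_at"])
    return {
        "ordered": seqs == sorted(seqs),
        "complete": len(set(seqs)) == expected_count,
        "latency_ok": bool(latencies) and max(latencies) <= max_latency_s,
    }

probes = [make_probe(i) for i in range(5)]
print(check_stream(probes, expected_count=5, max_latency_s=1.0))
```

Running the same checks against a release candidate’s cluster is what turns this from a production monitor into a release‑validation tool.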

Fault‑Injection Testing – The “Simoorg” framework injects low‑level failures (disk write errors, shutdowns, process kills) to assess the robustness of new Kafka versions.

Application Latency Monitoring – The Burrow tool monitors consumer lag to gauge application health.
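Burrow’s key idea is that raw lag is a poor health signal: a consumer with large but shrinking lag is fine, while one with small but frozen offsets is not. It therefore evaluates a window of offset samples rather than a single threshold. A simplified sketch in that spirit (the status rules and names here are a loose approximation, not Burrow’s actual evaluation logic):

```python
# Each sample is (committed_offset, log_end_offset), oldest first.
# Lag alone is ignored; what matters is whether the consumer is making
# progress and whether its lag is trending up.

def consumer_status(samples):
    lags = [end - committed for committed, end in samples]
    committed = [c for c, _ in samples]
    if all(lag == 0 for lag in lags):
        return "OK"
    if committed[-1] == committed[0] and lags[-1] > 0:
        return "STALLED"          # offsets frozen while lag remains
    if lags[-1] > lags[0]:
        return "WARNING"          # committing, but falling further behind
    return "OK"                   # lagging but catching up

print(consumer_status([(100, 100), (110, 110)]))   # OK
print(consumer_status([(100, 200), (100, 300)]))   # STALLED
print(consumer_status([(100, 150), (120, 200)]))   # WARNING
```

This is why a windowed evaluation can page on a wedged consumer within minutes while staying quiet through a routine traffic spike.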

Cluster Balancing – LinkedIn ensures balanced placement of partitions across racks, distributes topic partitions evenly among brokers, and avoids disk or network saturation on any single node. Site Reliability Engineers periodically rebalance partitions using tooling they have designed and prototyped in‑house.
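The core objective of such a rebalance is to keep any one broker from dominating disk or network load. One common heuristic is greedy bin packing: place each partition, largest first, on the currently least‑loaded broker. A sketch of just that idea (LinkedIn’s actual tooling also handles rack awareness and leader distribution, which this ignores):

```python
import heapq

def balance(partition_sizes, brokers):
    """Greedy assignment: largest partition first, onto the least-loaded broker.

    partition_sizes: dict of partition name -> size (bytes, or any load unit).
    Returns a dict of partition name -> broker id.
    """
    heap = [(0, b) for b in brokers]        # (current load, broker id)
    heapq.heapify(heap)
    assignment = {}
    for part, size in sorted(partition_sizes.items(), key=lambda kv: -kv[1]):
        load, broker = heapq.heappop(heap)
        assignment[part] = broker
        heapq.heappush(heap, (load + size, broker))
    return assignment

sizes = {"t0-0": 40, "t0-1": 30, "t1-0": 20, "t1-1": 10}
print(balance(sizes, brokers=["b1", "b2"]))   # each broker ends at 50 units
```

Sorting largest‑first matters: placing big partitions early leaves the small ones to smooth out residual imbalance.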

Integration with Other Data Systems – Kafka backs up LinkedIn’s Espresso NoSQL store, feeds asynchronous data to Venice, and serves as the event source for Apache Samza, which also uses Kafka for state persistence. LinkedIn has contributed fixes and enhancements to Kafka’s log‑compaction feature.

LinkedIn’s Kafka Ecosystem – Beyond the broker and client libraries, LinkedIn provides internal services such as a REST interface for non‑Java clients, a schema registry for automatic deserialization, cost‑accounting via an audit topic, an audit system exposing usage metrics to downstream jobs, and a large‑message handling feature that splits and reassembles messages exceeding the 1 MB limit.
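The large‑message feature in that list follows a familiar pattern: split an oversized payload into numbered segments that share a message ID, publish each segment as an ordinary Kafka message, and reassemble on the consumer once all segments arrive. A sketch of the pattern (the segment field names and 1 MB figure are illustrative; LinkedIn’s actual wire format may differ):

```python
import uuid

MAX_SEGMENT = 1_000_000   # Kafka's default per-message cap is roughly 1 MB

def split(payload, max_segment=MAX_SEGMENT):
    """Split bytes into ordered segments sharing one message ID."""
    msg_id = uuid.uuid4().hex
    chunks = [payload[i:i + max_segment]
              for i in range(0, len(payload), max_segment)]
    return [{"id": msg_id, "seq": i, "total": len(chunks), "data": c}
            for i, c in enumerate(chunks)]

def reassemble(segments):
    """Rebuild the payload; raises if any segment is missing."""
    segments = sorted(segments, key=lambda s: s["seq"])
    if len(segments) != segments[0]["total"]:
        raise ValueError("missing segments")
    return b"".join(s["data"] for s in segments)

big = b"x" * 2_500_000
segs = split(big)
print(len(segs))                        # 3 segments
assert reassemble(segs) == big
```

Keying all segments of one message to the same partition (e.g. by `id`) is what preserves ordering so the consumer sees segments contiguously.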

These experiences are intended to help other companies, such as Yahoo!, Twitter, Netflix, and Uber, adopt and operate Kafka at massive scale.

Tags: monitoring, big data, scalability, streaming, Kafka, reliability, LinkedIn
Written by

Art of Distributed System Architecture Design

Introductions to large-scale distributed system architectures; insights and knowledge sharing on large-scale internet system architecture; front-end web architecture overviews; practical tips and experiences with PHP, JavaScript, Erlang, C/C++ and other languages in large-scale internet system development.
