Why LinkedIn Dropped Kafka for Northguard – A Deep Dive into Its Architecture
LinkedIn, the creator of Kafka, has largely replaced Kafka with a new log storage system called Northguard. Its design mirrors Apache Pulsar's, with storage-compute separation, log striping, and a multi-layer data model, and it offers superior scalability, operability, consistency, and durability for massive data streams.
In the message‑streaming domain, LinkedIn—originally the creator of Kafka—has quietly shifted away from Kafka and built a new system called Northguard. This change marks a significant adjustment in LinkedIn’s technical strategy.
Why a new solution was needed
LinkedIn now serves over 1.2 billion members, processing more than 320 trillion records per day across 150 clusters and 10,000 machines. This rapid growth created challenges in scalability, operability, availability, consistency, and durability, prompting the move to a more extensible log storage system.
Introducing Northguard
Northguard is a log storage system focused on high scalability and ease of operation. It achieves these goals through data-and-metadata sharding, minimal global state, and a decentralized cluster-membership protocol. Log striping automatically balances load across the cluster.
In practice, Northguard runs as a broker cluster where brokers only interact with connected clients and other brokers, forming a closed, efficient system.
Data model
Clients produce and consume records, each consisting of a key, a value, and user-defined headers. Records are stored in segments: ordered collections that are sealed once they reach a size or time limit. Segments are grouped into ranges, which map to contiguous key-space intervals. Multiple ranges form a topic, which can be split or merged. This layered design mirrors Apache Pulsar's architecture.
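The layered model above can be sketched in a few dataclasses. This is an illustrative reconstruction, not LinkedIn's actual API: the class names, the sealing threshold, and the hash-based routing are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    key: bytes
    value: bytes
    headers: dict  # user-defined header fields

@dataclass
class Segment:
    records: list = field(default_factory=list)
    sealed: bool = False
    max_records: int = 4  # tiny size limit, for demonstration only

    def append(self, record: Record) -> None:
        if self.sealed:
            raise RuntimeError("cannot append to a sealed segment")
        self.records.append(record)
        if len(self.records) >= self.max_records:
            self.sealed = True  # sealed segments become immutable

@dataclass
class Range:
    lo: int  # inclusive start of the key-space interval
    hi: int  # exclusive end
    segments: list = field(default_factory=list)

    def active_segment(self) -> Segment:
        # open a fresh segment whenever the current one is sealed
        if not self.segments or self.segments[-1].sealed:
            self.segments.append(Segment())
        return self.segments[-1]

@dataclass
class Topic:
    name: str
    ranges: list = field(default_factory=list)

    def route(self, record: Record) -> Range:
        # map the hashed key onto a contiguous key-space interval
        h = hash(record.key) % 100
        return next(r for r in self.ranges if r.lo <= h < r.hi)

# two ranges covering the key space [0, 100)
topic = Topic("events", ranges=[Range(0, 50), Range(50, 100)])
for i in range(10):
    rec = Record(key=str(i).encode(), value=b"v", headers={})
    topic.route(rec).active_segment().append(rec)
```

Splitting or merging a topic then amounts to adjusting the `(lo, hi)` intervals of its ranges, without touching already-sealed segments.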
Metadata model
Northguard stores metadata for topics, ranges, and segments in a set of virtual nodes (vnodes) backed by a Raft‑based replicated state machine. Each vnode’s leader (the coordinator) manages the metadata lifecycle, including creation, sealing, deletion, and replica placement based on administrator‑defined policies.
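How metadata keys are sharded across vnodes can be illustrated with simple consistent hashing. The hashing scheme and vnode count below are assumptions for illustration; Northguard's actual placement logic is not public at this level of detail.

```python
import hashlib

NUM_VNODES = 8  # assumed vnode count, for illustration

def vnode_for(metadata_key: str) -> int:
    """Deterministically map a topic/range/segment id to a vnode."""
    digest = hashlib.sha256(metadata_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_VNODES

# Each vnode is backed by its own Raft replicated state machine; the
# elected leader (the coordinator) owns the metadata lifecycle for all
# keys that hash to that vnode.
v = vnode_for("topics/page-views")  # same key -> same vnode, every time
```

Sharding metadata this way keeps any single Raft group small: no vnode has to hold the global view, which is what makes the "minimal global state" goal achievable.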
Log striping
Northguard avoids resource skew by striping logs into small blocks, each with its own replica set. New brokers automatically receive new segments, eliminating the need for costly segment migration and providing self‑balancing load distribution.
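The self-balancing effect of striping can be shown with a toy placement model. This is an assumption-level sketch, not Northguard's actual placement policy: each new segment's replica set is drawn from the least-loaded brokers, so a freshly added broker absorbs new segments without any data migration.

```python
def place_segments(brokers, num_segments, replicas=3):
    """Greedy least-loaded placement of segment replica sets."""
    load = {b: 0 for b in brokers}
    placements = []
    for _ in range(num_segments):
        # pick the `replicas` brokers currently holding the fewest segments
        chosen = sorted(load, key=lambda b: load[b])[:replicas]
        for b in chosen:
            load[b] += 1
        placements.append(chosen)
    return load, placements

load, _ = place_segments(["b1", "b2", "b3", "b4"], 100)
# load ends up near-uniform across the four brokers

# adding "b5" later only affects *new* segments -- no migration needed
load2, _ = place_segments(["b1", "b2", "b3", "b4", "b5"], 100)
```

Because segments are small and short-lived relative to whole partitions, new capacity is absorbed quickly; this contrasts with Kafka-style partition reassignment, which must physically copy existing data.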
Cluster state and member management
Northguard uses the SWIM protocol for scalable group membership, providing fault detection and gossip‑based dissemination of cluster state, including broker attributes and vnode leadership information.
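The core of SWIM-style failure detection is a probe sequence: ping a member directly, and if that fails, ask k other members to probe it on your behalf before declaring it suspect. The toy sketch below simplifies heavily (real SWIM adds piggybacked gossip, incarnation numbers, and suspicion timeouts).

```python
import random

def probe(members, target, reachable, k=3):
    """Return True if `target` is considered alive by the prober."""
    if reachable(target):  # direct ping/ack succeeded
        return True
    helpers = [m for m in members if m != target]
    random.shuffle(helpers)
    for helper in helpers[:k]:  # ask k peers to ping-req on our behalf
        if reachable(helper) and reachable(target):
            return True
    return False  # no ack from any path: mark target as suspect

members = ["b1", "b2", "b3", "b4", "b5"]
down = {"b4"}  # simulate one failed broker
alive = probe(members, "b4", reachable=lambda m: m not in down)
# b4 answers neither direct nor indirect probes -> suspect
```

The indirect probes are what make SWIM robust to one-off network glitches between a single pair of brokers: only a member unreachable from several vantage points gets gossiped as suspect.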
Protocols
Metadata requests follow a unary model (e.g., CreateTopicRequest, DeleteTopicRequest). Production, consumption, and replication use session‑based streaming with flow control windows, allowing high‑throughput, low‑latency data transfer.
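Session-based streaming with a flow-control window can be modeled as credit-based sending. The window mechanics below are illustrative, not Northguard's wire protocol: the sender may only have `window` records in flight, and each ack from the receiver returns one credit.

```python
class StreamSession:
    """Toy credit-based flow control for a streaming session."""

    def __init__(self, window: int):
        self.credits = window
        self.sent = []

    def try_send(self, record) -> bool:
        if self.credits == 0:
            return False  # window exhausted: sender must wait for acks
        self.credits -= 1
        self.sent.append(record)
        return True

    def on_ack(self) -> None:
        self.credits += 1  # receiver processed one record

session = StreamSession(window=2)
assert session.try_send("r1") and session.try_send("r2")
assert not session.try_send("r3")  # blocked until an ack arrives
session.on_ack()
assert session.try_send("r3")      # credit returned, send proceeds
```

Keeping multiple records in flight per session is what distinguishes this from the unary request/response model used for metadata: throughput is bounded by the window size rather than by per-request round trips.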
Segment storage
Segments are stored using a pluggable "fps" storage layer with write‑ahead logging, direct I/O, and sparse indexes in RocksDB. Batching, time‑based flushing, and direct I/O improve durability and avoid double‑buffering issues.
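A sparse index trades index size for a short scan: only every Nth record's offset is indexed, so a lookup seeks to the nearest indexed entry and reads forward. The sketch below is illustrative (Northguard keeps its sparse indexes in RocksDB; here a sorted in-memory list stands in).

```python
import bisect

INDEX_INTERVAL = 4  # index every 4th record (illustrative choice)

def build_sparse_index(record_offsets):
    # (record_number, byte_offset) for every INDEX_INTERVAL-th record
    return [(i, off) for i, off in enumerate(record_offsets)
            if i % INDEX_INTERVAL == 0]

def locate(index, record_number):
    """Byte offset to start scanning from to reach `record_number`."""
    keys = [entry[0] for entry in index]
    pos = bisect.bisect_right(keys, record_number) - 1
    return index[pos][1]

offsets = [i * 100 for i in range(10)]  # fake byte offsets of 10 records
index = build_sparse_index(offsets)     # entries for records 0, 4, 8
start = locate(index, 6)                # seek to record 4, scan forward
```

The same trade-off explains the write path: batched appends go through the write-ahead log with direct I/O, so the OS page cache never double-buffers data the storage layer already manages itself.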
Key takeaways
Northguard achieves high scalability through storage‑compute separation and log striping.
Metadata is managed by a Raft‑based vnode system, providing strong consistency.
SWIM‑based membership and session‑based streaming protocols enable low‑latency, high‑throughput operations.
Deep comparison: Northguard vs. Apache Pulsar
The architectural concepts of Northguard—separated storage and compute, segment‑based storage, and log striping—are strikingly similar to Apache Pulsar’s design, which has been proven in large‑scale deployments since 2016.
While LinkedIn could have adopted Pulsar directly, its historical investment in Kafka tooling, internal requirements, and a degree of "Not Invented Here" thinking likely motivated a custom solution that nevertheless validates Pulsar's design principles.
Conclusion
LinkedIn’s Northguard indirectly confirms the correctness of Apache Pulsar’s architecture, demonstrating that the challenges Pulsar was built to solve—scalability, operability, and cloud‑native deployment—are exactly the problems faced by large‑scale messaging platforms today.
Wukong Talks Architecture
Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column and the open-source project "Spring Cloud in Practice PassJava", and independent developer of a PMP practice-quiz mini-program.