
Why LinkedIn Dropped Kafka for Northguard – A Deep Dive into Its Architecture

LinkedIn, the creator of Kafka, has largely replaced Kafka with a new log storage system called Northguard. Its design mirrors Apache Pulsar's, with storage-compute separation, log striping, and a multi-layer data model, and it offers superior scalability, operability, consistency, and durability for massive data streams.


In the message‑streaming domain, LinkedIn—originally the creator of Kafka—has quietly shifted away from Kafka and built a new system called Northguard. This change marks a significant adjustment in LinkedIn’s technical strategy.

Why a new solution was needed

LinkedIn now serves over 1.2 billion members, processing more than 320 trillion records per day across 150 clusters and 10,000 machines. This rapid growth created challenges in scalability, operability, availability, consistency, and durability, prompting the move to a more extensible log storage system.

Introducing Northguard

Northguard is a log storage system focused on high scalability and ease of operation. It achieves these goals through data-and-metadata sharding, a minimal global state, and a decentralized cluster-membership protocol. Log striping automatically balances load across the cluster.

In practice, Northguard runs as a broker cluster where brokers only interact with connected clients and other brokers, forming a closed, efficient system.

Data model

Clients produce and consume records (key, value, and user-defined headers). Records are stored in segments, ordered collections that are sealed once they reach a size or time limit. Segments are grouped into ranges, which map to contiguous key-space intervals. Multiple ranges form a topic, which can be split or merged. The design mirrors Apache Pulsar's layered architecture.

Figure: record structure
Figure: a segment containing multiple records
Figure: a range composed of segments
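The record → segment → range → topic hierarchy can be sketched in code. This is an illustrative model only, not Northguard's actual API; the class names, the `max_records` seal trigger (standing in for real size/time limits), and the hash-based key-space routing are all assumptions for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    key: bytes
    value: bytes
    headers: dict = field(default_factory=dict)  # user-defined headers

@dataclass
class Segment:
    records: list = field(default_factory=list)
    sealed: bool = False
    max_records: int = 3  # stand-in for real size/time limits

    def append(self, record: Record) -> bool:
        if self.sealed:
            return False
        self.records.append(record)
        if len(self.records) >= self.max_records:
            self.sealed = True  # sealed segments become immutable
        return True

@dataclass
class Range:
    key_lo: int  # inclusive start of this range's key-space interval
    key_hi: int  # exclusive end
    segments: list = field(default_factory=list)

    def active_segment(self) -> Segment:
        # open a fresh segment once the current one is sealed
        if not self.segments or self.segments[-1].sealed:
            self.segments.append(Segment())
        return self.segments[-1]

class Topic:
    """A topic is a set of ranges covering the whole key space."""
    def __init__(self, name: str, num_ranges: int, key_space: int = 2**16):
        step = key_space // num_ranges
        self.name = name
        self.key_space = key_space
        self.ranges = [Range(i * step, (i + 1) * step) for i in range(num_ranges)]

    def produce(self, record: Record) -> Range:
        # route the record to the range owning its key-space slot
        slot = hash(record.key) % self.key_space
        rng = next(r for r in self.ranges if r.key_lo <= slot < r.key_hi)
        rng.active_segment().append(record)
        return rng
```

Note how sealing falls naturally out of the model: producers only ever append to the last, unsealed segment of a range, which is what makes range splits and merges cheap at the topic level.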

Metadata model

Northguard stores metadata for topics, ranges, and segments in a set of virtual nodes (vnodes) backed by a Raft‑based replicated state machine. Each vnode’s leader (the coordinator) manages the metadata lifecycle, including creation, sealing, deletion, and replica placement based on administrator‑defined policies.

Figure: Raft group for a vnode
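The vnode idea can be sketched as follows: metadata keys hash onto a fixed set of vnodes, and each vnode is backed by a Raft group whose leader acts as the coordinator. This is a rough sketch under assumptions, not Northguard's implementation; `NUM_VNODES`, the SHA-256 key hashing, and the round-robin replica placement are invented for illustration.

```python
import hashlib

NUM_VNODES = 8

def vnode_for(metadata_key: str) -> int:
    # deterministic hash of the metadata key onto a vnode
    digest = hashlib.sha256(metadata_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_VNODES

class RaftGroup:
    def __init__(self, vnode_id: int, members: list):
        self.vnode_id = vnode_id
        self.members = members    # broker ids replicating this vnode
        self.leader = members[0]  # the coordinator for this metadata slice

    def elect(self, new_leader: str):
        assert new_leader in self.members
        self.leader = new_leader

# Place each vnode's Raft group on 3 of 6 brokers, round-robin.
brokers = [f"broker-{i}" for i in range(6)]
groups = {
    v: RaftGroup(v, [brokers[(v + j) % len(brokers)] for j in range(3)])
    for v in range(NUM_VNODES)
}

def coordinator_for(metadata_key: str) -> str:
    """Which broker currently coordinates this key's metadata lifecycle."""
    return groups[vnode_for(metadata_key)].leader
```

The point of the sketch: because every metadata key deterministically maps to one vnode, there is no single global metadata leader, and a coordinator failure only triggers a Raft election within that one vnode's group.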

Log striping

Northguard avoids resource skew by striping logs into small blocks, each with its own replica set. New brokers automatically receive new segments, eliminating the need for costly segment migration and providing self‑balancing load distribution.

Figure: cluster after adding a new broker
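The self-balancing effect can be demonstrated with a small simulation. This is an illustrative placement policy (least-loaded brokers win each new replica set), not Northguard's actual algorithm; the replica count of 3 and the broker names are assumptions.

```python
import heapq
from collections import Counter

def place_segment(load: Counter, brokers: list, replicas: int = 3) -> list:
    # each new segment's replica set goes to the least-loaded brokers
    chosen = heapq.nsmallest(replicas, brokers, key=lambda b: load[b])
    for b in chosen:
        load[b] += 1
    return chosen

load = Counter()
brokers = [f"b{i}" for i in range(4)]
for _ in range(40):
    place_segment(load, brokers)   # 40 segments x 3 replicas = 30 per broker

brokers.append("b4")               # add a new, empty broker
for _ in range(12):
    place_segment(load, brokers)   # b4 attracts every new replica set
```

After the new broker joins, every subsequent segment placement includes it, so load converges without moving a single existing segment — which is exactly the migration cost the article says striping avoids.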

Cluster state and member management

Northguard uses the SWIM protocol for scalable group membership, providing fault detection and gossip‑based dissemination of cluster state, including broker attributes and vnode leadership information.

Figure: SWIM protocol in operation
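The core SWIM failure-detection step can be sketched in a few lines: probe a peer directly, and on failure ask k other members to probe it indirectly before marking it suspect. This is a minimal sketch of the protocol's probe phase, not Northguard's implementation; the `Member` class and `k = 3` are assumptions, and real SWIM adds timeouts, incarnation numbers, and gossip piggybacking.

```python
import random

class Member:
    def __init__(self, name: str, alive: bool = True):
        self.name = name
        self.alive = alive

    def ping(self, target: "Member") -> bool:
        return target.alive  # direct probe (network elided)

def probe(members: list, prober: Member, target: Member, k: int = 3) -> str:
    if prober.ping(target):
        return "alive"
    # direct ping failed: ask k other members to probe indirectly,
    # so one flaky link does not condemn a healthy broker
    helpers = random.sample(
        [m for m in members if m not in (prober, target)],
        k=min(k, len(members) - 2),
    )
    if any(h.ping(target) for h in helpers):
        return "alive"
    return "suspect"  # this verdict is then disseminated via gossip
```

The indirect-probe round is what makes SWIM's false-positive rate low while keeping per-member network load constant regardless of cluster size.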

Protocols

Metadata requests follow a unary model (e.g., CreateTopicRequest, DeleteTopicRequest). Production, consumption, and replication use session‑based streaming with flow control windows, allowing high‑throughput, low‑latency data transfer.

Figure: producer stream with acknowledgments
Figure: consumer stream receiving records
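A flow-control window on a session stream can be sketched as a bounded set of unacknowledged batches: sends block (here, return `None`) when the window is full, and acks free slots. This is an illustrative model, not Northguard's wire protocol; the window size, in-order acks, and method names are assumptions.

```python
from collections import deque
from typing import Optional

class ProducerStream:
    def __init__(self, window: int = 4):
        self.window = window
        self.in_flight = deque()  # sequence numbers awaiting ack
        self.next_seq = 0

    def try_send(self, batch) -> Optional[int]:
        if len(self.in_flight) >= self.window:
            return None  # window full: caller must wait for an ack
        seq = self.next_seq
        self.next_seq += 1
        self.in_flight.append(seq)
        return seq

    def on_ack(self, seq: int):
        # acks arrive in order on the session stream
        assert self.in_flight and self.in_flight[0] == seq
        self.in_flight.popleft()
```

Keeping several batches in flight is what separates session-based streaming from the unary metadata requests: throughput is bounded by the window size rather than by one round trip per batch.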

Segment storage

Segments are stored using a pluggable "fps" storage layer with write‑ahead logging, direct I/O, and sparse indexes in RocksDB. Batching, time‑based flushing, and direct I/O improve durability and avoid double‑buffering issues.
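The sparse-index idea can be sketched independently of RocksDB: only every Nth record's byte position is indexed, and a lookup seeks to the nearest indexed entry at or before the target offset, then scans forward. This is a simplified stand-in, not the actual "fps" layer; the stride of 4 and the class shape are assumptions.

```python
import bisect

class SparseIndex:
    def __init__(self, stride: int = 4):
        self.stride = stride
        self.offsets = []    # indexed record offsets (logical positions)
        self.positions = []  # corresponding byte positions in the segment

    def maybe_index(self, offset: int, byte_pos: int):
        # index only every `stride`-th record, keeping the index small
        if offset % self.stride == 0:
            self.offsets.append(offset)
            self.positions.append(byte_pos)

    def seek(self, target_offset: int) -> int:
        """Byte position to start scanning forward from for target_offset."""
        i = bisect.bisect_right(self.offsets, target_offset) - 1
        return self.positions[max(i, 0)]
```

The trade-off is a short forward scan (at most `stride - 1` records) in exchange for an index a fraction of the size of a dense one — a good fit when segments are written once, sealed, and read sequentially.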

Key takeaways

Northguard achieves high scalability through storage‑compute separation and log striping.

Metadata is managed by a Raft‑based vnode system, providing strong consistency.

SWIM‑based membership and session‑based streaming protocols enable low‑latency, high‑throughput operations.

Deep comparison: Northguard vs. Apache Pulsar

The architectural concepts of Northguard—separated storage and compute, segment‑based storage, and log striping—are strikingly similar to Apache Pulsar’s design, which has been proven in large‑scale deployments since 2016.

While LinkedIn could adopt Pulsar directly, historical investment in Kafka tooling, internal requirements, and the “Not Invented Here” mindset likely motivated the development of a custom solution that nevertheless validates Pulsar’s design principles.

Conclusion

LinkedIn’s Northguard indirectly confirms the correctness of Apache Pulsar’s architecture, demonstrating that the challenges Pulsar was built to solve—scalability, operability, and cloud‑native deployment—are exactly the problems faced by large‑scale messaging platforms today.

Tags: distributed systems, scalability, log storage, Apache Pulsar, LinkedIn, Northguard
Written by Wukong Talks Architecture

Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.