How to Build a Kafka‑Level High‑Performance Message Queue from Scratch

This article presents a step‑by‑step guide to designing and implementing the core of a Kafka‑class distributed, log‑based message queue, covering architecture, sequential writes, sparse indexing, zero‑copy I/O, partitioning, replication, consumer‑group metadata, batch pipelines, crash recovery, and performance benchmarks.

Ray's Galactic Tech

Overall Architecture Skeleton

Producer
   ↓
Gateway / Router
   ↓
Partition Leader
   ↓
Append Log (Segment + WAL)
   ↓
Followers Replication
   ↓
Commit Index
   ↓
Consumer

Core data structures:

Topic
   └── Partition
        ├── Segment Log (data)
        ├── Index Files (offset → file position)
        ├── TimeIndex (timestamp → offset)
        ├── WAL
        └── Replica State

Log as a Database – Sequential Write is the Primary Driver

1. Segment File Layout

/data/topicA/partition-0/
├── 00000000000000000000.log
├── 00000000000000000000.index
├── 00000000000000000000.timeindex
├── 00000000001073741824.log
...
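Each file name encodes the base offset of the segment's first record, zero‑padded to 20 digits; the second segment in the layout above begins at the 1 GiB roll boundary. A minimal naming sketch (the helper names are illustrative):

```python
def segment_basename(base_offset: int) -> str:
    """20-digit, zero-padded base offset, matching the layout above."""
    return f"{base_offset:020d}"

def segment_files(base_offset: int) -> tuple[str, str, str]:
    """Log, offset-index, and time-index file names for one segment."""
    stem = segment_basename(base_offset)
    return (f"{stem}.log", f"{stem}.index", f"{stem}.timeindex")

# The second segment above starts at 1 GiB = 1073741824 bytes.
```

Naming segments by base offset means a broker can locate the segment holding any offset with a directory listing and a binary search, before touching any index.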

2. Write Model

struct Record {
    uint64_t offset;
    uint32_t length;
    uint8_t data[];
};

Append flow (simplified):

uint64_t append(MessageBatch batch) {
    wal.append(batch);          // 1. write WAL first
    wal.fsync();                // 2. ensure durability
    uint64_t offset = segment.append(batch); // 3. write main log
    if (needFlush()) {
        segment.fsync_async(); // optional async flush
    }
    return offset;
}

All writes are strictly sequential.

OS page‑cache can deliver >3 GB/s throughput.

SSD sequential write typically >1 GB/s.
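The WAL‑first ordering of the append flow can be sketched in runnable form. This is a minimal single‑threaded sketch: the class name, the record framing, and the per‑append fsync are assumptions for illustration; a real broker amortizes the fsync across batches.

```python
import os
import struct

class WalLog:
    """WAL-first append: shows the durability ordering only (no batching, no CRCs)."""

    def __init__(self, wal_path: str, log_path: str):
        flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
        self.wal_fd = os.open(wal_path, flags, 0o644)
        self.log_fd = os.open(log_path, flags, 0o644)
        self.next_offset = 0

    def append(self, payload: bytes) -> int:
        # Frame: 8-byte offset + 4-byte length, then the payload.
        frame = struct.pack("<QI", self.next_offset, len(payload)) + payload
        os.write(self.wal_fd, frame)   # 1. write WAL first
        os.fsync(self.wal_fd)          # 2. durable before acknowledging
        os.write(self.log_fd, frame)   # 3. main log, flushed lazily
        offset = self.next_offset
        self.next_offset += 1
        return offset
```

Because both files are opened with O_APPEND and only ever appended to, every write the kernel sees is sequential, which is what the throughput numbers above depend on.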

Index System – Small and Efficient

Sparse index + binary search + sequential scan

1. Sparse Index Structure

struct IndexEntry {
    uint64_t offset;
    uint64_t file_pos;
};

One index entry is stored for every 1,000 records.

Offset:   0   1000   2000   3000
Index : [0] → [1000] → [2000] → [3000]

Lookup example for offset = 2150:

floorEntry(2000) → sequential scan of 150 records
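The floorEntry‑then‑scan lookup can be sketched with a sorted array and binary search (class and method names are illustrative; the 1,000‑record interval matches the example above):

```python
import bisect

class SparseIndex:
    """One (offset, file_pos) entry per interval; lookup = floorEntry + short scan."""

    def __init__(self, interval: int = 1000):
        self.interval = interval
        self.offsets = []    # sorted indexed offsets
        self.positions = []  # parallel file positions

    def maybe_add(self, offset: int, file_pos: int):
        """Record an entry only at interval boundaries: the index stays tiny."""
        if offset % self.interval == 0:
            self.offsets.append(offset)
            self.positions.append(file_pos)

    def floor_entry(self, target: int):
        """Largest indexed offset <= target, its file position,
        and how many records must be scanned past it."""
        i = bisect.bisect_right(self.offsets, target) - 1
        if i < 0:
            return None
        return self.offsets[i], self.positions[i], target - self.offsets[i]
```

The scan distance is bounded by the interval, so the trade‑off is explicit: a denser index costs memory, a sparser one costs a longer tail scan.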

2. Tiered Index Cache

ThreadLocal Cache
   ↓
Hot Index (SkipList)
   ↓
Cold Index (MMap)

A lookup checks the fast tier first and falls back to a floorEntry lookup in the slower, memory‑mapped tier:

IndexEntry find(uint64_t target) {
    if (l1.contains(target)) return l1.get(target); // hot-tier hit
    auto e = l2.floorEntry(target);                 // fall back to mmap'd index
    l1.put(target, e);                              // promote into the hot tier
    return e;
}

Zero‑Copy I/O – Keeping CPU Out of the Data Path

1. sendfile

sendfile(socket_fd, log_fd, &offset, length);

Data path: Disk → PageCache → NIC, no user‑space copy.
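On Linux this maps directly onto os.sendfile. A hypothetical serving helper (the function name and the retry loop are illustrative; sendfile may send fewer bytes than requested, so callers loop):

```python
import os
import socket

def serve_segment(sock: socket.socket, log_path: str, offset: int, length: int) -> int:
    """Ship a slice of a log segment to a client with no user-space copy."""
    fd = os.open(log_path, os.O_RDONLY)
    try:
        sent = 0
        while sent < length:
            # Kernel moves data Disk -> PageCache -> socket; nothing is
            # copied into this process's buffers.
            n = os.sendfile(sock.fileno(), fd, offset + sent, length - sent)
            if n == 0:
                break  # EOF reached before `length` bytes
            sent += n
        return sent
    finally:
        os.close(fd)
```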

2. writev Batch Send

writev(sock, iov, iovcnt);

Reduces system‑call overhead by sending multiple buffers in one call.
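A sketch of batched sending with os.writev, assuming a simple 4‑byte length‑prefixed framing (the framing is an illustration, not the article's wire format):

```python
import os
import struct

def send_batch(fd: int, records: list[bytes]) -> int:
    """Frame each record with a length header, then flush the whole batch
    with a single writev call instead of 2 * len(records) write calls."""
    iov = []
    for payload in records:
        iov.append(struct.pack("<I", len(payload)))  # 4-byte length header
        iov.append(payload)
    return os.writev(fd, iov)  # one syscall, many buffers
```

Headers and payloads stay in separate buffers; the kernel gathers them, so no user‑space concatenation (and no extra copy) is needed.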

Partition – Foundation of Scalability

One Topic = N sequential‑write logs:

Topic
   ├── Partition 0 → Node1
   ├── Partition 1 → Node2
   ├── Partition 2 → Node3

Producer selects a partition with a simple hash:

partition = hash(key) % N;
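One practical caveat: the hash must be stable across producer processes, or the same key lands on different partitions between runs and per‑key ordering is lost. A sketch using CRC32 (chosen here because Python's built‑in hash() is salted per process):

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """hash(key) % N, using CRC32 so the mapping is identical in every
    producer process; Python's hash() is randomized per process."""
    return zlib.crc32(key) % num_partitions
```

All records for one key always reach the same partition, which is what makes per‑key ordering possible without any cross‑partition coordination.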

Replication Protocol – Strong Reliability

Leader
   ├── Follower1
   └── Follower2

Append and replicate flow:

void appendAndReplicate(batch, acks) {
    uint64_t offset = leader.append(batch);
    sendToFollowers(batch);                   // replicate regardless of acks
    if (acks == ACKS_0) return;               // fire‑and‑forget: don't wait
    if (acks == ACKS_ALL) {
        waitForMajority();                    // block until a majority has acked
    }
    commitIndex = offset;                     // advance the commit index
}

Acknowledgement levels (ordered by reliability):

0 – no broker acknowledgement; lowest reliability, minimal latency.

1 – the leader has appended the batch; medium reliability, low latency.

all – every in‑sync replica has acknowledged; highest reliability, slightly higher latency.

Consumer Group Metadata System

The server tracks consumption progress centrally, per group, topic, and partition:

struct ConsumerOffset {
    string   group;
    string   topic;
    int      partition;
    uint64_t offset;
};

Offsets are stored in one of the following durable stores:

Built‑in MetaLog.

RocksDB.

A dedicated internal topic (Kafka‑style).
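Whatever the backing store, the interface reduces to a commit/fetch map keyed by (group, topic, partition). A minimal in‑memory sketch (the durable backend is abstracted away; names are illustrative):

```python
class OffsetStore:
    """Server-side consumption progress keyed by (group, topic, partition).
    Durability (MetaLog, RocksDB, or an internal topic) is out of scope here."""

    def __init__(self):
        self._offsets = {}

    def commit(self, group: str, topic: str, partition: int, offset: int):
        self._offsets[(group, topic, partition)] = offset

    def fetch(self, group: str, topic: str, partition: int) -> int:
        # A group that has never committed starts from offset 0.
        return self._offsets.get((group, topic, partition), 0)
```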

Batch Processing Pipeline

Producer

class BatchProducer:
    def __init__(self, max_batch=100):
        self.buffer = []
        self.max_batch = max_batch

    def send(self, msg):
        self.buffer.append(msg)
        if len(self.buffer) >= self.max_batch or self.linger_expired():
            self.flush()  # ship the entire buffer in one request

Consumer

class BatchFetcher:
    def fetch(self, max_bytes):
        ...  # read up to max_bytes from the log

Batching can increase throughput by an order of magnitude.

Crash Recovery Procedure

Read the WAL.

Replay any unflushed batches.

Validate segment files (checksum, length).

Rebuild missing index files.

Synchronize the commit index with the leader.

Resume normal service.
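Steps 1–4 can be sketched as a single scan over a segment file. The framing below (a 4‑byte length plus a CRC32 per record) is an assumption for illustration; the key idea is truncating at the first torn or corrupt record so everything that survives is verified.

```python
import os
import struct
import zlib

FRAME_HEADER = struct.Struct("<II")  # (length, crc32) per record -- assumed framing

def recover_segment(path: str) -> tuple[int, int]:
    """Scan a segment, validating each record's length and checksum, and
    truncate the file at the first torn record.
    Returns (records_recovered, valid_bytes)."""
    with open(path, "rb") as f:
        data = f.read()
    pos, count, valid = 0, 0, 0
    while pos + FRAME_HEADER.size <= len(data):
        length, crc = FRAME_HEADER.unpack_from(data, pos)
        body = data[pos + FRAME_HEADER.size : pos + FRAME_HEADER.size + length]
        if len(body) < length or zlib.crc32(body) != crc:
            break  # torn write: discard this record and everything after it
        pos += FRAME_HEADER.size + length
        count, valid = count + 1, pos
    os.truncate(path, valid)  # drop the corrupt tail
    return count, valid
```

Rebuilding the index files (step 4) then becomes a second pass over the now‑verified segment, re‑emitting one sparse entry per interval.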

Performance Metrics on a Single NVMe Node

Sequential Write : 1–3 GB/s.

Single‑Partition Throughput : 200–500 MB/s.

Message TPS : millions of messages per second.

Latency : 1–5 ms typical.

One‑Sentence Architectural Summary

Kafka is not merely a "message middleware"; it is a distributed storage engine built around sequential logs combined with a streaming consumption model.
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Indexing, Kafka, log storage, Replication, Message Queue, zero-copy