How to Build a Kafka‑Level High‑Performance Message Queue from Scratch
This article presents a step‑by‑step guide to designing and implementing a Kafka‑class distributed log‑based message queue kernel, covering architecture, sequential writes, sparse indexing, zero‑copy I/O, partitioning, replication, consumer‑group metadata, batch pipelines, crash recovery, and performance benchmarks.
Overall Architecture Skeleton
Producer
↓
Gateway / Router
↓
Partition Leader
↓
Append Log (Segment + WAL)
↓
Followers Replication
↓
Commit Index
↓
Consumer

Core data structures:
Topic
└── Partition
├── Segment Log (data)
├── Index Files (offset → file position)
├── TimeIndex (timestamp → offset)
├── WAL
└── Replica State

Log as a Database – Sequential Write is the Primary Driver
1. Segment File Layout
/data/topicA/partition-0/
├── 00000000000000000000.log
├── 00000000000000000000.index
├── 00000000000000000000.timeindex
├── 00000000001073741824.log
...

2. Write Model
struct Record {
    uint64_t offset;   // logical offset within the partition
    uint32_t length;   // payload size in bytes
    uint8_t  data[];   // flexible array member holding the payload
};

Append flow (simplified):
uint64_t append(MessageBatch batch) {
    wal.append(batch);                       // 1. write the WAL first
    wal.fsync();                             // 2. fsync: the durability point
    uint64_t offset = segment.append(batch); // 3. append to the main log
    if (needFlush()) {
        segment.fsync_async();               // optional asynchronous flush
    }
    return offset;
}

All writes are strictly sequential.
With purely sequential appends, the OS page cache can absorb writes at more than 3 GB/s, and NVMe SSDs typically sustain sequential writes above 1 GB/s.
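As a concrete illustration, the WAL-first append path above can be sketched in Python. The `Segment` class, file handling, and frame layout here are illustrative assumptions, not the article's actual implementation:

```python
import os
import struct

class Segment:
    """Minimal sketch of a WAL-first append path (illustrative)."""

    def __init__(self, log_path, wal_path):
        self.log = open(log_path, "ab")
        self.wal = open(wal_path, "ab")
        self.next_offset = 0

    def append(self, payload: bytes) -> int:
        # frame = [offset: u64][length: u32][payload]
        frame = struct.pack("<QI", self.next_offset, len(payload)) + payload
        self.wal.write(frame)          # 1. write the WAL first
        self.wal.flush()
        os.fsync(self.wal.fileno())    # 2. fsync: the durability point
        self.log.write(frame)          # 3. main log, flushed lazily
        offset = self.next_offset
        self.next_offset += 1
        return offset
```

The fsync on the WAL is the durability point; the main log can be flushed lazily because any unflushed suffix can be replayed from the WAL after a crash.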
Index System – Small and Efficient
Sparse index + binary search + sequential scan
1. Sparse Index Structure
struct IndexEntry {
    uint64_t offset;    // logical message offset
    uint64_t file_pos;  // byte position in the segment file
};

One index entry is stored for every 1,000 records.
Offset: 0      1000    2000    3000
Index : [0] → [1000] → [2000] → [3000]

Lookup example for offset = 2150:

    floorEntry(2000) → sequential scan of 150 records

2. Tiered Index Cache
ThreadLocal Cache
        ↓
Hot Index (SkipList)
        ↓
Cold Index (MMap)

IndexEntry find(uint64_t target) {
    if (l1.contains(target)) return l1.get(target);  // L1 hit: thread-local cache
    auto e = l2.floorEntry(target);                  // fall back to the colder tiers
    l1.put(target, e);                               // promote the entry into L1
    return e;
}

Zero‑Copy I/O – Keeping CPU Out of the Data Path
1. sendfile
sendfile(socket_fd, log_fd, &offset, length);

Data path: Disk → PageCache → NIC, with no copy through user space.
2. writev Batch Send
writev(sock, iov, iovcnt);

Sending multiple buffers in a single call reduces system‑call overhead.
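On Linux, sendfile(2) is exposed in Python as `os.sendfile`, which is enough to sketch the zero‑copy read path. The `serve_segment` helper below is a hypothetical name, not from the original:

```python
import os

def serve_segment(sock, log_fd, offset, length):
    """Stream `length` bytes of a segment file to a socket via
    sendfile(2); the kernel moves pages from the page cache to the
    socket buffer, so user space never touches the data."""
    sent = 0
    while sent < length:
        n = os.sendfile(sock.fileno(), log_fd, offset + sent, length - sent)
        if n == 0:          # EOF reached before `length` bytes
            break
        sent += n
    return sent
```

Note this is Linux-specific: `os.sendfile` requires a regular file as the source and is the standard way brokers keep the CPU out of the consumer fetch path.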
Partition – Foundation of Scalability
One Topic = N sequential‑write logs
Topic
├── Partition 0 → Node1
├── Partition 1 → Node2
└── Partition 2 → Node3

Producer selects a partition with a simple hash:

    partition = hash(key) % N;

Replication Protocol – Strong Reliability
Leader
├── Follower1
└── Follower2

Append and replicate flow:
void appendAndReplicate(MessageBatch batch, AckLevel acks) {
    uint64_t offset = leader.append(batch);
    if (acks == ACKS_0) return;   // fire‑and‑forget: reply immediately
    sendToFollowers(batch);       // replicate to followers
    if (acks == ACKS_ALL) {
        waitForMajority();        // block until a majority has acked
    }
    commitIndex = offset;         // advance the commit index
}

Acknowledgement levels (ordered by reliability):
acks = 0 – no broker acknowledgement: lowest reliability, minimal latency.
acks = 1 – the leader has appended the batch: medium reliability, low latency.
acks = all – a majority of replicas have acked: highest reliability, slightly higher latency.
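For acks = all, the commit index advances to the highest offset stored on a quorum of replicas. A hypothetical helper (`committed_offset` is not from the original) captures the rule:

```python
def committed_offset(replica_offsets, quorum):
    """Highest offset already replicated on at least `quorum` replicas
    (the leader counts as one of them)."""
    return sorted(replica_offsets, reverse=True)[quorum - 1]
```

For example, with the leader at offset 10 and followers at 8 and 5, a quorum of 2 allows committing up to offset 8; requiring all 3 replicas would hold the commit index at 5.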
Consumer Group Metadata System
Server‑side unified consumption progress
struct ConsumerOffset {
    string   group;      // consumer group ID
    int      partition;  // partition the group is reading
    uint64_t offset;     // next offset to consume
};

Offsets are stored in one of the following durable stores:
Built‑in MetaLog.
RocksDB.
A dedicated internal topic (Kafka‑style).
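A minimal in-memory sketch of the commit/fetch API this structure implies (the `OffsetStore` class is hypothetical; a real broker would back it with one of the durable stores above):

```python
class OffsetStore:
    """In-memory sketch of server-side consumer offset tracking."""

    def __init__(self):
        self._offsets = {}

    def commit(self, group: str, partition: int, offset: int) -> None:
        self._offsets[(group, partition)] = offset

    def fetch(self, group: str, partition: int) -> int:
        # -1 means "no committed offset yet": the consumer starts
        # according to its reset policy (earliest/latest)
        return self._offsets.get((group, partition), -1)
```

Keying by (group, partition) is what lets many groups consume the same partition independently, each with its own progress.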
Batch Processing Pipeline
Producer
class BatchProducer:
    def send(self, msg):
        self.buffer.append(msg)
        if len(self.buffer) >= 100 or self.timeout():
            self.flush()

Consumer

class BatchFetcher:
    def fetch(self, max_bytes):
        ...  # read up to max_bytes from the log

Batching can increase throughput by an order of magnitude.
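A self-contained version of the producer side, with the buffer, size threshold, and linger timeout made explicit (`transport`, `max_batch`, and `linger_s` are assumed names, not from the original):

```python
import time

class BatchProducer:
    """Buffers messages and flushes when the batch is full or too old."""

    def __init__(self, transport, max_batch=100, linger_s=0.01):
        self.transport = transport    # callable taking a list of messages
        self.max_batch = max_batch
        self.linger_s = linger_s
        self.buffer = []
        self.oldest = None            # monotonic time of the oldest buffered msg

    def send(self, msg):
        if self.oldest is None:
            self.oldest = time.monotonic()
        self.buffer.append(msg)
        if len(self.buffer) >= self.max_batch or self._expired():
            self.flush()

    def _expired(self):
        return (self.oldest is not None
                and time.monotonic() - self.oldest >= self.linger_s)

    def flush(self):
        if self.buffer:
            self.transport(self.buffer)  # one network/disk call per batch
            self.buffer = []
            self.oldest = None
```

The linger timeout bounds latency for low-traffic topics, while the size threshold amortizes per-call overhead under load.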
Crash Recovery Procedure
1. Read the WAL.
2. Replay any unflushed batches.
3. Validate segment files (checksum, length).
4. Rebuild missing index files.
5. Synchronize the commit index with the leader.
6. Resume normal service.
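The first steps hinge on telling a complete record from a torn one. A sketch of CRC-guarded WAL recovery (the record format and both helpers are assumptions for illustration, not the article's actual code):

```python
import struct
import zlib

def append_record(payload: bytes) -> bytes:
    """Frame a payload as [length: u32][crc32: u32][payload]."""
    return struct.pack("<II", len(payload), zlib.crc32(payload)) + payload

def recover(wal_bytes: bytes):
    """Scan length-prefixed, CRC-guarded records and return the valid
    payloads, truncating at the first partial or corrupt record
    (the usual torn-write case after a crash)."""
    records, pos = [], 0
    while pos + 8 <= len(wal_bytes):
        length, crc = struct.unpack_from("<II", wal_bytes, pos)
        payload = wal_bytes[pos + 8 : pos + 8 + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break                     # torn or corrupt tail: stop here
        records.append(payload)
        pos += 8 + length
    return records
```

Everything before the truncation point is replayed into the segment log; everything after it was never acknowledged and can be safely discarded.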
Performance Metrics on a Single NVMe Node
Sequential Write : 1–3 GB/s.
Single‑Partition Throughput : 200–500 MB/s.
Message TPS : millions of messages per second.
Latency : 1–5 ms typical.
One‑Sentence Architectural Summary
Kafka is not merely a "message middleware"; it is a distributed storage engine built around sequential logs combined with a streaming consumption model.
Ray's Galactic Tech
Practice together, never alone. We cover programming languages, development tools, learning methods, and pitfall notes. We simplify complex topics, guiding you from beginner to advanced. Weekly practical content—let's grow together!