Databases 38 min read

Building a Distributed Database Storage Engine: From LSM Tree to Data Sharding

This article walks through building a database storage engine from a simple shell script to a full distributed key‑value system, covering in‑memory indexing, SSTable creation, LSM‑Tree architecture with compaction, replication strategies, and sharding techniques for scaling across multiple machines.

Tencent Cloud Developer

Dec 4, 2024

Building a Distributed Database Storage Engine: From LSM Tree to Data Sharding

This article provides a comprehensive exploration of building a database storage engine from scratch, starting with a simple two-line shell script and progressively evolving into a distributed key-value database.

Part 1: Single-Machine Storage Engine

The author begins by demonstrating the most basic storage engine using a shell script with db_set and db_get functions. While write performance is excellent (append-only), read performance suffers due to full file scans. To address this, the article introduces in-memory Hash Map indexing to improve read performance by storing byte offsets of keys.

To prevent disk space exhaustion from multiple writes to the same key, the author proposes segmented files with compaction. When keys are sorted within segments, they become SSTables (Sorted String Tables), enabling efficient merge operations using multi-way merge sort and sparse indexing in memory.

The article details how to construct and maintain SSTables using skip lists in memory (memtable and immutable), and explains the SSTable file format including DataBlock, IndexBlock, and Footer structure. The reading process involves multiple levels of binary search: first in the IndexBlock to locate the DataBlock, then within the DataBlock to find the actual key-value pair.

To handle data deletion and crash recovery, the article introduces tombstones (special markers for deleted records) and Write-Ahead Log (WAL). The complete architecture forms an LSM Tree (Log-Structured Merge Tree), which is the foundation of databases like LevelDB, RocksDB, Cassandra, and HBase.

The article then compares LSM Tree with B+ Tree: LSM Tree excels in write-heavy workloads due to sequential writes, while B+ Tree performs better for read-heavy scenarios. LSM Tree suffers from write amplification during compaction, while B+ Tree experiences write amplification from random writes and page splits.

Part 2: Data Replication

To ensure reliability against machine failures, the article discusses three replication strategies: master-slave replication, multi-master replication, and leaderless replication. Each approach has trade-offs in consistency, availability, and partition tolerance.

Part 3: Data Sharding

For scenarios where a single machine cannot store all data, the article explores sharding strategies including range-based partitioning and hash-based partitioning, along with rebalancing mechanisms and request routing approaches.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

LSM‑Tree Storage Engine distributed database data sharding B+Tree SSTable WAL

Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.