Why LSM Tree Powers Modern Key‑Value Stores: Design, Write Path & Compaction
This article explains the fundamentals of Log‑Structured Merge (LSM) Trees, covering their write‑ahead log, memtable, SSTable architecture, compaction processes, read/write optimizations, and popular open‑source implementations such as LevelDB, RocksDB, and GoLevelDB.
What is an LSM Tree?
The LSM Tree (Log-Structured Merge Tree) is a storage architecture that maximizes sequential writes; it was originally designed for mechanical disks, where sequential I/O is far faster than random I/O. Its core component is the log file, and it functions as a key/value storage engine.
Question 1: What is the LSM Tree storage engine?
It is simply a key/value storage engine.
Question 2: How does a user write data?
The write flow consists of two steps: appending to a log file and updating an in‑memory structure (memtable). This makes the write latency very low.
Question 3: How does a user delete data?
Deletion is also an append: a delete marker (a tombstone) is written to the log; existing data is never modified in place.
Question 4: Is the storage persistent? Will data be lost on power failure?
The storage is persistent: the log is flushed to disk, so data survives a power failure and can be recovered by replaying the log file on restart.
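The recovery idea can be sketched as a log replay into a fresh in-memory table. The record format below ("set"/"del" strings) is a simplified stand-in for LevelDB's actual binary log format:

```go
package main

import (
	"fmt"
	"strings"
)

// replay rebuilds the in-memory state by scanning log records in order.
// Later records win, so a delete marker erases an earlier write.
func replay(log []string) map[string]string {
	mem := make(map[string]string)
	for _, rec := range log {
		parts := strings.SplitN(rec, " ", 3)
		switch parts[0] {
		case "set":
			mem[parts[1]] = parts[2]
		case "del":
			delete(mem, parts[1])
		}
	}
	return mem
}

func main() {
	log := []string{"set a 1", "set b 2", "del a", "set b 3"}
	fmt.Println(replay(log)) // only the surviving state remains: b=3
}
```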
Question 5: Log files have no built‑in search capability—what then?
Since logs are ordered by time but unordered by key, LSM introduces ordered SST files (Sorted String Table) to enable efficient reads.
Question 6: Why are there many SST files?
Log data is continuously converted into new SST files, so the system accumulates many ordered SST files; each file is divided into blocks, and each block contains restart points.
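The restart-point idea inside a block can be sketched with prefix compression: most keys store only the bytes they don't share with the previous key, while every Nth key is written in full so a reader can start decoding there. The `restartInterval` of 2 below is chosen for illustration (LevelDB defaults to 16):

```go
package main

import "fmt"

// entry stores a key with prefix compression: the number of bytes shared
// with the previous key plus the unshared suffix. Restart points store
// shared=0, so decoding can begin there without earlier context.
type entry struct {
	shared int
	suffix string
}

const restartInterval = 2 // for illustration; LevelDB's default is 16

func encode(keys []string) []entry {
	var out []entry
	prev := ""
	for i, k := range keys {
		shared := 0
		if i%restartInterval != 0 { // not a restart point: compress
			for shared < len(prev) && shared < len(k) && prev[shared] == k[shared] {
				shared++
			}
		}
		out = append(out, entry{shared, k[shared:]})
		prev = k
	}
	return out
}

func decode(entries []entry) []string {
	var keys []string
	prev := ""
	for _, e := range entries {
		k := prev[:e.shared] + e.suffix
		keys = append(keys, k)
		prev = k
	}
	return keys
}

func main() {
	keys := []string{"apple", "applet", "apply", "banana"}
	enc := encode(keys)
	fmt.Println(enc, decode(enc))
}
```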
Question 7: How is redundant or deleted space reclaimed?
Valid data is read from SST files and written to new files; old files are then removed. This process is called compaction.
Design Philosophy of LSM Tree
LSM optimizes write performance by keeping I/O sequential, using batching and aggregation. Reads are slower and rely on additional structures such as caches, pre‑fetching, and various indexes.
LSM Tree Architecture
CURRENT – a text file storing the name of the current MANIFEST file.
MANIFEST – a binary file containing metadata for the whole database, such as the SST files at each level and their key ranges.
Log (WAL) – the write‑ahead log that directly impacts write performance.
Memtable and Immutable Memtable – in‑memory structures, often implemented as skip‑lists.
SSTable files – ordered files storing key/value pairs; size varies.
Both the memtable and the SSTables keep keys in sorted order, which is what makes read-time optimizations such as binary search and range scans possible.
Data flows from the log to the memtable, then to Level‑0 SSTables, and subsequently merges down to lower levels.
Write Process
Data is first written to the log file.
Data is then inserted into the memtable.
This two‑step flow ensures durability (log persistence) and extreme write efficiency.
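A minimal sketch of this two-step write path, with an in-memory slice standing in for the fsync'd on-disk log file:

```go
package main

import "fmt"

// db sketches the two-step write path: append to a log, then update the
// memtable. Real engines use a sorted structure (e.g. a skip list) for
// the memtable; a plain map is used here for brevity.
type db struct {
	wal      []string          // write-ahead log (append-only)
	memtable map[string]string // in-memory table
}

func newDB() *db { return &db{memtable: make(map[string]string)} }

func (d *db) Put(key, value string) {
	d.wal = append(d.wal, "set "+key+" "+value) // step 1: sequential append
	d.memtable[key] = value                     // step 2: in-memory update
}

func (d *db) Delete(key string) {
	d.wal = append(d.wal, "del "+key) // deletion is also just an append
	delete(d.memtable, key)           // real engines insert a tombstone instead
}

func main() {
	d := newDB()
	d.Put("k1", "v1")
	d.Delete("k1")
	fmt.Println(len(d.wal), d.memtable)
}
```

Both steps are cheap (one sequential append, one memory update), which is where the low write latency comes from.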
Read Process
Reading starts from the memtable; if the key is not found, the system checks Level‑0 SSTables, then Level‑1, Level‑2, and so on, resulting in higher I/O cost compared to writes.
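That level-by-level lookup order can be sketched as follows, with plain maps standing in for the memtable and each level's SSTables:

```go
package main

import "fmt"

// get sketches the LSM read path: consult the memtable first, then each
// level from newest to oldest; the first hit wins, because newer data
// shadows older data for the same key.
func get(key string, memtable map[string]string, levels []map[string]string) (string, bool) {
	if v, ok := memtable[key]; ok {
		return v, true
	}
	for _, level := range levels { // levels[0] is Level-0, and so on
		if v, ok := level[key]; ok {
			return v, true
		}
	}
	return "", false // a miss may cost one probe per level
}

func main() {
	mem := map[string]string{"a": "new"}
	levels := []map[string]string{
		{"b": "l0"},
		{"a": "old", "c": "l1"},
	}
	fmt.Println(get("a", mem, levels)) // memtable shadows Level-1's old value
}
```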
To improve read performance, LSM implementations add indexes such as:
Block index at the end of each SSTable.
Bloom filters for quick existence checks.
Restart point arrays inside each block.
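A Bloom filter can be sketched in a few lines. The sizing below (64 bits, 3 probes) is arbitrary for illustration; real implementations tune bits-per-key per SSTable:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a minimal Bloom filter: k hash probes set/test bits in a
// fixed-size bit array. A negative answer is definitive, which lets a
// read skip an SSTable entirely; positives may be false.
type bloom struct {
	bits []bool
	k    uint32
}

func newBloom(m int, k uint32) *bloom { return &bloom{bits: make([]bool, m), k: k} }

func (b *bloom) indexes(key string) []uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	h1 := h.Sum32()
	h2 := h1>>17 | h1<<15 // second hash derived by rotation (double hashing)
	idx := make([]uint32, b.k)
	for i := uint32(0); i < b.k; i++ {
		idx[i] = (h1 + i*h2) % uint32(len(b.bits))
	}
	return idx
}

func (b *bloom) Add(key string) {
	for _, i := range b.indexes(key) {
		b.bits[i] = true
	}
}

func (b *bloom) MayContain(key string) bool {
	for _, i := range b.indexes(key) {
		if !b.bits[i] {
			return false // definitely absent
		}
	}
	return true // probably present
}

func main() {
	f := newBloom(64, 3)
	f.Add("apple")
	fmt.Println(f.MayContain("apple"), f.MayContain("durian"))
}
```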
Compaction Process
LevelDB defines two compaction types:
Minor compaction – moves data from memtable to Level‑0 SSTables.
Major compaction – merges SSTables across levels.
Level-0 files may overlap in key range, while files within each deeper level (Level-1 and below) do not; since every overlapping Level-0 file must be checked on a read, the number of Level-0 files must be kept small.
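The merge step at the heart of major compaction can be sketched as follows: newer tables shadow older ones, and tombstoned keys are dropped entirely (safe when merging into the lowest level, since nothing older can resurrect them):

```go
package main

import (
	"fmt"
	"sort"
)

// tombstone marks a deleted key; compaction reclaims its space by
// dropping such entries when merging into the lowest level.
const tombstone = "\x00DEL"

// compact merges several tables into one: for each key, keep only the
// newest value (tables[0] is newest) and discard tombstones.
func compact(tables []map[string]string) map[string]string {
	out := make(map[string]string)
	seen := make(map[string]bool)
	for _, t := range tables { // iterate newest table first
		for k, v := range t {
			if !seen[k] {
				seen[k] = true
				if v != tombstone {
					out[k] = v
				}
			}
		}
	}
	return out
}

func main() {
	newer := map[string]string{"a": tombstone, "b": "2"}
	older := map[string]string{"a": "1", "c": "3"}
	merged := compact([]map[string]string{newer, older})
	keys := make([]string, 0, len(merged))
	for k := range merged {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	fmt.Println(keys) // "a" dropped by its tombstone; "b" and "c" survive
}
```

Real compaction streams sorted files and writes sorted output rather than building maps, but the shadowing and tombstone rules are the same.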
Notable Open‑Source Implementations
LevelDB – Google’s C++ key/value store (github.com/google/leveldb).
RocksDB – Facebook’s optimized fork of LevelDB (github.com/facebook/rocksdb).
GoLevelDB – A Go implementation (github.com/syndtr/goleveldb).
For Go developers interested in LSM, studying GoLevelDB’s source is highly recommended.
Parsing LSM Files with Python
The author used a Python script to parse LevelDB’s manifest and SST files, gaining deeper insight.
Example of writing a uint64 value in Go:

    binary.BigEndian.PutUint64(buf, 0x123456789)

Resulting byte sequence:

    0x00, 0x00, 0x00, 0x01, 0x23, 0x45, 0x67, 0x89

Reading the same bytes back in Python:

    struct.unpack(">Q", buf)

Key takeaway: data crossing the boundary between memory, disk, or the network is best represented and reasoned about as byte arrays.
Why LSM Is Being Questioned Today
In the HDD era, sequential writes were essential, making LSM ideal. With SSDs offering high random performance and parallel channels, the strict sequential‑only optimization of LSM can become unnecessary overhead.
Research such as the FAST paper “WiscKey: Separating Keys from Values in SSD‑Conscious Storage” proposes SSD‑aware alternatives, leading to projects like BadgerDB and BlobDB.
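The core idea of WiscKey-style key-value separation can be sketched as follows. This is a toy illustration of the concept, not the paper's design: large values go to an append-only value log, and the LSM tree stores only keys plus offsets, so compaction moves far less data:

```go
package main

import "fmt"

// store separates keys from values: the index (which would live in the
// LSM tree) maps each key to an offset in an append-only value log.
type store struct {
	vlog  []string       // append-only value log
	index map[string]int // key -> offset into vlog
}

func newStore() *store { return &store{index: make(map[string]int)} }

func (s *store) Put(key, value string) {
	s.index[key] = len(s.vlog) // only a small entry enters the tree
	s.vlog = append(s.vlog, value)
}

func (s *store) Get(key string) (string, bool) {
	off, ok := s.index[key]
	if !ok {
		return "", false
	}
	return s.vlog[off], true // one extra read to fetch the value
}

func main() {
	s := newStore()
	s.Put("k", "a-large-value")
	fmt.Println(s.Get("k"))
}
```

The trade-off: writes and compaction touch only small index entries, at the cost of an extra lookup into the value log on reads and a separate garbage-collection pass for the log.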
Summary
LSM Tree maximizes write performance through sequential I/O and batching.
When sequential I/O is no longer the bottleneck (e.g., SSDs), LSM may add unnecessary complexity.
LSM guarantees safety and consistency via append‑only logs and atomic file renames.
Read performance is compensated by indexes, bloom filters, and caches.
Compaction occurs in two forms: minor (memtable → Level-0 SST) and major (SST → SST across levels).
Before SSDs, the only write-side optimizations available were keeping I/O sequential and batching it, while reads required many additional tricks.
NiuNiu MaTe
Joined Tencent (nicknamed "Goose Factory") through campus recruitment at a second‑tier university. Career path: Tencent → foreign firm → ByteDance → Tencent. Started as an interviewer at the foreign firm and hopes to help others.
