
Overview of LSM‑Tree Architecture and Its Use in Modern Databases

An LSM‑Tree buffers writes in an in‑memory MemTable, then flushes them to disk as sorted SSTables. Bloom filters and indexes speed up reads, and periodic compaction merges files. Systems such as LevelDB, HBase, and ClickHouse adopt this design to achieve high write throughput, at the cost of slower point and range queries and occasional compaction overhead.

DaTaobao Tech

LSM‑Tree (Log‑Structured Merge‑Tree) is a storage structure that splits data between an in‑memory component (the MemTable) and sorted, immutable on‑disk files (SSTables). A write is first appended sequentially to a commit log for durability and then inserted into the MemTable; when the MemTable reaches a size threshold, it is flushed to disk as a new SSTable.
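The write path described above can be sketched in a few lines. This is a minimal illustration, not any particular engine's implementation: the class names `MemTable` and `TinyLSM`, the `flush_threshold` parameter, and the use of Python lists in place of real files are all assumptions made for the example.

```python
import bisect


class MemTable:
    """Sorted in-memory buffer. Real engines typically use a skip list or tree."""

    def __init__(self):
        self.keys = []     # kept sorted so a flush can emit an ordered run
        self.values = {}

    def put(self, key, value):
        if key not in self.values:
            bisect.insort(self.keys, key)
        self.values[key] = value

    def items(self):
        return [(k, self.values[k]) for k in self.keys]

    def __len__(self):
        return len(self.keys)


class TinyLSM:
    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.commit_log = []   # stands in for an append-only file on disk
        self.memtable = MemTable()
        self.sstables = []     # oldest first; each is a sorted list of (key, value)

    def put(self, key, value):
        self.commit_log.append((key, value))  # 1. sequential append for durability
        self.memtable.put(key, value)         # 2. update the sorted in-memory table
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.sstables.append(self.memtable.items())  # write one sorted, immutable run
        self.memtable = MemTable()
        self.commit_log.clear()  # flushed entries no longer need to be replayed


db = TinyLSM(flush_threshold=3)
for k, v in [("b", 1), ("a", 2), ("b", 3), ("c", 4)]:
    db.put(k, v)
print(db.sstables)  # [[('a', 2), ('b', 3), ('c', 4)]]
```

Note that the overwrite of `b` is absorbed in memory, so only the latest value reaches the flushed run.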

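Read operations first probe the MemTable. If the key is not found there, the system checks SSTables from newest to oldest, consulting each SSTable’s Bloom filter first. A negative result means the key is definitely not in that file, so it is skipped without disk I/O. A positive result (which may be a false positive) triggers a binary search on the SSTable’s index to locate the data offset. If the key turns out to be absent, the next SSTable is examined.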
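A rough sketch of this lookup order follows. The `BloomFilter` and `SSTable` classes and the `lsm_get` function are hypothetical names for illustration; the filter size, the number of hash functions, and the use of SHA-256 are arbitrary assumptions, and in-memory lists stand in for the on-disk index and data blocks.

```python
import bisect
import hashlib


class BloomFilter:
    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SSTable:
    def __init__(self, sorted_items):
        self.keys = [k for k, _ in sorted_items]    # stands in for the index
        self.values = [v for _, v in sorted_items]  # stands in for the data block
        self.bloom = BloomFilter()
        for k in self.keys:
            self.bloom.add(k)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # skip this file without touching its index
        i = bisect.bisect_left(self.keys, key)  # binary search on the index
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None  # Bloom filter false positive


def lsm_get(key, memtable, sstables):
    """memtable: dict; sstables: list of SSTable ordered oldest to newest."""
    if key in memtable:
        return memtable[key]
    for table in reversed(sstables):  # newest first, so fresh values win
        value = table.get(key)
        if value is not None:
            return value
    return None
```

Because the binary search always confirms a positive filter result, a false positive costs only a wasted index probe and never a wrong answer.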

Each SSTable is accompanied by an index table and a Bloom filter, which together reduce disk I/O. As SSTables accumulate, a compaction process merges them and discards deleted or overwritten entries. Compaction can be minor (flushing the MemTable to an SSTable), major (merging SSTables across levels), or full (merging all SSTables).
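The merge step can be sketched as a k-way merge over sorted runs in which the newest version of each key wins. The `TOMBSTONE` marker and the `compact` function are hypothetical names for this example, and the sketch assumes deletions are recorded as tombstone values and that it is safe to drop them (that is, no older run outside the merge still holds the key).

```python
import heapq

TOMBSTONE = object()  # hypothetical marker written in place of a deleted value


def compact(sstables):
    """Merge sorted runs, given oldest to newest, into one sorted run."""
    # Tag each entry with a rank so that, for equal keys, the newest sorts first.
    streams = [
        [(key, -age, value) for key, value in table]
        for age, table in enumerate(sstables)
    ]
    merged = []
    last_key = object()  # sentinel that equals no real key
    for key, _, value in heapq.merge(*streams, key=lambda e: (e[0], e[1])):
        if key == last_key:
            continue  # an older, overwritten version of a key already handled
        last_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))  # keep only live entries
    return merged


old = [("a", 1), ("b", 2), ("c", 3)]
new = [("b", 20), ("c", TOMBSTONE), ("d", 4)]
print(compact([old, new]))  # [('a', 1), ('b', 20), ('d', 4)]
```

Since every input is already sorted, the merge reads each run sequentially, which is why compaction is I/O heavy but friendly to disks.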

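LevelDB extends the basic LSM‑Tree with several additions: an Immutable MemTable, which lets writes continue in a fresh MemTable while the full one is flushed; a multi‑level SSTable hierarchy (level 0, level 1, ...) in which freshly flushed files land in level 0 and are compacted into deeper levels over time; a MANIFEST file that records SSTable metadata; and a CURRENT file that points to the latest MANIFEST.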
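The role of the immutable MemTable can be illustrated with a two-buffer sketch. This is not LevelDB's code: `TwoBufferStore`, `threshold`, and `background_flush` are hypothetical names, plain dicts stand in for the MemTables, and the flush is called explicitly here where a real engine would run it on a background thread.

```python
class TwoBufferStore:
    def __init__(self, threshold=2):
        self.threshold = threshold
        self.active = {}       # mutable MemTable, accepts new writes
        self.immutable = None  # frozen MemTable waiting to be flushed
        self.level0 = []       # flushed sorted runs, oldest first

    def put(self, key, value):
        self.active[key] = value
        # Freeze the full MemTable and swap in an empty one, so writes never
        # wait for the flush itself. If a frozen table is still pending, this
        # sketch simply lets the active table keep growing.
        if len(self.active) >= self.threshold and self.immutable is None:
            self.immutable, self.active = self.active, {}

    def background_flush(self):
        if self.immutable is not None:
            self.level0.append(sorted(self.immutable.items()))
            self.immutable = None

    def get(self, key):
        # Newest data first: active table, then frozen table, then level 0.
        for table in (self.active, self.immutable or {}):
            if key in table:
                return table[key]
        for run in reversed(self.level0):
            for k, v in run:
                if k == key:
                    return v
        return None


store = TwoBufferStore(threshold=2)
store.put("a", 1)
store.put("b", 2)        # triggers the freeze-and-swap
store.put("c", 3)        # accepted immediately by the new active table
print(store.get("a"))    # 1, served from the frozen table before any flush
store.background_flush()
print(store.level0)      # [[('a', 1), ('b', 2)]]
```

The frozen table stays readable until its flush completes, so reads see no gap during the handover.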

HBase adopts the LSM‑Tree idea in its HStore, which pairs an in‑memory MemStore (an ordered structure) with on‑disk HFiles. When the MemStore exceeds a threshold, it is flushed to a new HFile, and multiple HFiles are periodically merged.

ClickHouse’s MergeTree engine also follows LSM‑Tree principles: inserted data is sorted and written to disk as immutable data parts, and background merges combine the parts within a partition into larger ones, using index and offset (mark) files to locate rows.
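How an index file and an offset file can work together is sketched below as a generic sparse index: only every N-th key is indexed, along with the row offset where its block starts. This is a simplified illustration rather than ClickHouse's on-disk format; `GRANULE`, `build_sparse_index`, and `lookup` are hypothetical names, and the sketch assumes unique, sorted keys.

```python
import bisect

GRANULE = 4  # rows covered by one index entry; kept tiny for illustration


def build_sparse_index(sorted_keys, granule=GRANULE):
    """Return (first_key, row_offset) for every granule-th row."""
    return [(sorted_keys[i], i) for i in range(0, len(sorted_keys), granule)]


def lookup(sorted_keys, index, key, granule=GRANULE):
    """Return the row number of key, or None if it is absent."""
    first_keys = [k for k, _ in index]
    g = bisect.bisect_right(first_keys, key) - 1  # last block starting at or before key
    if g < 0:
        return None  # key is smaller than everything stored
    start = index[g][1]  # the offset says where that block begins
    for row in range(start, min(start + granule, len(sorted_keys))):
        if sorted_keys[row] == key:  # scan only this one block
            return row
    return None


keys = list(range(0, 100, 5))     # 0, 5, 10, ..., 95
index = build_sparse_index(keys)  # entries for keys 0, 20, 40, 60, 80
print(lookup(keys, index, 45))    # 9
print(lookup(keys, index, 46))    # None
```

The index stays small enough to hold in memory, at the price of scanning one block per lookup, which suits range-heavy analytical reads better than single-row point queries.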

The article concludes with a comparative summary of LevelDB, HBase, and ClickHouse. All three are optimized for fast writes: their main advantage is high write throughput, and their disadvantages are slower point queries, slower range queries, and occasional compaction overhead.

Tags: Database, LSM Tree, Storage Engine, ClickHouse, HBase, Bloom Filter, LevelDB