
Log-Structured Merge Trees: Overview, History, Modern Design, Optimizations, and Concurrency Control

This article explains the principles, evolution, modern structures, compaction strategies, optimization techniques such as Bloom filters and partitioning, and concurrency and recovery mechanisms of Log-Structured Merge (LSM) trees, which are widely used in contemporary NoSQL storage systems.

Tencent Database Technology

1. Overview

Log-Structured Merge trees (LSM trees) are extensively used in modern NoSQL storage layers such as BigTable, Dynamo, HBase, Cassandra, LevelDB, RocksDB, and AsterixDB. Unlike traditional index structures such as B+ trees, which modify data in place, LSM trees first write data to memory and later flush it to disk via background merge threads. This design offers very high write performance, good space utilization, tunable read/write trade-offs, and simple concurrency control and recovery.

2. History of LSM Trees

Database storage engines typically handle updates via in‑place updates (e.g., B+ trees) or out‑of‑place updates. In‑place updates provide fast reads but cause random I/O, low write throughput, and severe page fragmentation. Out‑of‑place updates, as used by LSM trees, write new data to a new location, leveraging sequential I/O for high write performance and simplifying recovery, at the cost of slower reads because multiple locations must be consulted.

The original LSM design was proposed by O'Neil et al. in 1996. It introduced a rolling merge process that achieves very high write performance while keeping read performance and space utilization within reasonable bounds.

LSM trees consist of a series of components C0, C1, ..., Ck, each a B+ tree. The first component, C0, resides in memory, while the others are on disk with increasing capacities. When a component fills up, it is merged into the next larger component. The original paper showed that, for a fixed number of components, write performance is optimal when the size ratios between all adjacent components are equal, i.e., when capacities grow geometrically.
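
The geometric sizing result can be sketched as follows (a hypothetical helper, not code from the paper; the constant factors are simplified away):

```python
# Sketch: with k components and a fixed size ratio r between adjacent
# components, capacities grow geometrically from the in-memory size c0.
# The 1996 analysis shows merge cost is minimized when every adjacent
# pair of components has the same ratio r.

def component_capacities(c0, ratio, k):
    """Capacities of k components growing geometrically from c0."""
    return [c0 * ratio**i for i in range(k)]

def amortized_merge_cost(ratio, k):
    """Rough write amplification: each record is rewritten about
    `ratio` times per level on its way down through k levels."""
    return k * ratio

print(component_capacities(64, 10, 4))  # [64, 640, 6400, 64000]
```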

3. Modern LSM Trees

3.1 Basic Structure

Modern LSM trees simplify the original design: multiple components are merged into a new file without modifying the old ones. A component may be a B+ tree, a skip-list, or a sorted string table (SSTable). An SSTable contains a chain of data blocks and an index block; data blocks store records in key order, while the index block records the key range of each data block.
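
A point lookup in this layout first consults the index block to locate the single data block that could contain the key. A minimal sketch, assuming a simplified in-memory format (real SSTables are persisted and block-compressed):

```python
import bisect

# Minimal SSTable sketch: data blocks hold sorted (key, value) records;
# the index block records the first key of every data block.

class SSTable:
    def __init__(self, records, block_size=3):
        records = sorted(records)
        self.blocks = [records[i:i + block_size]
                       for i in range(0, len(records), block_size)]
        # Index block: first key of each data block.
        self.index = [blk[0][0] for blk in self.blocks]

    def get(self, key):
        # Binary-search the index to find the one block that may hold key.
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None

sst = SSTable([("a", 1), ("c", 3), ("e", 5), ("g", 7)])
print(sst.get("e"))  # 5
```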

As inserts, updates, and deletes occur, the number of components grows, degrading read performance because queries must check more components. To mitigate this, components on disk are continuously compacted. Two common compaction policies are:

Leveling merge policy: each level holds a single component, and the component at level L is T times larger than that at level L-1. Higher levels are therefore merged less frequently; keeping one component per level reduces the number of components a query must check and improves read performance.

Tiering merge policy: each level holds up to T components; when a level fills, its components are merged into a single new component at the next level. This lowers merge frequency and favors write performance at the cost of reads.
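
The trade-off between the two policies can be summarized with rough cost formulas (a sketch under the usual simplifying assumptions: size ratio T, L levels, constant factors ignored):

```python
# Sketch of the read/write trade-off between leveling and tiering,
# assuming a size ratio T between levels and L levels in total.

def components_checked(policy, levels, T):
    """Worst-case number of components a point lookup may probe."""
    if policy == "leveling":
        return levels            # one component per level
    elif policy == "tiering":
        return levels * T        # up to T components per level
    raise ValueError(policy)

def write_amplification(policy, levels, T):
    """Rough per-record rewrite count as data trickles down all levels."""
    if policy == "leveling":
        return levels * T        # re-merged ~T times into each level
    elif policy == "tiering":
        return levels            # merged once per level
    raise ValueError(policy)

print(components_checked("tiering", 5, 10))   # 50
print(write_amplification("leveling", 5, 10)) # 50
```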

3.2 Optimization Techniques

Two widely used optimizations are Bloom filters and partitioning. A Bloom filter placed above a disk component can quickly reject non‑existent keys for point queries, avoiding unnecessary I/O. When the component is a B+ tree, the filter can also be applied to leaf nodes.
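The key property is one-sided error: a negative answer is definitive, so the disk component can be skipped entirely. A minimal sketch, not the implementation any particular engine uses (the hashing scheme here is an illustrative choice):

```python
import hashlib

# Minimal Bloom filter sketch: k bit positions per key, derived by
# slicing one SHA-256 digest into k 4-byte integers.

class BloomFilter:
    def __init__(self, m_bits=1024, k=3):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, key):
        h = hashlib.sha256(key.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(h[4 * i:4 * i + 4], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means definitely absent; True means "go read the SSTable".
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
```

A point query first asks `might_contain`; only on `True` does it pay the I/O cost of probing the SSTable, at the price of an occasional false positive.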

Partitioning splits each component into multiple smaller SSTables based on key ranges. This reduces the amount of data merged at a time and improves performance for sequential keys and skewed updates. Modern databases such as LevelDB and RocksDB adopt a partitioned leveling merge policy. During a merge from level L to level L+1, only overlapping SSTables in the target level need to be processed, while others remain untouched. LevelDB, for example, selects SSTables to merge using a round‑robin approach.
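Selecting merge inputs in a partitioned leveling policy reduces to a key-range overlap test. A sketch, representing each SSTable as a `(min_key, max_key)` pair:

```python
# Sketch: when an SSTable from level L is pushed down, only the
# level-L+1 SSTables whose key ranges overlap it join the merge;
# the rest of the level is left untouched.

def overlapping(sst, next_level):
    lo, hi = sst
    return [(a, b) for (a, b) in next_level if not (b < lo or a > hi)]

level1_sst = ("f", "m")
level2 = [("a", "e"), ("d", "h"), ("k", "p"), ("q", "z")]
print(overlapping(level1_sst, level2))  # [('d', 'h'), ('k', 'p')]
```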

Partitioning can also be applied to tiering policies, but because each level then contains multiple components with potentially overlapping key ranges, the SSTables must be organized so that the recency order of overlapping keys is preserved. Two organization modes exist: vertical grouping, which places SSTables with overlapping key ranges into the same group so that groups cover disjoint ranges, and horizontal grouping, in which each component is itself range-partitioned into SSTables and every group spans the full key space. The merge algorithm differs accordingly.

3.3 Concurrency Control and Recovery

Depending on the transaction isolation level required, LSM-based systems use locking or multi-version concurrency control (MVCC). MVCC fits naturally because an LSM tree already stores multiple versions of each key, with a metadata file tracking the versions. Each component also maintains a reference count: an operation increments the count before accessing the component and decrements it afterwards, so a component still in use is never deleted prematurely.
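
The reference-counting protocol can be sketched as follows (a hypothetical API; real engines tie the final release to deleting the component's file on disk):

```python
import threading

# Sketch of component reference counting: readers pin a component
# before use; the last release of an obsolete component deletes it.

class Component:
    def __init__(self, name):
        self.name = name
        self.refs = 1                 # held by the current metadata version
        self.obsolete = False
        self.deleted = False
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:
            self.refs += 1

    def release(self):
        with self._lock:
            self.refs -= 1
            if self.refs == 0 and self.obsolete:
                self.deleted = True   # stand-in for removing the file

    def mark_obsolete(self):
        # Called when a compaction supersedes this component: the
        # metadata reference is dropped, but live readers keep it alive.
        self.obsolete = True
        self.release()

c = Component("sst-001")
c.acquire()          # a reader pins the component
c.mark_obsolete()    # compaction replaces it; refs drop 2 -> 1
c.release()          # reader finishes; refs 0 -> safe to delete
```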

Each write is appended to a redo (write-ahead) log before being applied to the in-memory component, and a no-steal buffer policy ensures uncommitted data is never flushed to disk, so no undo log is needed. After a crash, the set of valid disk components must be rebuilt. For non-partitioned LSM trees, timestamps attached to each component's key interval are enough to reconstruct the component list; overlapping intervals are resolved by keeping the most recent component. Partitioned LSM trees instead replay a separate metadata log that records every SSTable addition and deletion, as LevelDB and RocksDB do.
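
Recovery from such a metadata log reduces to replaying add/delete records in order. A sketch, assuming a simplified log format (real engines like LevelDB use a binary MANIFEST with richer records):

```python
# Sketch: the metadata log records every SSTable addition and deletion;
# replaying it from the start reconstructs the live component list.

def recover(log_entries):
    live = set()
    for op, sstable in log_entries:
        if op == "add":
            live.add(sstable)
        elif op == "del":
            live.discard(sstable)
        else:
            raise ValueError(f"corrupt log entry: {op}")
    return live

log = [("add", "sst-1"), ("add", "sst-2"),
       ("del", "sst-1"), ("add", "sst-3")]  # sst-1 was merged into sst-3
print(sorted(recover(log)))  # ['sst-2', 'sst-3']
```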

4. Conclusion

Because of their ultra‑high write performance and good space utilization, LSM tree structures are widely adopted in today’s NoSQL databases. With the rapid development of new hardware such as SSD/NVM, multi‑core CPUs, and Direct I/O, continual optimizations keep emerging, further accelerating the evolution of LSM‑based storage.

5. References

1. https://en.wikipedia.org/wiki/Log-structured_merge-tree

2. https://www.researchgate.net/publication/329772189_LSM-based_Storage_Techniques_A_Survey

3. https://researcher.watson.ibm.com/researcher/files/us-wtan/DiffIndex-EDBT14-CR.pdf

4. http://distributeddatastore.blogspot.com/2013/08/cassandra-sstable-storage-format.html

Written by

Tencent Database Technology

Tencent's Database R&D team supports internal services such as WeChat Pay, WeChat Red Packets, Tencent Advertising, and Tencent Music, and provides external support on Tencent Cloud for TencentDB products like CynosDB, CDB, and TDSQL. This public account aims to promote and share professional database knowledge, growing together with database enthusiasts.
