B‑Tree vs LSM‑Tree: Which Storage Engine Fits Your Database Workload?
This article examines the fundamental differences between B‑Tree and LSM‑Tree storage structures in distributed and relational databases, detailing their write and read paths, performance trade‑offs, update handling, lock conflicts, and high‑availability considerations to help engineers choose the right engine for their workloads.
Introduction
Choosing between B‑Tree and LSM‑Tree storage engines has a direct impact on write latency, read latency, and overall system scalability. Understanding their architectural differences is essential when evaluating distributed relational databases or key‑value stores.
Background – Distributed Databases
Typical distributed relational databases (e.g., MySQL, PostgreSQL, TiDB) still rely on B‑Tree indexes, while many modern distributed NoSQL systems (Cassandra, ScyllaDB, RocksDB, LevelDB) are built on LSM‑Tree structures. The distributed context adds extra layers such as SQL parsing, distributed query planning, and global transaction coordination, which must be considered alongside the storage engine characteristics.
Write Path of LSM‑Tree
LSM‑Tree achieves write‑friendliness through a three‑stage pipeline:
WAL (Write‑Ahead Log) write – All engines append a sequential log (e.g., InnoDB redo log, Cassandra commit log) to durable storage to guarantee recovery after a crash.
In‑memory structure update – The log entry is applied to a MemTable (LSM) or to B‑Tree nodes (B‑Tree). MemTable updates are pure appends; they never require a read‑before‑write because the data is not yet on disk.
Asynchronous flush – The MemTable is periodically compacted into immutable SSTable files. This flush runs in the background, so the client transaction returns before the data reaches disk.
In contrast, B‑Tree writes often trigger a read of the target page when it is not cached, leading to additional random I/O.
Update Handling
LSM‑Tree treats UPDATE as an out‑of‑place INSERT. The new version is written to the MemTable and later merged during compaction, avoiding the read‑modify‑write cycle required by in‑place B‑Tree updates. The 2020 VLDB survey diagram (shown below) contrasts in‑place (B‑Tree) and out‑of‑place (LSM‑Tree) update strategies.
Read Path of B‑Tree
B‑Tree point queries traverse a balanced tree, typically requiring a single disk read (or none if the page is cached). LSM‑Tree reads may need to probe several SSTable levels, but two techniques mitigate the cost:
Bloom filters attached to each SSTable quickly reject non‑existent keys.
Cache hierarchy (block cache, MemTable) often satisfies recent reads without disk access.
Resource Conflicts – Locking in Distributed LSM‑Tree Systems
Lock management differs markedly:
Some LSM‑Tree products (e.g., Cassandra) avoid traditional locks by using timestamp‑based conflict resolution. Writes are appended; the latest timestamp wins during read‑time merge.
Other systems implement lock‑free mechanisms such as Percolator’s three‑column‑family design (data, lock info, version). The lock column stores a short lock record; reads check this column before applying a write.
High‑Availability Overhead
Replication logs such as MySQL BINLOG or Raft logs add an unavoidable write path element. Although not part of the storage engine, they must be flushed before a transaction is considered committed, introducing extra latency comparable to the WAL stage.
Practical Guidance
When evaluating a workload, consider the following heuristics:
Write‑heavy, append‑only workloads (e.g., event logs, click‑stream tables) benefit from LSM‑Tree’s sequential writes and low‑latency commits.
Read‑intensive point‑lookup workloads with strong consistency requirements favor B‑Tree due to fewer disk seeks per query.
Update‑heavy workloads with unique‑key checks or pessimistic locking may erode LSM‑Tree’s advantage because additional reads (for lock checks or uniqueness validation) re‑introduce random I/O.
Distributed environments should factor in extra coordination cost (timestamp ordering, lock propagation) that can offset LSM‑Tree’s write gains.
Empirical benchmarking on the target data size, cache configuration, and compaction policy (size‑tiered vs. leveled) remains the most reliable method to decide between the two structures.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
