Databases 25 min read

B‑Tree vs LSM‑Tree: Which Storage Engine Fits Your Database Workload?

This article examines the fundamental differences between B‑Tree and LSM‑Tree storage structures in distributed and relational databases, detailing their write and read paths, performance trade‑offs, update handling, lock conflicts, and high‑availability considerations to help engineers choose the right engine for their workloads.

dbaplus Community
dbaplus Community
dbaplus Community
B‑Tree vs LSM‑Tree: Which Storage Engine Fits Your Database Workload?

Introduction

Choosing between B‑Tree and LSM‑Tree storage engines has a direct impact on write latency, read latency, and overall system scalability. Understanding their architectural differences is essential when evaluating distributed relational databases or key‑value stores.

Background – Distributed Databases

Typical distributed relational databases (e.g., MySQL, PostgreSQL, TiDB) still rely on B‑Tree indexes, while many modern distributed NoSQL systems (Cassandra, ScyllaDB, RocksDB, LevelDB) are built on LSM‑Tree structures. The distributed context adds extra layers such as SQL parsing, distributed query planning, and global transaction coordination, which must be considered alongside the storage engine characteristics.

Write Path of LSM‑Tree

LSM‑Tree achieves write‑friendliness through a three‑stage pipeline:

WAL (Write‑Ahead Log) write – All engines append a sequential log (e.g., InnoDB redo log, Cassandra commit log) to durable storage to guarantee recovery after a crash.

In‑memory structure update – The log entry is applied to a MemTable (LSM) or to B‑Tree nodes (B‑Tree). MemTable updates are pure appends; they never require a read‑before‑write because the data is not yet on disk.

Asynchronous flush – The MemTable is periodically compacted into immutable SSTable files. This flush runs in the background, so the client transaction returns before the data reaches disk.

In contrast, B‑Tree writes often trigger a read of the target page when it is not cached, leading to additional random I/O.

Update Handling

LSM‑Tree treats UPDATE as an out‑of‑place INSERT. The new version is written to the MemTable and later merged during compaction, avoiding the read‑modify‑write cycle required by in‑place B‑Tree updates. The 2020 VLDB survey diagram (shown below) contrasts in‑place (B‑Tree) and out‑of‑place (LSM‑Tree) update strategies.

Read Path of B‑Tree

B‑Tree point queries traverse a balanced tree, typically requiring a single disk read (or none if the page is cached). LSM‑Tree reads may need to probe several SSTable levels, but two techniques mitigate the cost:

Bloom filters attached to each SSTable quickly reject non‑existent keys.

Cache hierarchy (block cache, MemTable) often satisfies recent reads without disk access.

Resource Conflicts – Locking in Distributed LSM‑Tree Systems

Lock management differs markedly:

Some LSM‑Tree products (e.g., Cassandra) avoid traditional locks by using timestamp‑based conflict resolution. Writes are appended; the latest timestamp wins during read‑time merge.

Other systems implement lock‑free mechanisms such as Percolator’s three‑column‑family design (data, lock info, version). The lock column stores a short lock record; reads check this column before applying a write.

High‑Availability Overhead

Replication logs such as MySQL BINLOG or Raft logs add an unavoidable write path element. Although not part of the storage engine, they must be flushed before a transaction is considered committed, introducing extra latency comparable to the WAL stage.

Practical Guidance

When evaluating a workload, consider the following heuristics:

Write‑heavy, append‑only workloads (e.g., event logs, click‑stream tables) benefit from LSM‑Tree’s sequential writes and low‑latency commits.

Read‑intensive point‑lookup workloads with strong consistency requirements favor B‑Tree due to fewer disk seeks per query.

Update‑heavy workloads with unique‑key checks or pessimistic locking may erode LSM‑Tree’s advantage because additional reads (for lock checks or uniqueness validation) re‑introduce random I/O.

Distributed environments should factor in extra coordination cost (timestamp ordering, lock propagation) that can offset LSM‑Tree’s write gains.

Empirical benchmarking on the target data size, cache configuration, and compaction policy (size‑tiered vs. leveled) remains the most reliable method to decide between the two structures.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

LSM‑TreeWrite PerformanceDatabase StorageB-TreeRead Performance
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.