Why Modern Databases Prefer LSM Trees Over B‑Trees: Hardware, Workloads, and More
Modern databases have largely shifted from B‑tree storage engines to LSM‑tree engines, driven by SSD hardware characteristics, write‑heavy workloads, concurrency advantages, simpler implementation, and evolving application demands. Along the way, this article also touches on Paxos/Raft consensus, common database jargon, and related performance optimizations.
Many modern databases use storage engines based on LSM trees, including RocksDB, CockroachDB, TiDB, FoundationDB, Snowflake, Doris, OceanBase, and InfluxDB, while established systems such as MySQL, PostgreSQL, and Oracle rely on B‑tree variants. The shift is driven by two major factors: hardware evolution and changing usage scenarios.
Hardware Evolution
SSDs have become ubiquitous, offering dramatically higher performance than mechanical disks. However, SSDs cannot perform in‑place updates: they write at page granularity (typically 4 KB) but can only erase at block granularity (often 128 pages, i.e., 512 KB), and a page must be erased before it can be rewritten. Modifying a single byte may therefore involve reading, erasing, and rewriting an entire 512 KB block, causing write amplification far beyond that of HDDs and wearing down the drive's limited erase endurance.
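To put numbers on it, here is a back‑of‑the‑envelope calculation in Python using the page and block sizes quoted above; the sizes are typical examples, and real drives vary.

```python
# Worst-case SSD write amplification, using the sizes quoted above:
# 4 KB pages, 128 pages per erase block. Real drives and FTLs vary.
PAGE_SIZE = 4 * 1024                       # write granularity, bytes
PAGES_PER_BLOCK = 128                      # erase granularity, in pages
BLOCK_SIZE = PAGE_SIZE * PAGES_PER_BLOCK   # 512 KB erase block

logical_write = 1                          # application changes one byte
physical_write = BLOCK_SIZE                # worst case: whole block rewritten

print(f"erase block size: {BLOCK_SIZE // 1024} KB")
print(f"worst-case write amplification: {physical_write // logical_write}x")
```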
The append‑only nature of LSM trees avoids in‑place modifications, aligning well with SSD write characteristics. Additionally, modern CPUs can no longer gain performance solely by increasing clock frequencies, making concurrent, lock‑free data structures more valuable. LSM trees, with their read‑only file segments, naturally support high concurrency without the fine‑grained locking required by B‑trees.
Changing Workloads
The rise of big‑data workloads—driven by Google’s seminal papers on HDFS, MapReduce, and Bigtable—has led to scenarios where write volume far exceeds reads. LSM trees’ append‑only design provides extremely high write throughput, and with appropriate compaction strategies, read performance remains competitive.
Implementation complexity also favors LSM trees: a functional LSM engine can be written relatively easily, whereas a correct B‑tree implementation is considerably more challenging. Open‑source projects like LevelDB and RocksDB (originating from Google and Facebook) have accelerated adoption.
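To make the simplicity claim concrete, here is a minimal toy sketch of the LSM idea in Python: an in‑memory memtable that flushes to immutable, sorted segment files, with reads checking the newest data first. The class name, file layout, and JSON encoding are illustrative inventions, not how LevelDB or RocksDB actually store data.

```python
import json
import os

class TinyLSM:
    """Toy LSM engine: writes land in an in-memory memtable and are
    flushed to immutable, sorted segment files; reads check the
    memtable first, then segments from newest to oldest."""

    def __init__(self, data_dir, memtable_limit=4):
        self.data_dir = data_dir
        self.memtable = {}                  # most recent writes
        self.memtable_limit = memtable_limit
        self.segments = []                  # segment paths, oldest first
        os.makedirs(data_dir, exist_ok=True)

    def put(self, key, value):
        self.memtable[key] = value          # no in-place disk update
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for path in reversed(self.segments):    # newest segment wins
            with open(path) as f:
                segment = json.load(f)
            if key in segment:
                return segment[key]
        return None

    def _flush(self):
        path = os.path.join(self.data_dir, f"seg_{len(self.segments):06d}.json")
        with open(path, "w") as f:
            # sorted by key, like an SSTable; written once, never modified
            json.dump(dict(sorted(self.memtable.items())), f)
        self.segments.append(path)
        self.memtable = {}

db = TinyLSM("/tmp/tiny_lsm")
for i in range(10):
    db.put(f"k{i}", i)
print(db.get("k3"))  # -> 3, served from an on-disk segment
```

Because flushed segments are never modified, readers can scan them without locks, which is the concurrency advantage mentioned earlier.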
Paxos and Raft in Modern Databases
Paxos and Raft are consensus algorithms used primarily for log replication across replicas, ensuring that all nodes apply the same sequence of operations. Unlike asynchronous message queues, these algorithms perform synchronous writes, guaranteeing consistency even when some replicas fail. They define quorum rules (e.g., 2 of 3 replicas) to balance latency and fault tolerance.
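The quorum arithmetic is simple to sketch. The Python helpers below are illustrative, not any real library's API: an entry counts as committed once a majority of replicas has durably acknowledged it.

```python
# Majority-quorum commit logic in the Raft/Paxos style.
# Function names and the ack model are illustrative only.

def majority(n_replicas: int) -> int:
    """Smallest number of replicas that constitutes a quorum."""
    return n_replicas // 2 + 1

def is_committed(acks: int, n_replicas: int) -> bool:
    """A log entry is committed once a majority has acknowledged it."""
    return acks >= majority(n_replicas)

# With 3 replicas, 2 acks suffice: one replica can fail without data loss.
print(majority(3), is_committed(2, 3))   # -> 2 True
# With 5 replicas, the quorum is 3, tolerating 2 failures at higher latency.
print(majority(5), is_committed(2, 5))   # -> 3 False
```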
In practice, systems such as TiDB and OceanBase use Paxos/Raft to achieve high availability and strong consistency across distributed nodes.
Common Database Jargon
Predicate push‑down and projection push‑down refer to moving filter (WHERE) and column‑selection (SELECT) operations from the compute layer down to the storage layer, reducing data movement in distributed architectures.
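As a rough sketch of the idea, the Python below mimics a storage‑side scan that applies the filter and column selection before shipping rows to the compute layer; the table, rows, and function names are all hypothetical.

```python
# Predicate and projection push-down, sketched. Real engines push these
# into the storage layer's scan operator so filtered-out rows and unused
# columns never cross the network.

rows = [
    {"id": 1, "city": "Beijing",  "amount": 120},
    {"id": 2, "city": "Shanghai", "amount": 80},
    {"id": 3, "city": "Beijing",  "amount": 45},
]

def storage_scan(rows, predicate, columns):
    """Runs at the storage node: filter (WHERE) and project (SELECT)
    before any data is shipped to the compute layer."""
    for row in rows:
        if predicate(row):                       # predicate push-down
            yield {c: row[c] for c in columns}   # projection push-down

# Equivalent of: SELECT id, amount FROM orders WHERE city = 'Beijing'
shipped = list(storage_scan(rows, lambda r: r["city"] == "Beijing",
                            ["id", "amount"]))
print(shipped)  # two narrow rows cross the wire instead of three wide ones
```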
Vector engine leverages CPU SIMD instructions to perform operations on whole arrays (vectors) in a single instruction, dramatically speeding up columnar computations.
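For a rough feel of the difference, the NumPy example below contrasts row‑at‑a‑time evaluation with a whole‑column expression; NumPy dispatches such expressions to native loops that are typically SIMD‑vectorized, which is the same principle a vectorized engine applies to column batches.

```python
import numpy as np

# Columnar data: one array per column, as in a column store.
prices = np.array([10.0, 20.0, 30.0, 40.0], dtype=np.float64)
quantities = np.array([3.0, 1.0, 4.0, 2.0], dtype=np.float64)

# Scalar, row-at-a-time style (what a vectorized engine avoids):
row_at_a_time = [p * q for p, q in zip(prices, quantities)]

# Vector-at-a-time style: one expression over whole columns, executed
# by native loops that compilers typically emit SIMD instructions for.
vectorized = prices * quantities

print(row_at_a_time)        # [30.0, 20.0, 120.0, 80.0]
print(vectorized.tolist())  # [30.0, 20.0, 120.0, 80.0]
```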
Bypass describes accessing hardware directly, bypassing the operating system kernel—e.g., using user‑space RDMA or direct disk I/O to avoid kernel buffering and scheduling overhead.
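As one concrete, Linux‑only flavor of bypass, the sketch below opens a file with O_DIRECT so writes skip the kernel page cache. The path is hypothetical, the target filesystem must support O_DIRECT (tmpfs does not), and the buffer and I/O size must be aligned, which is why an anonymous mmap is used to get page‑aligned memory.

```python
import mmap
import os

# Linux-only sketch: O_DIRECT bypasses the kernel page cache. It requires
# the buffer address and I/O size to be block-aligned (commonly 4 KB);
# an anonymous mmap gives us page-aligned memory.
ALIGN = 4096
buf = mmap.mmap(-1, ALIGN)                      # page-aligned buffer
buf.write(b"direct i/o demo".ljust(ALIGN, b"\0"))

# Hypothetical path; must live on a filesystem that supports O_DIRECT.
fd = os.open("/var/tmp/bypass_demo.bin",
             os.O_CREAT | os.O_WRONLY | os.O_DIRECT, 0o644)
try:
    os.write(fd, buf)                           # skips kernel buffering
finally:
    os.close(fd)
```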
In summary, modern hardware characteristics and write‑heavy workloads have driven the adoption of LSM trees, which naturally complement SSD behavior and provide superior concurrency. Consensus protocols like Paxos and Raft ensure consistent log replication across replicas, while terminology such as predicate push‑down, vector engines, and bypass reflects ongoing optimizations in database design.