Can Separating Keys and Values Boost LSM Performance on SSDs?
This article examines the trade‑offs of LSM‑based storage engines on SSDs, highlighting write amplification issues, the benefits of separating keys from values via the WiscKey approach, and the challenges of range queries, garbage collection, and crash consistency.
Background
In recent years, LSM (Log‑Structured Merge‑Tree) storage engines such as LevelDB and RocksDB have become the storage foundation for many distributed components because of their excellent write performance and respectable read performance. Projects like Pika (a Redis‑compatible large‑capacity store) and Zeppelin (a distributed KV store) rely on them.
However, LSM trees have drawbacks in large‑value scenarios: performance degrades as values grow, and the same data is rewritten to disk repeatedly during compaction. LevelDB turns random disk writes into sequential writes, but with SSDs now widespread, it is unclear whether this design still yields significant benefits.
Problem
LSM trees convert random writes into sequential writes, achieving high write throughput at the cost of write amplification: the bytes actually written to disk far exceed the bytes the user wrote. In the worst case, write amplification can reach tens to hundreds of times. Traditional HDDs benefit enormously from sequential writes (up to a thousand‑fold speedup over random writes), but SSDs already handle random writes well, which both shrinks LSM's advantage and accelerates SSD wear.
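To make the cost concrete, here is a back‑of‑the‑envelope estimate of write amplification under leveled compaction. The level count and fan‑out below are illustrative assumptions, not LevelDB's exact behavior:

```python
# Rough write-amplification estimate for a leveled LSM tree.
# Assumption: each level is `fanout` times larger than the previous one,
# and pushing a byte into a level rewrites ~fanout bytes of that level.
def leveled_write_amplification(levels: int, fanout: int) -> int:
    # 1 write for the WAL, 1 for the memtable flush to L0,
    # then ~fanout rewrites for each of the remaining level transitions.
    return 1 + 1 + fanout * (levels - 1)

# With 6 levels and a fan-out of 10, amplification is already 52x,
# consistent with the "tens to hundreds of times" worst case above.
print(leveled_write_amplification(6, 10))
```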
Idea
When data values become large, LSM’s drawbacks become pronounced. The key insight is that LSM only requires ordered keys; values can be stored separately. The proposed solution is key‑value separation:
Append values to a separate log (vLog) and store only the key and the value’s address in the LSM.
Delete operations remove the key from the LSM; the obsolete value is reclaimed later.
Read operations fetch the address from the LSM and retrieve the value from the vLog.
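The three operations above can be sketched as follows. This is a minimal in‑memory illustration, not WiscKey's actual implementation: the LSM index is a plain dict mapping keys to (offset, length) addresses, and the vLog is a byte buffer rather than an on‑disk append‑only file.

```python
import struct

class KVSeparatedStore:
    """Illustrative sketch of WiscKey-style key-value separation."""

    def __init__(self):
        self.lsm = {}             # key -> (offset, length) of the value in the vLog
        self.vlog = bytearray()   # append-only value log

    def put(self, key: str, value: bytes) -> None:
        # Append <key_len, key, value> to the vLog; store only the address in the LSM.
        kbytes = key.encode()
        offset = len(self.vlog)
        self.vlog += struct.pack("<I", len(kbytes)) + kbytes + value
        self.lsm[key] = (offset + 4 + len(kbytes), len(value))

    def get(self, key: str):
        # Fetch the address from the LSM, then read the value from the vLog.
        addr = self.lsm.get(key)
        if addr is None:
            return None
        offset, length = addr
        return bytes(self.vlog[offset:offset + length])

    def delete(self, key: str) -> None:
        # Remove only the key; the stale value stays in the vLog until GC.
        self.lsm.pop(key, None)
```

The key is written into the vLog alongside the value so that garbage collection and crash recovery can later rebuild LSM entries from the log alone.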
Benefits
Avoid moving invalid values during compaction, dramatically reducing read/write amplification.
Significantly shrink the LSM size, improving cache effectiveness.
Challenges
Key‑value separation introduces three main challenges:
Inefficient range queries: Range scans become a mix of sequential reads and multiple random reads. SSD parallel I/O can mitigate this, but the overhead remains.
Value space reclamation: Deleted or expired values remain in the log and must be reclaimed asynchronously. Offline reclamation (mark‑and‑sweep) can cause load spikes, while online reclamation (as in WiscKey) requires careful design.
Crash consistency: Separating keys and values can lead to inconsistent states after a crash. The system must ensure atomicity of key‑value writes and recoverability of the vLog.
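The online reclamation scheme can be sketched as a tail‑to‑head copy loop: read a batch of records from the vLog tail, check each against the LSM to see whether it is still the live version, re‑append live values at the head, and free the space behind the new tail. For illustration the vLog is modeled as a list of (key, value) records and the LSM as a dict of record indices; real systems operate on byte offsets in a file.

```python
# Sketch of WiscKey-style online garbage collection.
# `vlog` is a list of (key, value) records; `lsm` maps key -> index of
# the live record. All names are illustrative, not WiscKey's actual API.
def gc_step(vlog: list, lsm: dict, tail: int, batch: int) -> int:
    """Scan up to `batch` records from the tail; keep live values, drop dead ones.
    Returns the new tail position; space before it can be reclaimed."""
    head = len(vlog)
    new_tail = min(tail + batch, head)
    for i in range(tail, new_tail):
        key, value = vlog[i]
        if lsm.get(key) == i:            # record is still the live version
            vlog.append((key, value))    # move it to the head of the log
            lsm[key] = len(vlog) - 1     # point the LSM at the new copy
        # dead records (deleted or superseded) are simply skipped
    return new_tail
```

Because only a bounded batch is moved per step, reclamation cost is spread over time instead of producing the load spikes of an offline mark‑and‑sweep pass.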
Optimization
Introduce a vLog write buffer that coalesces many short values into one long sequential write, improving disk throughput (reads must check this buffer before the on‑disk vLog). Additionally, periodically persist the vLog head pointer in the LSM; after a crash, recovery starts from the last persisted head and scans forward through the vLog, restoring any key‑value pairs that were durably appended after the checkpoint.
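The write buffer can be sketched as follows; the flush threshold and the in‑memory stand‑in for the on‑disk vLog are illustrative assumptions:

```python
# Sketch of a vLog write buffer that coalesces small values into one
# large sequential write. A real implementation would flush to a file
# descriptor; here `flushed` stands in for the on-disk vLog.
class BufferedVLog:
    def __init__(self, flush_threshold: int = 4096):
        self.buffer = bytearray()
        self.flushed = bytearray()
        self.flush_threshold = flush_threshold

    def append(self, value: bytes) -> int:
        # Logical vLog offset of this value (durable bytes + buffered bytes).
        offset = len(self.flushed) + len(self.buffer)
        self.buffer += value
        if len(self.buffer) >= self.flush_threshold:
            self.flush()                 # one large write instead of many small ones
        return offset

    def flush(self) -> None:
        self.flushed += self.buffer
        self.buffer.clear()
```

A lookup for a recently written value must consult `buffer` first, since the value may not have reached disk yet; this is the price of trading write latency for throughput.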
Summary
WiscKey is not a universal solution; it incurs extra complexity and performance penalties for small values. It is most effective when values are much larger than keys, a scenario that aligns with our Zeppelin project’s S3‑backed storage needs, making WiscKey a promising direction for future engine development.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.