Understanding LevelDB Architecture, Read/Write Flow, and Compaction Process
LevelDB buffers recent writes in an in-memory Memtable that is later flushed to immutable, disk-based SSTables. Writes are first logged for durability, then batched and applied through a writer queue; reads check the Memtable, the Immutable Memtable, and finally the SSTables. Background compactions merge tables to improve read performance and reclaim space.
LevelDB is a high‑performance single‑node storage engine based on the LSM‑Tree design. It optimizes write throughput by sacrificing some read speed, making it suitable for write‑heavy, read‑light workloads.
1. Overall Architecture
The engine consists of in‑memory structures (Memtable and Immutable Memtable) and on‑disk files (SSTable, Manifest, Log, and Current). Data is first written to the Log for durability, then to the Memtable. When the Memtable reaches a size threshold, it becomes Immutable and is later flushed to an SSTable on disk.
2. In‑Memory Cache Structures
Memtable: a writable SkipList that buffers recent writes.
Immutable Memtable: a read-only SkipList created when the Memtable fills up; a background thread compacts it into an SSTable.
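Conceptually, both structures behave like a size-tracked sorted map. Below is a minimal sketch of that behavior; it uses std::map as a stand-in for LevelDB's arena-backed SkipList, and ToyMemtable is an illustrative name, not a LevelDB type:

```cpp
#include <cstddef>
#include <map>
#include <string>

// Minimal stand-in for a Memtable. LevelDB uses an arena-backed SkipList,
// but any sorted map gives the same interface for illustration.
class ToyMemtable {
 public:
  void Put(const std::string& key, const std::string& value) {
    approx_bytes_ += key.size() + value.size();
    table_[key] = value;
  }
  bool Get(const std::string& key, std::string* value) const {
    auto it = table_.find(key);
    if (it == table_.end()) return false;
    *value = it->second;
    return true;
  }
  // Mirrors the size check that turns a full Memtable into an Immutable one.
  bool ShouldFlush(std::size_t threshold) const {
    return approx_bytes_ >= threshold;
  }

 private:
  std::map<std::string, std::string> table_;
  std::size_t approx_bytes_ = 0;
};
```

When ShouldFlush returns true, the real engine swaps in a fresh Memtable and hands the old one to the background flush thread.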
3. On‑Disk Files
SSTable: ordered key-value files stored on disk. Level-0 files are created directly from Immutable Memtables and may have overlapping key ranges; higher levels contain non-overlapping, sorted files.
Manifest: records metadata (key range, file size, etc.) for each SSTable.
Log: a write-ahead log that guarantees durability across crashes.
Current: points to the active Manifest file.
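On startup, recovery uses the Current file to locate the active Manifest: the file contains nothing but the Manifest's name. A sketch of that lookup (ReadCurrentFile is an illustrative helper, not a LevelDB function):

```cpp
#include <fstream>
#include <string>

// The CURRENT file holds just the name of the live Manifest (plus a
// trailing newline); recovery reads it to know which Manifest to replay.
std::string ReadCurrentFile(const std::string& dbname) {
  std::ifstream in(dbname + "/CURRENT");
  std::string manifest;
  std::getline(in, manifest);  // e.g. "MANIFEST-000005"
  return manifest;
}
```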
4. Write Path
The public write APIs are Put and Delete. Internally, each call builds a WriteBatch, which is wrapped in a Writer object; the Writer also carries a condition variable (bound to the database mutex) used to synchronize the writer queue.
Status DB::Put(const WriteOptions& opt, const Slice& key, const Slice& value) {
  WriteBatch batch;
  batch.Put(key, value);
  // delegates to DBImpl::Write
  return Write(opt, &batch);
}
struct DBImpl::Writer {
  Status status;
  WriteBatch* batch;
  bool sync;
  bool done;
  port::CondVar cv;

  explicit Writer(port::Mutex* mu) : cv(mu) {}
};

Writers are enqueued in writers_. The thread at the head of the queue performs the following steps:
Acquire space via MakeRoomForWrite , converting a full Memtable to Immutable and allocating a new Memtable.
Batch together subsequent Writers from the queue ( BuildBatchGroup ) to perform a single disk write.
Append the batch to the Log and insert it into the Memtable.
Wake up waiting threads and remove processed Writers from the queue.
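The grouping in step 2 is what lets one disk write serve many callers. A hedged sketch of the idea (Batch is a stand-in for LevelDB's WriteBatch, and BuildBatchGroupSketch is an illustrative name; the real BuildBatchGroup also caps the group size and respects sync flags):

```cpp
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Stand-in for WriteBatch: an ordered list of key/value updates.
using Batch = std::vector<std::pair<std::string, std::string>>;

// The writer at the front of the queue absorbs the batches queued behind
// it, so a single log append covers all of them.
Batch BuildBatchGroupSketch(const std::deque<const Batch*>& writers) {
  Batch group;
  for (const Batch* b : writers) {
    group.insert(group.end(), b->begin(), b->end());
  }
  return group;  // caller appends this once to the log, then to the Memtable
}
```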
Status DBImpl::Write(const WriteOptions& options, WriteBatch* updates) {
  Writer w(&mutex_);
  w.batch = updates;
  w.sync = options.sync;
  w.done = false;

  MutexLock l(&mutex_);
  writers_.push_back(&w);
  while (!w.done && &w != writers_.front()) {
    w.cv.Wait();
  }
  if (w.done) return w.status;
  // ... (write logic) ...
}

5. Read Path
Reads follow a simple hierarchy: Memtable → Immutable Memtable → SSTables. The engine first checks the current Memtable, then the Immutable one, and finally searches the SSTables level by level. Within a sorted level, a binary search locates the single candidate file; at Level-0, where files may overlap, every overlapping file must be examined, newest first.
// Inside DBImpl::Get: the mutex is released while the lookups run.
mutex_.Unlock();
LookupKey lkey(key, snapshot);
if (mem->Get(lkey, value, &s)) {
  // found in the Memtable
} else if (imm != nullptr && imm->Get(lkey, value, &s)) {
  // found in the Immutable Memtable
} else {
  s = current->Get(options, lkey, value, &stats);  // search SSTables
  have_stat_update = true;
}
mutex_.Lock();

6. Compaction Process
Compaction reorganizes data to improve read efficiency and reclaim space. There are two types:
Minor compaction : flushes an Immutable Memtable to a Level‑0 SSTable. It is triggered when the Memtable size exceeds a configured threshold.
Major compaction : merges SSTables across levels, reducing file count and eliminating deleted/overwritten entries. It is triggered based on file count (Level‑0) or total size (higher levels) and also on excessive seek misses.
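The two triggers can be sketched as simple threshold checks; the constants mirror LevelDB's defaults (a 4 MB write buffer and 4 files at Level-0), while the function names are illustrative, not LevelDB's:

```cpp
#include <cstddef>

constexpr std::size_t kWriteBufferSize = 4 * 1024 * 1024;  // minor compaction
constexpr int kL0_CompactionTrigger = 4;                   // major, Level-0

// Minor compaction: the Memtable has grown past the write buffer size.
bool NeedsMinorCompaction(std::size_t memtable_bytes) {
  return memtable_bytes >= kWriteBufferSize;
}

// Major compaction at Level-0: too many files, each of which a read
// may have to consult because their key ranges can overlap.
bool NeedsL0MajorCompaction(int level0_files) {
  return level0_files >= kL0_CompactionTrigger;
}
```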
Compaction scoring determines which level to compact:
Level‑0 score = number_of_files / 4.
Level‑i score = total_file_size / (10^i) MB.
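The per-level budget in the Level-i formula grows tenfold per level, starting at 10 MB for Level-1. A sketch of that computation (MaxBytesForLevelSketch is an illustrative name for what VersionSet's MaxBytesForLevel computes):

```cpp
// Per-level size budget: 10 MB at Level-1, multiplied by 10 for each
// deeper level, i.e. 10^level MB -- the denominator of the Level-i score.
double MaxBytesForLevelSketch(int level) {
  double result = 10.0 * 1048576.0;  // 10 MB for Level-1
  while (level > 1) {
    result *= 10.0;
    --level;
  }
  return result;
}
```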
void VersionSet::Finalize(Version* v) {
  int best_level = -1;
  double best_score = -1;
  for (int level = 0; level < config::kNumLevels - 1; level++) {
    double score;
    if (level == 0) {
      // Level-0 is scored by file count, not bytes.
      score = v->files_[level].size() /
              static_cast<double>(config::kL0_CompactionTrigger);
    } else {
      // Higher levels are scored by total bytes relative to the level's budget.
      const uint64_t level_bytes = TotalFileSize(v->files_[level]);
      score = static_cast<double>(level_bytes) / MaxBytesForLevel(options_, level);
    }
    if (score > best_score) {
      best_level = level;
      best_score = score;
    }
  }
  v->compaction_level_ = best_level;
  v->compaction_score_ = best_score;
}

During major compaction, overlapping files from the selected level and the next level are merged using a multi-way iterator to produce a new, sorted set of SSTables.
Iterator* VersionSet::MakeInputIterator(Compaction* c) {
  ReadOptions options;
  // Level-0 inputs may overlap, so each file needs its own iterator;
  // for sorted levels, one two-level iterator per input set suffices.
  const int space = (c->level() == 0 ? c->inputs_[0].size() + 1 : 2);
  Iterator** list = new Iterator*[space];
  int num = 0;
  for (int which = 0; which < 2; which++) {
    if (!c->inputs_[which].empty()) {
      if (c->level() + which == 0) {
        const std::vector<FileMetaData*>& files = c->inputs_[which];
        for (size_t i = 0; i < files.size(); i++) {
          list[num++] = table_cache_->NewIterator(options, files[i]->number,
                                                  files[i]->file_size);
        }
      } else {
        list[num++] = NewTwoLevelIterator(
            new Version::LevelFileNumIterator(icmp_, &c->inputs_[which]),
            &GetFileIterator, table_cache_, options);
      }
    }
  }
  // ... combine iterators with a merging iterator ...
}

The article concludes with diagrams illustrating the write queue, the read hierarchy, and the compaction flow.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.