Understanding LevelDB Architecture, Read/Write Flow, and Compaction Process
LevelDB buffers recent writes in an in-memory Memtable that is later flushed to immutable, disk-based SSTables. Writes are first logged for durability, then batched and applied through a writer queue; reads check the Memtable, the Immutable Memtable, and finally the SSTables. Background compactions merge tables to improve read performance and reclaim space.
LevelDB is a high‑performance single‑node storage engine based on the LSM‑Tree design. It optimizes write throughput by sacrificing some read speed, making it suitable for write‑heavy, read‑light workloads.
1. Overall Architecture
The engine consists of in‑memory structures (Memtable and Immutable Memtable) and on‑disk files (SSTable, Manifest, Log, and Current). Data is first written to the Log for durability, then to the Memtable. When the Memtable reaches a size threshold, it becomes Immutable and is later flushed to an SSTable on disk.
2. In‑Memory Cache Structures
Memtable: a writable SkipList that buffers recent writes.
Immutable Memtable: a read-only SkipList created when the Memtable fills up; a background thread compacts it into an SSTable.
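Conceptually, both structures behave like a size-tracked sorted map. Below is a minimal sketch of that behavior; it uses std::map as a stand-in for LevelDB's arena-backed SkipList, and ToyMemtable is an illustrative name, not a LevelDB type:

```cpp
#include <cstddef>
#include <map>
#include <string>

// Minimal stand-in for a Memtable. LevelDB uses an arena-backed SkipList,
// but any sorted map gives the same interface for illustration.
class ToyMemtable {
 public:
  void Put(const std::string& key, const std::string& value) {
    approx_bytes_ += key.size() + value.size();
    table_[key] = value;
  }
  bool Get(const std::string& key, std::string* value) const {
    auto it = table_.find(key);
    if (it == table_.end()) return false;
    *value = it->second;
    return true;
  }
  // Mirrors the size check that turns a full Memtable into an Immutable one.
  bool ShouldFlush(std::size_t threshold) const {
    return approx_bytes_ >= threshold;
  }

 private:
  std::map<std::string, std::string> table_;
  std::size_t approx_bytes_ = 0;
};
```

When ShouldFlush returns true, the real engine swaps in a fresh Memtable and hands the old one to the background flush thread.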
3. On‑Disk Files
SSTable: ordered key-value files stored on disk. Level-0 files are created directly from Immutable Memtables and may have overlapping key ranges; higher levels contain non-overlapping, sorted files.
Manifest: records metadata (key range, file size, etc.) for each SSTable.
Log: a write-ahead log that guarantees durability across crashes.
Current: points to the active Manifest file.
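On startup, recovery uses the Current file to locate the active Manifest: the file contains nothing but the Manifest's name. A sketch of that lookup (ReadCurrentFile is an illustrative helper, not a LevelDB function):

```cpp
#include <fstream>
#include <string>

// The CURRENT file holds just the name of the live Manifest (plus a
// trailing newline); recovery reads it to know which Manifest to replay.
std::string ReadCurrentFile(const std::string& dbname) {
  std::ifstream in(dbname + "/CURRENT");
  std::string manifest;
  std::getline(in, manifest);  // e.g. "MANIFEST-000005"
  return manifest;
}
```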
4. Write Path
The public write APIs are Put and Delete. Internally, each call builds a WriteBatch, which is wrapped in a Writer object; the Writer also carries a condition variable (bound to the database mutex) used to synchronize the writer queue.
Status DB::Put(const WriteOptions& opt, const Slice& key, const Slice& value) {
  WriteBatch batch;
  batch.Put(key, value);
  // delegates to DBImpl::Write
  return Write(opt, &batch);
}
struct DBImpl::Writer {
  Status status;
  WriteBatch* batch;
  bool sync;
  bool done;
  port::CondVar cv;

  explicit Writer(port::Mutex* mu) : cv(mu) {}
};

Writers are enqueued in writers_. The thread at the head of the queue performs the following steps:
Acquire space via MakeRoomForWrite , converting a full Memtable to Immutable and allocating a new Memtable.
Batch together subsequent Writers from the queue ( BuildBatchGroup ) to perform a single disk write.
Append the batch to the Log and insert it into the Memtable.
Wake up waiting threads and remove processed Writers from the queue.
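The grouping in step 2 is what lets one disk write serve many callers. A hedged sketch of the idea (Batch is a stand-in for LevelDB's WriteBatch, and BuildBatchGroupSketch is an illustrative name; the real BuildBatchGroup also caps the group size and respects sync flags):

```cpp
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Stand-in for WriteBatch: an ordered list of key/value updates.
using Batch = std::vector<std::pair<std::string, std::string>>;

// The writer at the front of the queue absorbs the batches queued behind
// it, so a single log append covers all of them.
Batch BuildBatchGroupSketch(const std::deque<const Batch*>& writers) {
  Batch group;
  for (const Batch* b : writers) {
    group.insert(group.end(), b->begin(), b->end());
  }
  return group;  // caller appends this once to the log, then to the Memtable
}
```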
Status DBImpl::Write(const WriteOptions& options, WriteBatch* updates) {
  Writer w(&mutex_);
  w.batch = updates;
  w.sync = options.sync;
  w.done = false;

  MutexLock l(&mutex_);
  writers_.push_back(&w);
  while (!w.done && &w != writers_.front()) {
    w.cv.Wait();
  }
  if (w.done) return w.status;
  // ... (write logic) ...
}

5. Read Path
Reads follow a simple hierarchy: Memtable → Immutable Memtable → SSTables. The engine first checks the current Memtable, then the Immutable one, and finally searches the SSTables level by level. Within a sorted level, a binary search locates the single candidate file; at Level-0, where files may overlap, every overlapping file must be examined, newest first.
// Inside DBImpl::Get: the mutex is released while the lookups run.
mutex_.Unlock();
LookupKey lkey(key, snapshot);
if (mem->Get(lkey, value, &s)) {
  // found in the Memtable
} else if (imm != nullptr && imm->Get(lkey, value, &s)) {
  // found in the Immutable Memtable
} else {
  s = current->Get(options, lkey, value, &stats);  // search SSTables
  have_stat_update = true;
}
mutex_.Lock();

6. Compaction Process
Compaction reorganizes data to improve read efficiency and reclaim space. There are two types:
Minor compaction : flushes an Immutable Memtable to a Level‑0 SSTable. It is triggered when the Memtable size exceeds a configured threshold.
Major compaction : merges SSTables across levels, reducing file count and eliminating deleted/overwritten entries. It is triggered based on file count (Level‑0) or total size (higher levels) and also on excessive seek misses.
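The two triggers can be sketched as simple threshold checks; the constants mirror LevelDB's defaults (a 4 MB write buffer and 4 files at Level-0), while the function names are illustrative, not LevelDB's:

```cpp
#include <cstddef>

constexpr std::size_t kWriteBufferSize = 4 * 1024 * 1024;  // minor compaction
constexpr int kL0_CompactionTrigger = 4;                   // major, Level-0

// Minor compaction: the Memtable has grown past the write buffer size.
bool NeedsMinorCompaction(std::size_t memtable_bytes) {
  return memtable_bytes >= kWriteBufferSize;
}

// Major compaction at Level-0: too many files, each of which a read
// may have to consult because their key ranges can overlap.
bool NeedsL0MajorCompaction(int level0_files) {
  return level0_files >= kL0_CompactionTrigger;
}
```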
Compaction scoring determines which level to compact:
Level‑0 score = number_of_files / 4.
Level‑i score = total_file_size / (10^i) MB.
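The per-level budget in the Level-i formula grows tenfold per level, starting at 10 MB for Level-1. A sketch of that computation (MaxBytesForLevelSketch is an illustrative name for what VersionSet's MaxBytesForLevel computes):

```cpp
// Per-level size budget: 10 MB at Level-1, multiplied by 10 for each
// deeper level, i.e. 10^level MB -- the denominator of the Level-i score.
double MaxBytesForLevelSketch(int level) {
  double result = 10.0 * 1048576.0;  // 10 MB for Level-1
  while (level > 1) {
    result *= 10.0;
    --level;
  }
  return result;
}
```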
void VersionSet::Finalize(Version* v) {
  int best_level = -1;
  double best_score = -1;
  for (int level = 0; level < config::kNumLevels - 1; level++) {
    double score;
    if (level == 0) {
      // Level-0 is scored by file count, not bytes.
      score = v->files_[level].size() /
              static_cast<double>(config::kL0_CompactionTrigger);
    } else {
      // Higher levels are scored by total bytes relative to the level's budget.
      const uint64_t level_bytes = TotalFileSize(v->files_[level]);
      score = static_cast<double>(level_bytes) / MaxBytesForLevel(options_, level);
    }
    if (score > best_score) {
      best_level = level;
      best_score = score;
    }
  }
  v->compaction_level_ = best_level;
  v->compaction_score_ = best_score;
}

During major compaction, overlapping files from the selected level and the next level are merged using a multi-way iterator to produce a new, sorted set of SSTables.
Iterator* VersionSet::MakeInputIterator(Compaction* c) {
  ReadOptions options;
  // Level-0 inputs may overlap, so each file needs its own iterator;
  // for sorted levels, one two-level iterator per input set suffices.
  const int space = (c->level() == 0 ? c->inputs_[0].size() + 1 : 2);
  Iterator** list = new Iterator*[space];
  int num = 0;
  for (int which = 0; which < 2; which++) {
    if (!c->inputs_[which].empty()) {
      if (c->level() + which == 0) {
        const std::vector<FileMetaData*>& files = c->inputs_[which];
        for (size_t i = 0; i < files.size(); i++) {
          list[num++] = table_cache_->NewIterator(options, files[i]->number,
                                                  files[i]->file_size);
        }
      } else {
        list[num++] = NewTwoLevelIterator(
            new Version::LevelFileNumIterator(icmp_, &c->inputs_[which]),
            &GetFileIterator, table_cache_, options);
      }
    }
  }
  // ... combine iterators with a merging iterator ...
}

The article concludes with diagrams illustrating the write queue, the read hierarchy, and the compaction flow.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.