Unlocking HTAP: Inside TiDB, TiKV, RocksDB, and LevelDB Architecture
This article introduces TiDB’s hybrid transactional‑analytical processing capabilities, explains TiKV’s region‑based storage built on RocksDB, and provides a detailed overview of RocksDB and LevelDB architectures, including their data structures, write‑read flows, and key optimizations for modern distributed database systems.
TiDB Introduction
TiDB is a hybrid transactional and analytical processing (HTAP) distributed database that supports online transaction processing (OLTP) and online analytical processing (OLAP) in a single system. It offers horizontal scalability, financial‑grade high availability through Multi‑Raft replication, real‑time HTAP, cloud‑native deployment, and MySQL 5.7 compatibility.
Overall Architecture
TiKV
TiKV is a persistent distributed key‑value store that provides transactional capabilities. Internally it uses RocksDB for on‑disk storage. Data is partitioned into Regions, each covering a key range, and a TiKV node manages multiple Regions.
Regions are distributed evenly across the cluster by the Placement Driver (PD), which also serves as the cluster's timestamp oracle (TSO), providing a single source of time. Each Region has multiple replicas that form a Raft group for consistency.
With MVCC, a Region stores multiple versions of each key:
```
Key1_Version3 -> Value
Key1_Version2 -> Value
Key1_Version1 -> Value
…
Key2_Version4 -> Value
Key2_Version3 -> Value
Key2_Version2 -> Value
Key2_Version1 -> Value
…
KeyN_Version2 -> Value
KeyN_Version1 -> Value
…
```
RocksDB Introduction
RocksDB is a high‑performance key‑value storage engine developed by Facebook based on LevelDB. It is written in C++ with Java bindings, and serves as the storage engine for TiKV and many other systems.
RocksDB ReadMe
RocksDB targets flash and RAM, using an LSM tree to balance write amplification, read amplification, and space amplification. It supports multi‑threaded compaction and memtable inserts, and can handle terabytes of data in a single instance.
Improvements over LevelDB
Multithreaded compaction
Multithreaded memtable inserts
Reduced DB mutex holding
Optimized level‑based and universal compaction
Prefix bloom filter
Single bloom filter for the whole SST file
Write lock optimization
Improved Iter::Prev() performance
Fewer comparator calls during SkipList searches
Allocate memtable memory using huge pages
LevelDB Structure and Read/Write Process
LevelDB Introduction
LevelDB is a lightweight Google‑developed key‑value store with ordered keys, customizable comparators, atomic batch writes, snapshot reads, bidirectional iteration, and built‑in compression.
Keys and values are arbitrary byte arrays
Data is stored in key order
Custom comparators can be provided
Supports atomic batch writes
Supports snapshot reads
Supports bidirectional iteration
Provides data compression with LSM design
LevelDB Main Structure
LevelDB is an embedded library: it has no network component, so any server layer must be added by the user.
In Memory
Memtable : An in‑memory skip‑list storing ordered key‑value pairs, default size 4 MB, readable and writable.
Immutable Memtable : A read‑only Memtable created when the active Memtable exceeds the size limit; it is flushed to an SSTable in the background.
TableCache and BlockCache : Two‑level caching—TableCache holds SSTable metadata, while BlockCache stores actual data blocks, enabling efficient memory utilization and fine‑grained hot‑spot caching.
On Disk
SSTable (Sorted String Table) : Immutable, ordered on‑disk files originally from Google BigTable, optimized for sequential reads.
Log : Write‑ahead log (WAL) where writes are first appended before being applied to Memtable.
Manifest : Records the distribution of SST files across levels and their key ranges.
Current : Stores the name of the active Manifest file.
LevelDB Read/Write Process
Write Process
LevelDB provides three write APIs: Put, Delete, and Write (batch). Put and Delete are thin wrappers that build a one‑entry WriteBatch and call Write, so every write goes through the same batched path.
```cpp
// Implementations of the DB interface
Status Put(const WriteOptions&, const Slice& key, const Slice& value) override;
Status Delete(const WriteOptions&, const Slice& key) override;
Status Write(const WriteOptions& options, WriteBatch* updates) override;
Status Get(const ReadOptions& options, const Slice& key, std::string* value) override;
```
Deletes are recorded as tombstone entries and physically removed during major compaction. Keys are stored as an InternalKey containing the user key, a sequence number (version), and the operation type.
WriteBatch groups multiple operations, encoding them with variable‑length fields to save space.
```cpp
Status DBImpl::Write(const WriteOptions& options, WriteBatch* updates) {
  Writer w(&mutex_);
  w.batch = updates;
  w.sync = options.sync;
  w.done = false;

  MutexLock l(&mutex_);
  writers_.push_back(&w);
  // Wait until this writer reaches the front of the queue, or until a
  // front writer has already committed this batch as part of a group.
  while (!w.done && &w != writers_.front()) {
    w.cv.Wait();
  }
  if (w.done) {
    return w.status;
  }

  // May block, switch memtables, or trigger compaction; a nullptr batch
  // means "just make room".
  Status status = MakeRoomForWrite(updates == nullptr);
  uint64_t last_sequence = versions_->LastSequence();
  Writer* last_writer = &w;
  if (status.ok() && updates != nullptr) {
    // Merge queued writers into one batch to amortize the log write.
    WriteBatch* write_batch = BuildBatchGroup(&last_writer);
    WriteBatchInternal::SetSequence(write_batch, last_sequence + 1);
    last_sequence += WriteBatchInternal::Count(write_batch);
    {
      // The log append and memtable insert run without the mutex; the
      // front-of-queue protocol above keeps writers serialized.
      mutex_.Unlock();
      status = log_->AddRecord(WriteBatchInternal::Contents(write_batch));
      if (status.ok() && options.sync) {
        status = logfile_->Sync();
      }
      if (status.ok()) {
        status = WriteBatchInternal::InsertInto(write_batch, mem_);
      }
      mutex_.Lock();
    }
    versions_->SetLastSequence(last_sequence);
  }

  // Wake every writer whose batch was committed as part of this group.
  while (true) {
    Writer* ready = writers_.front();
    writers_.pop_front();
    if (ready != &w) {
      ready->status = status;
      ready->done = true;
      ready->cv.Signal();
    }
    if (ready == last_writer) break;
  }
  if (!writers_.empty()) {
    writers_.front()->cv.Signal();
  }
  return status;
}
```
Read Process
A read first checks the active Memtable, then the Immutable Memtable, and finally the SSTables from level 0 upward, consulting the TableCache and BlockCache along the way; the first entry matching the lookup key at the read snapshot wins.
```cpp
Status DBImpl::Get(const ReadOptions& options, const Slice& key,
                   std::string* value) {
  Status s;
  MutexLock l(&mutex_);
  // Read at an explicit snapshot if one was supplied, otherwise at the
  // latest committed sequence number.
  SequenceNumber snapshot =
      options.snapshot != nullptr
          ? static_cast<const SnapshotImpl*>(options.snapshot)->sequence_number()
          : versions_->LastSequence();

  MemTable* mem = mem_;
  MemTable* imm = imm_;
  Version* current = versions_->current();
  mem->Ref();
  if (imm != nullptr) imm->Ref();
  current->Ref();

  Version::GetStats stats;
  {
    // Unlock while reading from the memtables and files.
    mutex_.Unlock();
    LookupKey lkey(key, snapshot);
    if (mem->Get(lkey, value, &s)) {
      // Found in the active memtable.
    } else if (imm != nullptr && imm->Get(lkey, value, &s)) {
      // Found in the immutable memtable.
    } else {
      // Fall through to the SSTables, level 0 upward.
      s = current->Get(options, lkey, value, &stats);
    }
    mutex_.Lock();
  }

  mem->Unref();
  if (imm != nullptr) imm->Unref();
  current->Unref();
  return s;
}
```
Conclusion
This article presented an overview of TiDB, TiKV, RocksDB, and LevelDB, highlighting their architectures, data structures, and read/write mechanisms. Understanding these foundational storage engines provides valuable insight into modern distributed databases and KV storage systems.
Trip Tech Team
Trip.com technology team, focused on internationalization and localization solutions.