Unlocking HTAP: Inside TiDB, TiKV, RocksDB, and LevelDB Architecture
This article introduces TiDB’s hybrid transactional‑analytical processing capabilities, explains TiKV’s region‑based storage built on RocksDB, and provides a detailed overview of RocksDB and LevelDB architectures, including their data structures, write‑read flows, and key optimizations for modern distributed database systems.
TiDB Introduction
TiDB is a hybrid transactional and analytical processing (HTAP) distributed database that supports online transaction processing (OLTP) and online analytical processing (OLAP) in a single system. It offers horizontal scalability, financial‑grade high availability through Multi‑Raft replication, real‑time HTAP, cloud‑native deployment, and MySQL 5.7 compatibility.
Overall Architecture
TiKV
TiKV is a persistent distributed key‑value store that provides transactional capabilities. Internally it uses RocksDB for on‑disk storage. Data is partitioned into Regions, each covering a key range, and a TiKV node manages multiple Regions.
Regions are distributed evenly across the cluster by the Placement Driver (PD), which also serves as the cluster's timestamp oracle (TSO), providing a single source of time. Each Region has multiple replicas that form a Raft group for consistency.
With MVCC, a Region stores multiple versions of each key:
```
Key1_Version3 -> Value
Key1_Version2 -> Value
Key1_Version1 -> Value
…
Key2_Version4 -> Value
Key2_Version3 -> Value
Key2_Version2 -> Value
Key2_Version1 -> Value
…
KeyN_Version2 -> Value
KeyN_Version1 -> Value
…
```
RocksDB Introduction
RocksDB is a high‑performance key‑value storage engine developed by Facebook based on LevelDB. It is written in C++ with Java bindings, and serves as the storage engine for TiKV and many other systems.
RocksDB ReadMe
RocksDB targets flash and RAM, using an LSM tree to balance write amplification, read amplification, and space amplification. It supports multi‑threaded compaction and memtable inserts, and can handle terabytes of data in a single instance.
Improvements over LevelDB
Multithreaded compaction
Multithreaded memtable inserts
Reduced DB mutex holding
Optimized level‑based and universal compaction
Prefix bloom filter
Single bloom filter for the whole SST file
Write lock optimization
Improved Iter::Prev() performance
Fewer comparator calls during SkipList searches
Allocate memtable memory using huge pages
LevelDB Structure and Read/Write Process
LevelDB Introduction
LevelDB is a lightweight Google‑developed key‑value store with ordered keys, customizable comparators, atomic batch writes, snapshot reads, bidirectional iteration, and built‑in compression.
Keys and values are arbitrary byte arrays
Data is stored in key order
Custom comparators can be provided
Supports atomic batch writes
Supports snapshot reads
Supports bidirectional iteration
Provides data compression with LSM design
LevelDB Main Structure
LevelDB is an embedded library: it has no network component, so any server layer must be added by the user.
In Memory
Memtable : An in‑memory skip‑list storing ordered key‑value pairs, default size 4 MB, readable and writable.
Immutable Memtable : A read‑only Memtable created when the active Memtable exceeds the size limit; it is flushed to an SSTable in the background.
TableCache and BlockCache : Two‑level caching—TableCache holds SSTable metadata, while BlockCache stores actual data blocks, enabling efficient memory utilization and fine‑grained hot‑spot caching.
On Disk
SSTable (Sorted String Table) : Immutable, ordered on‑disk files originally from Google BigTable, optimized for sequential reads.
Log : Write‑ahead log (WAL) where writes are first appended before being applied to Memtable.
Manifest : Records the distribution of SST files across levels and their key ranges.
Current : Stores the name of the active Manifest file.
LevelDB Read/Write Process
Write Process
LevelDB provides three write APIs: Put, Delete, and Write (batch). Put and Delete are thin wrappers that build a one‑entry WriteBatch and call Write, so every write goes through the same batched path.
```cpp
// Implementations of the DB interface
Status Put(const WriteOptions&, const Slice& key, const Slice& value) override;
Status Delete(const WriteOptions&, const Slice& key) override;
Status Write(const WriteOptions& options, WriteBatch* updates) override;
Status Get(const ReadOptions& options, const Slice& key, std::string* value) override;
```
Deletes are recorded as tombstone entries and physically removed during major compaction. Keys are stored as an InternalKey containing the user key, a sequence number (version), and the operation type.
WriteBatch groups multiple operations, encoding them with variable‑length fields to save space.
```cpp
Status DBImpl::Write(const WriteOptions& options, WriteBatch* updates) {
  Writer w(&mutex_);
  w.batch = updates;
  w.sync = options.sync;
  w.done = false;

  MutexLock l(&mutex_);
  writers_.push_back(&w);
  // Wait until this writer reaches the front of the queue, or until a
  // front writer has already committed this batch as part of a group.
  while (!w.done && &w != writers_.front()) {
    w.cv.Wait();
  }
  if (w.done) {
    return w.status;
  }

  // May block, switch memtables, or trigger compaction; a nullptr batch
  // means "just make room".
  Status status = MakeRoomForWrite(updates == nullptr);
  uint64_t last_sequence = versions_->LastSequence();
  Writer* last_writer = &w;
  if (status.ok() && updates != nullptr) {
    // Merge queued writers into one batch to amortize the log write.
    WriteBatch* write_batch = BuildBatchGroup(&last_writer);
    WriteBatchInternal::SetSequence(write_batch, last_sequence + 1);
    last_sequence += WriteBatchInternal::Count(write_batch);
    {
      // The log append and memtable insert run without the mutex; the
      // front-of-queue protocol above keeps writers serialized.
      mutex_.Unlock();
      status = log_->AddRecord(WriteBatchInternal::Contents(write_batch));
      if (status.ok() && options.sync) {
        status = logfile_->Sync();
      }
      if (status.ok()) {
        status = WriteBatchInternal::InsertInto(write_batch, mem_);
      }
      mutex_.Lock();
    }
    versions_->SetLastSequence(last_sequence);
  }

  // Wake every writer whose batch was committed as part of this group.
  while (true) {
    Writer* ready = writers_.front();
    writers_.pop_front();
    if (ready != &w) {
      ready->status = status;
      ready->done = true;
      ready->cv.Signal();
    }
    if (ready == last_writer) break;
  }
  if (!writers_.empty()) {
    writers_.front()->cv.Signal();
  }
  return status;
}
```
Read Process
A read first checks the active Memtable, then the Immutable Memtable, and finally the SSTables from level 0 upward, consulting the TableCache and BlockCache along the way; the first entry matching the lookup key at the read snapshot wins.
```cpp
Status DBImpl::Get(const ReadOptions& options, const Slice& key,
                   std::string* value) {
  Status s;
  MutexLock l(&mutex_);
  // Read at an explicit snapshot if one was supplied, otherwise at the
  // latest committed sequence number.
  SequenceNumber snapshot =
      options.snapshot != nullptr
          ? static_cast<const SnapshotImpl*>(options.snapshot)->sequence_number()
          : versions_->LastSequence();

  MemTable* mem = mem_;
  MemTable* imm = imm_;
  Version* current = versions_->current();
  mem->Ref();
  if (imm != nullptr) imm->Ref();
  current->Ref();

  Version::GetStats stats;
  {
    // Unlock while reading from the memtables and files.
    mutex_.Unlock();
    LookupKey lkey(key, snapshot);
    if (mem->Get(lkey, value, &s)) {
      // Found in the active memtable.
    } else if (imm != nullptr && imm->Get(lkey, value, &s)) {
      // Found in the immutable memtable.
    } else {
      // Fall through to the SSTables, level 0 upward.
      s = current->Get(options, lkey, value, &stats);
    }
    mutex_.Lock();
  }

  mem->Unref();
  if (imm != nullptr) imm->Unref();
  current->Unref();
  return s;
}
```
Conclusion
This article presented an overview of TiDB, TiKV, RocksDB, and LevelDB, highlighting their architectures, data structures, and read/write mechanisms. Understanding these foundational storage engines provides valuable insight into modern distributed databases and KV storage systems.
Trip Tech Team
Trip.com technology team, focused on internationalization and localization solutions.