Inside TiKV: MVCC Mechanics and Distributed Transaction Design
This article explains how TiKV implements multi-version concurrency control (MVCC) on top of RocksDB and details its two‑phase commit transaction model, including Prewrite and Commit phases, Percolator‑style optimizations, lock handling, conflict resolution, and garbage‑collection strategies.
TiKV, an open‑source distributed KV store modeled on Google's Spanner, builds its transaction system on a distributed MVCC layer. MVCC stores multiple versions of each key as (key, value, version) tuples in RocksDB, an LSM‑tree based embedded database with excellent write performance and ordered key iteration.
The MVCC API exposed to upper layers includes:
MVCCGet(key, version) – returns the value of the newest version of key whose timestamp is ≤ the given version.
MVCCScan(startKey, endKey, limit, version) – returns up to limit keys in the range, each with its newest version whose timestamp is ≤ the given version.
MVCCPut(key, value, version) – inserts or overwrites a version; the client must ensure monotonically increasing timestamps.
MVCCDelete(key, version) – removes a specific version, callable only by the GC module.
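The four calls above can be illustrated with a minimal in-memory sketch. It is not TiKV's implementation: the class name MVCCStore, the sorted-list index, and the half-open [start_key, end_key) scan range are assumptions made for illustration only.

```python
import bisect
from collections import defaultdict


class MVCCStore:
    """Toy in-memory MVCC store (hypothetical, for illustration only)."""

    def __init__(self):
        self._versions = defaultdict(list)  # key -> ascending list of versions
        self._values = {}                   # (key, version) -> value

    def put(self, key, value, version):
        # Insert a new version, or overwrite the value of an existing one.
        if (key, version) not in self._values:
            bisect.insort(self._versions[key], version)
        self._values[(key, version)] = value

    def get(self, key, version):
        # Newest version whose timestamp is <= the requested version.
        versions = self._versions.get(key, [])
        idx = bisect.bisect_right(versions, version)
        if idx == 0:
            return None
        return self._values[(key, versions[idx - 1])]

    def scan(self, start_key, end_key, limit, version):
        # Assumption: the range is half-open, [start_key, end_key).
        out = []
        for key in sorted(self._versions):
            if len(out) >= limit:
                break
            if start_key <= key < end_key:
                value = self.get(key, version)
                if value is not None:
                    out.append((key, value))
        return out

    def delete(self, key, version):
        # Removes exactly one version; intended for a GC-like caller.
        if (key, version) in self._values:
            del self._values[(key, version)]
            self._versions[key].remove(version)
```

A read at timestamp 15 against versions written at 10 and 20 sees the value written at 10, which is the snapshot behaviour the API describes.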
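Each user key is represented by a meta key, which records the key's versions, and one data key per version. A typical MVCCGet first reads the meta key to locate the visible version, then fetches the corresponding data key. Because all related keys share a common prefix and RocksDB keeps keys ordered, they sit next to each other on disk and read amplification remains acceptable.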
When a key is updated very frequently, its meta key can become large. TiKV mitigates this by splitting the meta key into multiple smaller meta keys (Meta0, Meta1, …), each covering a sub‑range of versions, thus avoiding excessive read amplification.
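One way to picture the split is a version index broken into fixed-size chunks, searched from the newest chunk backwards. The sketch below is an assumption-laden illustration: the class SplitMeta, the chunk_size parameter, and the ordering of chunks are hypothetical and are not taken from TiKV's on-disk format.

```python
import bisect


class SplitMeta:
    """Version index for one key, split into chunks (hypothetical layout)."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = [[]]  # each chunk is an ascending list of versions

    def add_version(self, version):
        # Assumption: versions arrive in increasing order.
        if len(self.chunks[-1]) >= self.chunk_size:
            self.chunks.append([])
        self.chunks[-1].append(version)

    def find_visible(self, ts):
        """Return (version, chunks_read) for the newest version <= ts."""
        chunks_read = 0
        for chunk in reversed(self.chunks):
            chunks_read += 1
            if not chunk or chunk[0] > ts:
                continue  # every version in this chunk is newer than ts
            idx = bisect.bisect_right(chunk, ts)
            return chunk[idx - 1], chunks_read
        return None, chunks_read
```

Reads at recent timestamps touch only the newest chunk, so the cost of a lookup depends on how far back the snapshot is rather than on the total number of versions.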
Distributed Transaction Model
TiKV adopts a two‑phase commit (2PC) protocol inspired by Google’s Percolator rather than Spanner’s TrueTime‑based approach. The transaction proceeds in two stages:
Prewrite – the client selects one primary row and treats the remaining rows as secondaries. For the primary row it first checks for an existing lock or a conflicting write, then writes a lock (including the start timestamp) and the data tagged with the start timestamp. The same process is repeated for each secondary row, whose lock also records which row is the primary.
Commit – the client obtains a commit timestamp, which the TSO service guarantees is greater than the start timestamp. It then writes a new meta version for the primary row, removes the primary lock, and commits the secondary rows asynchronously.
If any step in Prewrite fails, the transaction rolls back by deleting the lock and the tentative version. If the primary row fails to commit, the whole transaction aborts; otherwise, secondary rows may commit asynchronously, and their success does not affect the overall transaction outcome.
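The Prewrite, Commit, and rollback steps can be sketched in a single process with plain dictionaries standing in for storage. Everything here is hypothetical: the names Cluster, Conflict, and run_transaction, the counter used in place of the TSO, and the choice of the first key as primary. Secondaries are committed in a plain loop where a real system would do so asynchronously.

```python
import itertools


class Conflict(Exception):
    """Raised when Prewrite finds a lock or a newer committed write."""


class Cluster:
    def __init__(self):
        self.data = {}    # (key, start_ts) -> tentative or committed value
        self.locks = {}   # key -> (primary_key, start_ts)
        self.writes = {}  # key -> list of (commit_ts, start_ts)
        self._tso = itertools.count(1)  # stand-in for the TSO service

    def next_ts(self):
        return next(self._tso)

    def prewrite(self, key, value, primary, start_ts):
        if key in self.locks:
            raise Conflict(f"{key} is locked")
        if any(c >= start_ts for c, _ in self.writes.get(key, [])):
            raise Conflict(f"{key} was committed after start_ts")
        self.locks[key] = (primary, start_ts)
        self.data[(key, start_ts)] = value

    def commit_key(self, key, start_ts, commit_ts):
        lock = self.locks.get(key)
        if lock is None or lock[1] != start_ts:
            raise Conflict(f"lock on {key} is gone")
        self.writes.setdefault(key, []).append((commit_ts, start_ts))
        del self.locks[key]

    def rollback_key(self, key, start_ts):
        # Only remove a lock that belongs to this transaction.
        lock = self.locks.get(key)
        if lock is not None and lock[1] == start_ts:
            del self.locks[key]
        self.data.pop((key, start_ts), None)


def run_transaction(cluster, mutations):
    """mutations: dict of key -> value; the first key acts as primary."""
    keys = list(mutations)
    primary, secondaries = keys[0], keys[1:]
    start_ts = cluster.next_ts()
    try:
        for key in keys:  # primary first, then secondaries
            cluster.prewrite(key, mutations[key], primary, start_ts)
    except Conflict:
        for key in keys:
            cluster.rollback_key(key, start_ts)
        return None
    commit_ts = cluster.next_ts()
    cluster.commit_key(primary, start_ts, commit_ts)  # the commit point
    for key in secondaries:  # asynchronous in a real system
        cluster.commit_key(key, start_ts, commit_ts)
    return commit_ts
```

run_transaction returns the commit timestamp on success and None after a rollback, so a caller in this sketch would simply retry with a fresh start timestamp.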
Lock Management and Conflict Handling
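Locks come in two kinds: the primary lock and the secondary locks, which point back to the primary. Once the primary row's commit record is written and its lock removed, the transaction is considered committed, allowing secondary commits to proceed in the background. Reads first check for a lock: if one is present and has not expired, the read waits; if it has expired, the read attempts lock cleanup. This prevents a read from observing a partially committed transaction.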
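The read path can be sketched as follows, under stated assumptions: the names Store, Lock, and LockedError are hypothetical, expiry is modeled as a simple deadline passed in by the caller, and the race between cleanup and a concurrent commit that a real system must handle is ignored.

```python
from dataclasses import dataclass


@dataclass
class Lock:
    primary: str
    start_ts: int
    expires_at: float  # hypothetical TTL deadline


class LockedError(Exception):
    """A live lock blocks the read; the caller should wait and retry."""


class Store:
    def __init__(self):
        self.data = {}    # (key, start_ts) -> value
        self.locks = {}   # key -> Lock
        self.writes = {}  # key -> list of (commit_ts, start_ts)

    def _commit_ts_of(self, key, start_ts):
        for commit_ts, s in self.writes.get(key, []):
            if s == start_ts:
                return commit_ts
        return None

    def resolve(self, key):
        """Decide the fate of an expired lock by consulting its primary."""
        lock = self.locks[key]
        commit_ts = self._commit_ts_of(lock.primary, lock.start_ts)
        if commit_ts is not None:
            # Primary committed: roll this key forward.
            self.writes.setdefault(key, []).append((commit_ts, lock.start_ts))
        else:
            # Primary never committed: roll the transaction back.
            primary_lock = self.locks.get(lock.primary)
            if primary_lock and primary_lock.start_ts == lock.start_ts:
                del self.locks[lock.primary]
                self.data.pop((lock.primary, lock.start_ts), None)
            self.data.pop((key, lock.start_ts), None)
        self.locks.pop(key, None)

    def read(self, key, read_ts, now):
        lock = self.locks.get(key)
        if lock and lock.start_ts <= read_ts:
            if now < lock.expires_at:
                raise LockedError(key)
            self.resolve(key)
        visible = [w for w in self.writes.get(key, []) if w[0] <= read_ts]
        if not visible:
            return None
        _, start_ts = max(visible)  # newest commit_ts <= read_ts
        return self.data[(key, start_ts)]
```

Because the outcome is decided purely by the primary row, any reader can finish or undo an abandoned transaction without contacting the client that started it.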
TiKV uses an optimistic transaction model: locks are only held during the final 2PC phase. To reduce the cost of frequent conflicts, a lightweight scheduler on each storage node queues lock‑contended operations briefly before returning a retry error, decreasing network overhead.
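The queueing idea can be sketched as a per-key latch table. This is a simplification rather than TiKV's scheduler: the class Latches is hypothetical, and it bounds the queue by length (max_waiters) instead of by waiting time, so "briefly" is approximated by "only a few waiters".

```python
from collections import defaultdict, deque


class Latches:
    """Per-key latch table with a bounded wait queue (hypothetical)."""

    def __init__(self, max_waiters=2):
        self.owner = {}                    # key -> command id holding the latch
        self.waiting = defaultdict(deque)  # key -> queued command ids
        self.max_waiters = max_waiters

    def acquire(self, key, cmd_id):
        """Return 'granted', 'queued', or 'retry'."""
        if key not in self.owner:
            self.owner[key] = cmd_id
            return "granted"
        if len(self.waiting[key]) >= self.max_waiters:
            return "retry"  # queue is full: tell the client to retry
        self.waiting[key].append(cmd_id)
        return "queued"

    def release(self, key):
        """Release the latch; return the next command that now owns it."""
        del self.owner[key]
        if self.waiting[key]:
            next_cmd = self.waiting[key].popleft()
            self.owner[key] = next_cmd
            return next_cmd
        return None
```

Queued commands are resolved on the storage node itself, so only the overflow case costs the client an extra round trip.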
Garbage Collection (GC)
GC periodically removes obsolete versions. It cannot simply delete every version older than the safe point: the newest version at or before the safe point is still what a reader at the safe point sees, so it has to stay, and a key whose only version is a tombstone must keep that tombstone until it is no longer needed. TiKV also handles the case where GC encounters a lock on a key: it checks the primary key's meta version to determine whether the transaction was committed or rolled back, ensuring that locked keys are not mistakenly deleted.
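The per-key retention rule can be sketched as below. The function gc_key and the TOMBSTONE marker are hypothetical, and the sketch assumes a tombstone is "no longer needed" once every version it masks is removed in the same pass; lock handling is left out.

```python
TOMBSTONE = object()  # hypothetical marker for a deleted value


def gc_key(versions, safe_point):
    """Return the versions of one key that survive GC.

    versions: list of (commit_ts, value) in ascending commit_ts order.
    """
    older = [v for v in versions if v[0] <= safe_point]
    newer = [v for v in versions if v[0] > safe_point]
    kept = []
    if older:
        latest_ts, latest_value = older[-1]
        # Keep the version visible at the safe point, unless it is a
        # tombstone whose masked versions are being dropped right now.
        if latest_value is not TOMBSTONE:
            kept.append((latest_ts, latest_value))
    return kept + newer
```

With versions at 5, 10, and 20 and a safe point of 15, the version at 10 survives because a reader at timestamp 15 still resolves to it.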
Overall, TiKV's transaction system combines Percolator's decentralized, lock‑based 2PC design with engineering optimizations such as meta‑key splitting, a simple TSO service for monotonically increasing timestamps, and a scheduler‑based conflict mitigation strategy. The result is snapshot isolation (SI), presented as Repeatable Read, with optional explicit locking for stronger guarantees.
dbaplus Community
