Percolator Distributed Transaction Architecture and Its Implementation in TiKV
Percolator implements a two‑phase‑commit transaction protocol with a client acting as coordinator, a timestamp oracle, and an underlying store (Bigtable, or RocksDB in TiKV's case). It provides snapshot‑isolated ACID semantics via dedicated lock, write, and data columns. TiKV extends the design with parallel prewrite, short‑value storage, timestamp‑free point reads, calculated commit timestamps, and single‑region one‑phase commit to boost performance while keeping the design simple and scalable, though high contention can still cause retries and read‑wait delays.
Background
Percolator is a distributed transaction solution proposed by Google in the 2010 paper "Large‑scale Incremental Processing Using Distributed Transactions and Notifications". The paper introduced Percolator to solve the incremental indexing problem of a search engine.
Percolator provides ACID semantics and implements the Snapshot Isolation isolation level, making it a general‑purpose distributed transaction protocol. It is built on top of Google’s Bigtable and essentially follows a two‑phase commit (2PC) protocol that leverages Bigtable’s row‑level transactions.
Architecture
Percolator consists of three components:
Client : the coordinator of the 2PC protocol, responsible for initiating and committing transactions.
Timestamp Oracle (TSO) : a global service that issues unique, monotonically increasing timestamps.
Bigtable : the underlying distributed storage that persists data.
2.1 Client
The client acts as the coordinator in the two‑phase commit, issuing prewrite and commit requests.
2.2 Timestamp Oracle (TSO)
All transactions obtain a start timestamp from TSO and later a commit timestamp. These timestamps are used to enforce Snapshot Isolation.
2.3 Bigtable
Bigtable stores rows as a multi‑dimensional ordered map with the key format:
(row:string, column:string, timestamp:int64) -> string

The key is a triple (row, column, timestamp); the value is a byte array. Bigtable provides per‑row atomic operations, which Percolator uses to guarantee atomic updates of multiple columns within the same row.
Percolator stores three special columns:
c:lock – lock records written during the prewrite phase.
c:write – commit records written during the commit phase.
c:data – the actual user data.
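As a toy illustration of this layout (an in‑memory sketch, not Bigtable's API; the row contents follow the paper's bank‑account example), a single row can be modeled as a map from (column, timestamp) to value:

```python
# One logical row in Percolator's model: every column ("data", "lock",
# "write") holds multiple timestamped versions, mirroring Bigtable's
# (row, column, timestamp) -> value map.
row_bob = {
    ("data", 5): b"$10",     # Bob's balance, written at start_ts 5
    ("write", 6): 5,         # commit record at commit_ts 6 pointing to data at ts 5
    ("lock", 7): "primary",  # an in-progress transaction locked the row at ts 7
}

def latest(row, column, ts):
    """Return the newest version of `column` visible at timestamp `ts`."""
    versions = [(t, v) for (c, t), v in row.items() if c == column and t <= ts]
    return max(versions)[1] if versions else None
```

A reader at timestamp 6 first consults c:write (finding start_ts 5), then fetches c:data at that timestamp.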
2.4 Snapshot Isolation
All reads see a consistent snapshot (comparable to, though not strictly the same as, the REPEATABLE READ isolation level).
If two concurrent transactions write the same cell, only one can commit.
During commit, if a transaction discovers that any of its written data has been overwritten by a later‑started transaction, it rolls back; otherwise it commits.
Write‑skew can occur because Snapshot Isolation does not guarantee full serializability, but it offers better read performance.
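The first‑committer‑wins rule can be sketched as a check against already‑committed versions (illustrative Python; `can_commit` and the dictionary shape are assumptions, not Percolator's actual interface):

```python
def can_commit(write, key, start_ts):
    """First-committer-wins: a transaction started at `start_ts` may
    commit its write to `key` only if no other transaction committed
    that key after start_ts.
    write: (key, commit_ts) -> committing transaction's start_ts
    """
    return not any(k == key and cts > start_ts for (k, cts) in write)
```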
3 Transaction Processing
3.1 Write Logic (Prewrite & Commit)
Percolator uses a two‑phase commit:
Prewrite phase : Obtain a start timestamp from TSO. For each row to be written, write the start_ts to the c:lock column and the new value (with start_ts) to the c:data column. One lock is chosen as the primary lock; the others are secondary locks that reference the primary.
Commit phase : Obtain a commit timestamp from TSO. Delete the primary lock and write the commit_ts to the c:write column atomically. If the primary lock is missing, the commit fails. Repeat the same steps for all secondary locks.
The article illustrates the process with a classic bank‑transfer example (Bob → Joe).
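The two phases can be shown as a minimal single‑process sketch using the Bob → Joe transfer (the `Store` class and function names are illustrative, and the TSO is a simple counter; real Percolator/TiKV runs these steps as RPCs against the store):

```python
class Store:
    """Toy single-process model of Percolator's three columns."""
    def __init__(self):
        self.data = {}   # (key, start_ts) -> value
        self.lock = {}   # key -> (start_ts, primary_key)
        self.write = {}  # (key, commit_ts) -> start_ts
        self.tso = 0     # toy timestamp oracle: a monotonic counter

    def next_ts(self):
        self.tso += 1
        return self.tso

def prewrite(store, writes, start_ts):
    """Phase 1: lock every key and stage its new value at start_ts.
    The first key serves as the primary lock; the rest reference it."""
    primary = writes[0][0]
    for key, value in writes:
        # Abort on conflict: an existing lock, or a commit newer than start_ts.
        if key in store.lock or any(k == key and cts > start_ts
                                    for (k, cts) in store.write):
            return False  # a real implementation also rolls back locks taken so far
        store.lock[key] = (start_ts, primary)
        store.data[(key, start_ts)] = value
    return True

def commit(store, keys, start_ts, commit_ts):
    """Phase 2: commit the primary first, then the secondaries."""
    primary = keys[0]
    if primary not in store.lock:  # primary lock lost -> commit fails
        return False
    for key in keys:
        store.lock.pop(key, None)
        store.write[(key, commit_ts)] = start_ts
    return True

# Bob transfers $7 to Joe (balances 10 -> 3 and 10 -> 17).
s = Store()
start = s.next_ts()                                   # start_ts = 1
assert prewrite(s, [("Bob", 3), ("Joe", 17)], start)  # phase 1
assert commit(s, ["Bob", "Joe"], start, s.next_ts())  # phase 2, commit_ts = 2
```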
3.2 Read Logic
Obtain a read timestamp ts.
Check whether any lock exists with a timestamp ≤ ts. If such a lock exists, the read must wait until the lock is released.
From the c:write column, find the record with the largest commit_ts ≤ ts and retrieve its start_ts.
Use the start_ts to read the corresponding value from the c:data column.
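The four steps above can be sketched over plain dictionaries (an illustrative model, not TiKV's API; a real reader would wait and retry or resolve the lock rather than raise):

```python
def get(data, lock, write, key, ts):
    """Snapshot read of `key` at timestamp `ts`.
    data:  (key, start_ts) -> value
    lock:  key -> lock_start_ts
    write: (key, commit_ts) -> start_ts
    """
    # Step 2: a lock at or before ts means a pending write might commit
    # below ts, so the reader must wait; here we just signal that.
    if key in lock and lock[key] <= ts:
        raise RuntimeError("locked: wait for the lock to be released")
    # Step 3: newest commit record with commit_ts <= ts.
    commits = [(cts, sts) for (k, cts), sts in write.items()
               if k == key and cts <= ts]
    if not commits:
        return None
    _, start_ts = max(commits)
    # Step 4: the commit record points at the data version to return.
    return data[(key, start_ts)]

# One committed version of "Bob": written at start_ts 1, committed at ts 2.
data = {("Bob", 1): 10}
write = {("Bob", 2): 1}
```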
3.3 Handling Client Crashes
If a client crashes during commit, its locks may remain. Percolator adopts a lazy lock‑cleaning strategy: when another transaction encounters a leftover lock, it attempts to determine whether the original client has truly failed. The primary lock serves as the synchronization point; if the primary lock is absent and a commit record exists, the transaction is considered committed.
To reliably detect crashes, Percolator uses the Chubby lock service to store the liveness of each client.
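The lazy lock‑resolution decision can be sketched as follows (illustrative; as noted above, real Percolator also consults Chubby for client liveness before rolling back a lock whose primary is still present):

```python
def resolve_lock(lock, write, key, start_ts):
    """Decide the fate of a leftover lock on `key`.
    lock:  key -> (start_ts, primary_key)
    write: (key, commit_ts) -> start_ts
    Returns "commit" (roll forward), "rollback", or "none".
    """
    if key not in lock or lock[key][0] != start_ts:
        return "none"  # lock already resolved by someone else
    _, primary = lock[key]
    if primary in lock and lock[primary][0] == start_ts:
        # Primary lock still present: the transaction never committed.
        # (Percolator first verifies via Chubby that the owner is dead.)
        return "rollback"
    # Primary lock is gone: the transaction committed iff its commit
    # record (pointing back at start_ts) exists on the primary.
    committed = any(k == primary and sts == start_ts
                    for (k, cts), sts in write.items())
    return "commit" if committed else "rollback"
```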
4 Implementation and Optimizations in TiKV
TiKV uses RocksDB as its storage engine. RocksDB provides atomic write batches, which satisfy Percolator’s row‑transaction requirements.
4.1 RocksDB Column Families
CF_DEFAULT : stores user data (key, start_ts) → value.
CF_LOCK : stores lock information (key → lock_info).
CF_WRITE : stores commit information (key, commit_ts) → write_info.
Keys are encoded as a combination of the user key (in memcomparable format) and the timestamp (bit‑wise inverted, big‑endian). Example encoding for key "key1" with timestamp 3:
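A Python sketch of this encoding (the 8‑byte‑group padding and marker byte follow TiKV's memcomparable format; the function name is illustrative):

```python
def encode_key(user_key: bytes, ts: int) -> bytes:
    """Encode (user_key, timestamp): the user key in memcomparable form,
    then the bitwise-inverted timestamp big-endian, so newer versions of
    a key sort first under ascending byte order."""
    out = bytearray()
    # Memcomparable form: 8-byte groups, zero-padded, each followed by a
    # marker byte 0xFF - pad_count (0xFB when 4 padding bytes were added).
    for i in range(0, len(user_key) + 1, 8):
        group = user_key[i:i + 8]
        pad = 8 - len(group)
        out += group + b"\x00" * pad
        out.append(0xFF - pad)
    # Timestamp: bitwise NOT over 64 bits, big-endian.
    out += (ts ^ 0xFFFFFFFFFFFFFFFF).to_bytes(8, "big")
    return bytes(out)
```

Because the timestamp is inverted, a larger timestamp produces a smaller byte string, so the newest version of a key is encountered first when scanning forward.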
key1\x00\x00\x00\x00\xfb\xff\xff\xff\xff\xff\xff\xff\xfc

4.2 Optimizations
Parallel Prewrite : When a transaction spans multiple TiKV nodes, prewrite requests for the different batches are sent in parallel, without waiting for the primary key's prewrite to succeed first.
Short Value in Write Column : Small values are embedded in the lock record during prewrite and moved into the commit record in CF_WRITE at commit time, so a read can return the value straight from the commit record without a second RocksDB lookup in CF_DEFAULT.
Point Read Without Timestamp : For a read of a single key, obtaining a start_ts from TSO is unnecessary; the latest committed version can be read directly, since a single‑key read is trivially a consistent snapshot.
Calculated Commit Timestamp : Commit timestamps must satisfy max(start_ts, max_read_ts_of_written_keys) < commit_ts ≤ now . The calculation is:
commit_ts = max(start_ts, region_1_max_read_ts, region_2_max_read_ts, ...) + 1

where region_N_max_read_ts is the maximum read timestamp observed in region N.
Single‑Region 1PC : If all keys of a transaction reside in a single region and there is no write conflict, the transaction can be committed with a single‑phase commit, bypassing the full 2PC protocol.
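The calculated‑commit‑timestamp rule above can be sketched as a helper (illustrative; in practice each TiKV region tracks its own max read timestamp, and `now_ts` stands in for the current TSO time, with the fallback in the assertion being an assumption about how a violation would be handled):

```python
def calc_commit_ts(start_ts, region_max_read_ts, now_ts):
    """Pick a commit timestamp satisfying the constraint above:
    max(start_ts, all regions' max read ts) < commit_ts <= now."""
    commit_ts = max(start_ts, *region_max_read_ts) + 1
    assert commit_ts <= now_ts, "fall back to fetching a commit_ts from TSO"
    return commit_ts
```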
5 Summary
Advantages
Transaction management is built on top of the storage layer, resulting in a clear architecture and good scalability.
Performance is good in low‑conflict workloads.
Disadvantages
Performance degrades sharply under high conflict due to repeated retries.
MVCC can cause read‑wait scenarios when read‑write conflicts occur.
Overall, the Percolator model offers a clean, simple design that works well when contention is low.
References
Codis author reveals TiKV transaction model, Google Spanner open‑source implementation
Analysis of Google Percolator transaction model
Large‑scale Incremental Processing Using Distributed Transactions and Notifications – Google Research
Database principle introduction – Google Percolator distributed transaction implementation
vivo Internet Technology