Databases 15 min read

How Do Fault‑Tolerant Transactions Work? Exploring Raft, KV Engines, and Concurrency Control

This article examines multiple fault‑tolerant transaction designs—RSM‑based KV, RSM‑based transactions, shared‑storage approaches, high‑availability KV layers, and single‑node engine extensions—comparing their replication strategies, lock handling, and performance trade‑offs while raising open questions about ordering and consistency.

dbaplus Community
dbaplus Community
dbaplus Community
How Do Fault‑Tolerant Transactions Work? Exploring Raft, KV Engines, and Concurrency Control

Basic Concepts

Fault‑tolerance means that a set of networked computers continues to provide service despite some nodes experiencing stop‑failure. The discussion assumes readers understand Raft and basic concurrency control, and references databases such as Spanner, TiKV, and MongoDB.

1. RSM‑Based Fault‑Tolerant KV

Replicated State Machine (RSM) was introduced in "Implementing fault‑tolerant services using the state machine approach". If several state machines start from the same state and execute the same command sequence in the same order, they end in identical states, allowing any failed machine to be replaced.

In a KV engine, commands correspond to Put/Get operations, the state machine is the KV engine itself, and the execution order is determined by the replication log. By first replicating KV operations with Raft and then applying them to the KV engine, a fault‑tolerant KV store is obtained.

Serial vs. Parallel Apply: Raft is often criticized for serial commit and apply, but this is not inherent to Raft.

Two Logs: Raft maintains a log, and the KV engine may have its own write‑ahead log (WAL), causing I/O amplification; merging them is desirable.

Checkpoint: Needed to speed up recovery.

Read‑only replication: Should read‑only operations be replicated?

Composite commands: Can a single command represent a multi‑row transaction?

2. RSM‑Based Transactions

Can a transaction be treated as a single command in an RSM? Since Raft applies commands serially, bundling all transaction operations into one command seems feasible. However, real transactions are interactive, contain conditional logic, and may depend on external state, requiring dedicated concurrency‑control logic.

To handle concurrency control, a lock table and transaction manager are added on the Raft leader. Using Strict Two‑Phase Locking (S2PL) as an example:

Acquire read lock before reading, write lock before writing; reads go through Raft, writes are buffered locally.

Release read locks when the client decides to commit; write a transaction log containing all writes via Raft.

During Raft apply of the transaction log, apply writes to the KV engine and release write locks.

The same pattern works for Snapshot Isolation: obtain a KV snapshot at transaction start, read from the snapshot, buffer writes locally, and on commit write a transaction log via Raft after checking for write‑write conflicts.

3. Shared‑Storage‑Based Transactions

By moving the replication protocol out of a share‑nothing design into a shared‑storage model, a single‑node transaction engine can be placed on top of a highly available storage system, eliminating the need to implement a replication protocol.

How to provide read‑only nodes for read scaling?

How to achieve faster failover for compute nodes?

How to push more operations down to storage nodes?

4. High‑Availability KV‑Based Transactions

Instead of tightly coupling KV, Raft, lock table, and transaction manager in one node, they can be layered. Following Google’s classic approach (GFS → Bigtable → Percolator), the KV layer is extracted, and lock table and transaction manager are built on top of it.

Lock Table: add a {Value, Lock} column to each key.

Txn Manager: record transaction status per primary key, extending the value to {Value, Lock, TxnStatus}.

MVCC: store multiple versions, turning the KV into {Key, Version} → {Value, Lock, TxnStatus}.

This design scales transaction throughput with KV scalability.

5. Single‑Node Engine HA Transactions

In a normal single‑node transaction, after local commit a binlog is written and replicated before responding to the client. This adds replication latency but allows locks to be released earlier, improving overall throughput.

Key open questions include which log to replicate (journal vs. separate binlog), how to order replication, and how differing orders between transaction serialization and RSM state‑machine order affect consistency.

To keep transaction order consistent with replication, an increasing OpTime identifier is used. OpTime must align with transaction serialization order for conflicting concurrent transactions; otherwise anomalies like read‑uncommitted can appear.

In S2PL, assign OpTime after acquiring locks but before commit, ensuring OpTime respects the serialization order.

Comparison

Three representative systems:

Spanner (first scheme): replicates transaction REDO; commit order follows Raft log order.

TiKV / Percolator (second scheme): Raft replicates only KV; transaction order is independent of Raft log.

MySQL / MongoDB (third scheme): Apply before replication, enabling concurrent apply.

Complexity ranking: second simplest , first moderate , third most complex due to divergent ordering.

From concurrency perspective, the third scheme supports perfect concurrency with short lock holding; the first two hold locks longer but can still achieve concurrency during apply.

From read/write overhead perspective, the first scheme can merge replication and engine logs; the second usually needs separate binlog and journal; the third adds extra data to KV, increasing I/O.

Conclusion

Although each system appears to be a simple combination of a few techniques, the interplay between RSM ordering and transaction serialization creates subtle challenges. Open questions remain about unifying RSM order with transaction order, checkpoint coordination, and handling multiple logs.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

fault toleranceReplicationDistributed TransactionsKV StoreRaft
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.