
Understanding RocksDB Write Process and Group Commit Mechanism

This article explains the ACID properties, write‑ahead logging, and how RocksDB implements a three‑step write path and a leader‑follower Group Commit mechanism to improve transaction throughput by batching fsync operations.

Tencent Database Technology

0. Intro

According to the Wikipedia entry on ACID, a DBMS must guarantee four properties (atomicity, consistency, isolation, and durability) when it writes or updates data. To provide atomicity and durability, the system records the details of every operation in a write-ahead log (WAL) before it modifies in-memory data structures, so that the changes can be replayed during recovery after a crash.

Because each transaction commit traditionally requires an expensive fsync to flush the log to disk, committing becomes a bottleneck. Group Commit merges several pending transactions into a single fsync, so the cost of one flush is shared by the whole group and transactions per second (TPS) can rise substantially.
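To make the saving concrete, here is a toy model that only counts how many flushes each strategy issues. `ToyLog`, `CommitOneByOne`, and `CommitAsGroup` are hypothetical names invented for this sketch; a real engine would call fsync on a file descriptor where the sketch increments a counter.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy log: Append buffers a record, Sync stands in for an fsync call.
// Hypothetical type for illustration; not part of RocksDB.
struct ToyLog {
  std::vector<std::string> records;
  std::size_t sync_count = 0;  // how many simulated fsyncs were issued
  std::size_t durable = 0;     // number of records covered by the last Sync

  void Append(const std::string& rec) { records.push_back(rec); }

  void Sync() {
    assert(durable <= records.size());
    ++sync_count;
    durable = records.size();
  }
};

// One flush per transaction: N commits cost N syncs.
inline void CommitOneByOne(ToyLog& log, const std::vector<std::string>& txns) {
  for (const auto& t : txns) {
    log.Append(t);
    log.Sync();
  }
}

// Group commit: append every pending transaction, then flush once.
inline void CommitAsGroup(ToyLog& log, const std::vector<std::string>& txns) {
  for (const auto& t : txns) {
    log.Append(t);
  }
  if (!txns.empty()) {
    log.Sync();
  }
}
```

With four pending transactions, `CommitOneByOne` issues four simulated flushes while `CommitAsGroup` issues one, and in both cases all four records end up covered by a flush.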

1. RocksDB Write Process

The RocksDB (MyRocks) write path consists of three steps:

Package one or more operation records into a WriteBatch.

Append the batch’s log records to the WAL file.

Insert the batch’s records into the in‑memory memtable.

Each WriteBatch represents a transaction and can contain multiple key/value operations, which are added via WriteBatch::Put, WriteBatch::Delete, and similar methods.
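The three steps can be sketched with a self-contained toy model. `ToyOp`, `ToyWriteBatch`, `ToyDB`, and the log-line encoding are hypothetical and far simpler than RocksDB's real classes; only the ordering (batch first, then WAL, then memtable) is taken from the description above.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for one operation inside a batch.
struct ToyOp {
  enum Kind { kPut, kDelete } kind;
  std::string key;
  std::string value;  // unused for kDelete
};

// Step 1: a batch collects the operations of one transaction.
struct ToyWriteBatch {
  std::vector<ToyOp> ops;

  void Put(const std::string& k, const std::string& v) {
    ops.push_back({ToyOp::kPut, k, v});
  }
  void Delete(const std::string& k) { ops.push_back({ToyOp::kDelete, k, ""}); }
};

// Hypothetical stand-in for the engine: the WAL is a vector of log lines,
// the memtable is a sorted map.
struct ToyDB {
  std::vector<std::string> wal;
  std::map<std::string, std::string> memtable;

  void Write(const ToyWriteBatch& batch) {
    assert(!batch.ops.empty());
    // Step 2: append every log record before touching the memtable.
    for (const auto& op : batch.ops) {
      wal.push_back(op.kind == ToyOp::kPut ? "PUT " + op.key + "=" + op.value
                                           : "DEL " + op.key);
    }
    // Step 3: apply the same operations to the memtable.
    // Simplification: a real LSM memtable records a tombstone for a delete;
    // erasing the key keeps this toy short.
    for (const auto& op : batch.ops) {
      if (op.kind == ToyOp::kPut) {
        memtable[op.key] = op.value;
      } else {
        memtable.erase(op.key);
      }
    }
  }
};
```

A batch that puts `a`, puts `b`, and then deletes `a` leaves three records in the toy WAL and a single key `b` in the toy memtable.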

2. RocksDB Group Commit

RocksDB also employs Group Commit to improve commit performance. Every write thread creates a WriteThread::Writer instance that points to its WriteBatch. The Writer structure is:

struct Writer {
    WriteBatch* batch;
    bool sync;
    bool no_slowdown;
    bool disable_wal;
    bool disable_memtable;
    uint64_t log_used;  // log number that this batch was inserted into
    uint64_t log_ref;   // log number that memtable insert should reference
    WriteCallback* callback;
    bool made_waitable;          // records lazy construction of mutex and cv
    std::atomic<uint8_t> state;   // write under StateMutex() or pre-link
    WriteGroup* write_group;
    SequenceNumber sequence;     // the sequence number to use for the first key
    Status status;              // status of memtable inserter
    Status callback_status;      // status returned by callback->Callback()
    std::aligned_storage<sizeof(std::mutex)>::type state_mutex_bytes;
    std::aligned_storage<sizeof(std::condition_variable)>::type state_cv_bytes;
    Writer* link_older;  // read/write only before linking, or as leader
    Writer* link_newer;  // lazy, read/write only before linking, or as leader
};

Writers form a linked list; each pending transaction is added to the tail via JoinBatchGroup(&w). The list is processed in order, and multiple writes are merged into a single WAL write that one thread then fsyncs.
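One way to picture the join step is a lock-free push onto the newest end of the list, where the thread that finds the list empty learns that it should lead. The sketch below is an assumption-laden toy: `ToyWriter`, `ToyWriteQueue`, and `Join` are hypothetical names, and the blocking that a follower does while it waits for the leader is left out.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical, stripped-down writer node.
struct ToyWriter {
  int id = 0;
  ToyWriter* link_older = nullptr;  // set by the joining thread
  ToyWriter* link_newer = nullptr;  // filled in later by the leader
};

// Hypothetical queue holding only a pointer to the newest writer.
class ToyWriteQueue {
 public:
  // Pushes w as the newest writer. Returns true if the list was empty,
  // meaning the caller should act as leader.
  bool Join(ToyWriter* w) {
    assert(w != nullptr);
    ToyWriter* prev = newest_.load(std::memory_order_relaxed);
    do {
      // On a failed exchange, prev is refreshed with the current newest
      // writer, so the older link is rewritten before the next attempt.
      w->link_older = prev;
    } while (!newest_.compare_exchange_weak(prev, w));
    return prev == nullptr;
  }

  ToyWriter* newest() const { return newest_.load(); }

 private:
  std::atomic<ToyWriter*> newest_{nullptr};
};
```

Only the backward links are set during the push, which keeps the join to a single atomic exchange; the forward links can be created later by whichever thread leads the group.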

RocksDB distinguishes leader and follower threads for Group Commit:

The leader thread batches its own and followers’ WAL records and writes them to the WAL file.

If allow_concurrent_memtable_write is enabled, the leader notifies followers to write to the memtable concurrently; otherwise the leader writes all followers’ data serially.
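When memtable writes run concurrently, the group needs a way to tell which thread finished last, because that thread is the one that wraps up the group. A minimal sketch with an atomic counter follows; `ToyGroup`, `LaunchParallel`, `CompleteParallel`, and the `running` field are names assumed for this illustration, and the actual memtable insert is omitted.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical group descriptor: `running` counts the writers that still
// have to finish their memtable insert.
struct ToyGroup {
  std::atomic<std::size_t> running{0};
};

// Called by the leader before it wakes the followers.
inline void LaunchParallel(ToyGroup& g, std::size_t group_size) {
  assert(group_size > 0);
  g.running.store(group_size);
}

// Called by every writer (leader included) after its own memtable insert.
// fetch_sub returns the value before the decrement, so exactly one caller
// sees 1 and learns that it was the last to finish.
inline bool CompleteParallel(ToyGroup& g) {
  return g.running.fetch_sub(1) == 1;
}
```

For a group of three writers, the first two calls to `CompleteParallel` return false and the third returns true, regardless of which thread makes which call.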

When a thread becomes the first entry in the write list, it assumes the leader role and executes the following simplified steps:

Call WriteThread::EnterAsBatchGroupLeader to create a WriteGroup describing the batch to be committed. The maximum group size is 1 MB if the leader’s own batch is larger than 128 KB; otherwise it is the leader’s batch size plus 128 KB, which keeps a small write from being delayed by a large group.

Identify the newest writer in the list and link the entire list into a doubly‑linked structure.

Traverse from the leader toward the newest writer, accumulating batch sizes until the maximum size would be exceeded or a writer’s flags are incompatible with the leader’s, then record the last accepted writer in WriteGroup::last_writer.

Check whether concurrent memtable writes are allowed (memtable support, no merge operations, and the flag is set).

Write the merged group to the WAL and perform an fsync.

If concurrent memtable writes are allowed, invoke LaunchParallelMemTableWriter to let followers write in parallel; otherwise the leader writes serially.

All threads call CompleteParallelMemTableWriter to determine which thread finishes last. If a follower finishes last, it calls ExitAsBatchGroupFollower, which in turn calls ExitAsBatchGroupLeader to notify the other followers; if the leader finishes last, it calls ExitAsBatchGroupLeader directly.

Before exiting, the leader checks whether new writers have been added to the list; if so, one of them becomes the new leader and the process repeats.
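The group-forming part of these steps (size cap, forward links, and traversal) can be sketched as follows. Everything here is a simplified assumption: `ToyWriter`, `ToyWriteGroup`, `MaxGroupBytes`, `CreateNewerLinks`, and `FormGroup` are hypothetical names, and the compatibility check compares only a single `sync` flag, whereas the text above mentions several flags.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, stripped-down writer node.
struct ToyWriter {
  std::size_t batch_bytes = 0;
  bool sync = false;
  ToyWriter* link_older = nullptr;
  ToyWriter* link_newer = nullptr;
};

// Hypothetical group descriptor.
struct ToyWriteGroup {
  ToyWriter* leader = nullptr;
  ToyWriter* last_writer = nullptr;  // last writer admitted to the group
  std::size_t size = 0;              // number of writers in the group
};

// Size cap as described above: 1 MB when the leader's batch exceeds
// 128 KB, otherwise the leader's batch size plus 128 KB.
inline std::size_t MaxGroupBytes(std::size_t leader_bytes) {
  const std::size_t kSmall = 128 << 10;
  const std::size_t kLarge = 1 << 20;
  return leader_bytes > kSmall ? kLarge : leader_bytes + kSmall;
}

// Walks from the newest writer back to the oldest and fills in link_newer,
// turning the singly linked list into a doubly linked one.
inline void CreateNewerLinks(ToyWriter* newest) {
  for (ToyWriter* w = newest; w != nullptr && w->link_older != nullptr;
       w = w->link_older) {
    w->link_older->link_newer = w;
  }
}

// Builds a group starting at the leader and extending toward newer writers.
inline ToyWriteGroup FormGroup(ToyWriter* leader, ToyWriter* newest) {
  assert(leader != nullptr && newest != nullptr);
  CreateNewerLinks(newest);

  ToyWriteGroup g;
  g.leader = leader;
  g.last_writer = leader;
  g.size = 1;

  std::size_t bytes = leader->batch_bytes;
  const std::size_t max_bytes = MaxGroupBytes(bytes);
  for (ToyWriter* w = leader->link_newer; w != nullptr; w = w->link_newer) {
    if (w->sync != leader->sync) break;              // flags differ
    if (bytes + w->batch_bytes > max_bytes) break;   // group would be too big
    bytes += w->batch_bytes;
    g.last_writer = w;
    ++g.size;
  }
  return g;
}
```

Stopping at the first incompatible writer, rather than skipping it, keeps the group a contiguous run of the list, so the writers left behind stay in their original order for the next leader.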

The writer state machine includes:

STATE_INIT – initial state.

STATE_GROUP_LEADER – selected as leader.

STATE_MEMTABLE_WRITER_LEADER – leader responsible for serial memtable writes.

STATE_PARALLEL_MEMTABLE_WRITER – follower performing concurrent memtable writes.

STATE_COMPLETED – Group Commit finished.

STATE_LOCKED_WAITING – writer waiting for state change.
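A compact way to represent such states is one bit per state, so that a waiting thread can name several acceptable outcomes in a single mask. The sketch below assumes that encoding; the numeric values, `Reached`, `NextStep`, and `Dispatch` are illustrative and not taken from the text above, and the mapping from state to action follows the descriptions in the list.

```cpp
#include <cassert>
#include <cstdint>

// Assumed encoding: one bit per state. The values are illustrative.
enum ToyState : std::uint8_t {
  STATE_INIT = 1,
  STATE_GROUP_LEADER = 2,
  STATE_MEMTABLE_WRITER_LEADER = 4,
  STATE_PARALLEL_MEMTABLE_WRITER = 8,
  STATE_COMPLETED = 16,
  STATE_LOCKED_WAITING = 32,
};

// True once the writer's state is one of the states named in goal_mask.
inline bool Reached(std::uint8_t state, std::uint8_t goal_mask) {
  assert(goal_mask != 0);
  return (state & goal_mask) != 0;
}

// Hypothetical summary of what a writer does after it wakes up.
enum class NextStep {
  kKeepWaiting,
  kLeadGroup,
  kWriteAllMemtables,
  kWriteOwnMemtable,
  kReturn,
};

inline NextStep Dispatch(std::uint8_t state) {
  switch (state) {
    case STATE_GROUP_LEADER:
      return NextStep::kLeadGroup;
    case STATE_MEMTABLE_WRITER_LEADER:
      return NextStep::kWriteAllMemtables;
    case STATE_PARALLEL_MEMTABLE_WRITER:
      return NextStep::kWriteOwnMemtable;
    case STATE_COMPLETED:
      return NextStep::kReturn;
    default:  // STATE_INIT or STATE_LOCKED_WAITING
      return NextStep::kKeepWaiting;
  }
}
```

A writer that has just joined the list would wait on a mask such as `STATE_GROUP_LEADER | STATE_PARALLEL_MEMTABLE_WRITER | STATE_COMPLETED` and then branch on whichever state it was actually given.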

3. Summary

This article introduced the RocksDB storage engine’s Group Commit mechanism for writing data, detailing the leader‑follower coordination, batch size calculation, WAL flushing, and optional concurrent memtable writes.

4. References

ACID: https://zh.wikipedia.org/wiki/ACID
