ScyllaDB Row‑Level Repair: Design, Implementation, and Performance Evaluation
ScyllaDB, a high‑performance C++ implementation of Apache Cassandra, introduces row‑level repair to replace the traditional partition‑level repair, reducing data transfer and I/O by operating on individual rows; the presentation details its architecture, multi‑stage process, experimental results, and the resulting six‑fold speedup.
As big‑data workloads continue to grow, NoSQL database systems such as Apache Cassandra have become widely adopted. ScyllaDB, an open‑source high‑performance Cassandra implementation written in C++, offers substantial performance gains and introduces a new approach to the essential maintenance operation nodetool repair.
ScyllaDB Introduction
ScyllaDB was created by a team with deep low‑level software experience, including the authors of KVM and OSv. After initial OS‑level optimisations that yielded only a ~20 % performance boost for Cassandra, the team built the Seastar C++ framework and rewrote Cassandra on top of it, resulting in ScyllaDB.
Key characteristics of ScyllaDB include:
A lightning‑fast NoSQL engine capable of more than 1,000,000 QPS per node with sub‑millisecond 99th‑percentile latency.
Implementation in C++ without a garbage collector.
One thread per physical CPU core, lock‑free and without shared state.
User‑space CPU and disk schedulers, and an optional user‑space TCP/IP stack.
Compatibility with Apache Cassandra and Amazon DynamoDB APIs.
Row‑Level Repair Concept
Repair is a crucial maintenance operation in Cassandra that ensures consistency among replicas. Traditional partition‑level repair works on groups of partitions, generating one checksum per group (Cassandra builds a Merkle tree; ScyllaDB hashes token‑range splits). This coarse granularity causes unnecessary data transfer when only a single row is inconsistent, and large partitions (up to several gigabytes) exacerbate the problem.
Two‑stage partition‑level repair consists of (1) comparing hash values to locate inconsistencies and (2) streaming the mismatched data. This approach requires two full reads of the data from disk, leading to high I/O cost.
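The cost of this coarse granularity can be seen in a toy Python model (not ScyllaDB code): one checksum covers a whole partition group, so a single divergent row forces the entire group to stream. The group structure and SHA‑256 digest here are illustrative stand‑ins.

```python
import hashlib

def group_checksum(partitions):
    """Hash every row in a group of partitions into one digest."""
    h = hashlib.sha256()
    for partition in partitions:
        for row in partition:
            h.update(row.encode())
    return h.digest()

def partition_level_repair(local_groups, remote_groups):
    """Stage 1: compare checksums. Return groups that stage 2 must stream in full."""
    to_stream = []
    for gid in local_groups:
        if group_checksum(local_groups[gid]) != group_checksum(remote_groups[gid]):
            to_stream.append(gid)  # whole group transferred, even for one bad row
    return to_stream

local = {0: [["row-a", "row-b"]], 1: [["row-c"]]}
remote = {0: [["row-a", "row-B"]], 1: [["row-c"]]}  # one row differs in group 0
print(partition_level_repair(local, remote))  # → [0]
```

Note that both stages read the full data set from disk: once to build the checksums, and again to stream the mismatched groups.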
Row‑Level Repair Design
Row‑level repair reduces the repair granularity to individual rows. Each row is hashed to produce a checksum, and a set‑reconciliation algorithm builds a compact data structure that lets the master node infer which rows are missing on each follower. Only the mismatched rows are transferred, and the whole process runs in a single stage because each repaired range fits entirely in memory, eliminating the separate streaming phase.
The method consists of five main steps:
Master and followers negotiate a sync boundary that defines a range small enough to fit entirely in memory.
Data within the sync boundary is moved to a working row buffer on each node; each follower returns an aggregate hash of its buffer.
The master collects the hash sets of all rows from the followers.
Based on the hash differences, the master fetches the missing rows from the appropriate followers.
The master then pushes the rows that its followers lack, completing the repair for that range.
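Steps 3–5 above amount to set arithmetic over row hashes. The following toy Python sketch (rows as strings, replicas as in‑memory sets, blake2b standing in for the xxhash used by ScyllaDB, networking elided) shows the master fetching the rows it lacks and pushing the rows a follower lacks:

```python
import hashlib

def row_hash(row):
    # Illustrative stand-in for ScyllaDB's 64-bit xxhash.
    return hashlib.blake2b(row.encode(), digest_size=8).digest()

def repair_range(master_rows, follower_rows):
    """Steps 3-5: diff the row-hash sets, then fetch and push missing rows."""
    master = {row_hash(r): r for r in master_rows}
    follower = {row_hash(r): r for r in follower_rows}
    fetch = [follower[h] for h in follower.keys() - master.keys()]  # step 4
    push = [master[h] for h in master.keys() - follower.keys()]     # step 5
    master_rows |= set(fetch)
    follower_rows |= set(push)
    return master_rows, follower_rows

master, follower = repair_range({"a", "b"}, {"b", "c"})
print(master == follower)  # → True
```

After the exchange both replicas hold the union of their rows for the processed range, which is the point of the single‑stage design.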
Implementation Details
During negotiation, each node reads rows until a configurable cache (e.g., 32 MiB) is full, hashes each row, and computes a combined hash (e.g., XOR). Followers return both the combined hash and the last row as the sync boundary. The master selects the smallest sync boundary among all nodes to ensure the range fits in memory.
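The negotiation step can be sketched as follows, again as a hypothetical Python model: each node reads rows until its buffer limit, XORs the per‑row hashes into a combined hash, and returns the last row read as its boundary; the master then takes the smallest boundary. The 64‑byte limit and toy row format stand in for the 32 MiB cache and real rows.

```python
import hashlib
from functools import reduce

BUFFER_LIMIT = 64  # bytes; stands in for the configurable 32 MiB cache

def read_until_full(rows):
    """Return (combined_hash, sync_boundary) for the rows that fit in the buffer."""
    taken, used = [], 0
    for row in rows:
        if used + len(row) > BUFFER_LIMIT:
            break
        taken.append(row)
        used += len(row)
    hashes = [int.from_bytes(hashlib.blake2b(r.encode(), digest_size=8).digest(), "big")
              for r in taken]
    combined = reduce(lambda a, b: a ^ b, hashes, 0)  # XOR-combined hash
    return combined, taken[-1] if taken else None

rows = [f"key-{i:04d}:payload" for i in range(10)]  # 16 bytes per row
_, boundary_a = read_until_full(rows)       # node with all 10 rows
_, boundary_b = read_until_full(rows[:3])   # node with only 3 rows
print(min(boundary_a, boundary_b))  # master keeps the smallest boundary
```

Picking the minimum boundary guarantees the chosen range fits in memory on every participating node, since no node read past it.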
In the second step, each node moves the rows inside the sync boundary to its working row buffer and again computes an overall hash. If the hashes match, the range is already consistent; otherwise the process proceeds.
Step three gathers the full set of row hashes. Two strategies are possible: a “brute‑force” approach in which followers send every row hash (effective when more than ~15 % of rows differ), or a compact set‑reconciliation structure such as an invertible Bloom filter (effective when differences are small).
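A back‑of‑envelope comparison shows why the strategy choice depends on the difference rate. In this hypothetical Python sketch, the brute‑force cost is one 8‑byte hash per row, while the reconciliation structure costs a constant factor per difference; the 7× overhead is an assumed toy constant chosen so the crossover lands near the ~15 % figure quoted above, not a measured value.

```python
def brute_force_bytes(n):
    return n * 8       # every 8-byte row hash is shipped

def reconciliation_bytes(d):
    return d * 8 * 7   # assumed per-difference overhead of the compact structure

n = 100_000  # rows in the working buffer
for frac in (0.001, 0.15, 0.5):
    d = int(n * frac)
    best = "reconcile" if reconciliation_bytes(d) < brute_force_bytes(n) else "brute force"
    print(f"{frac:.1%} differ -> {best}")
```

With these assumptions the compact structure wins at 0.1 % divergence (the typical periodic‑repair case), while brute force wins once a large fraction of rows differ.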
After the master knows which rows are missing on each follower, step four fetches those rows from the appropriate followers into the master’s buffer and writes them to disk (SSTables). Finally, step five pushes the rows that each follower lacks back to them, after which all nodes hold identical data for the processed range.
Experimental Evaluation
A three‑node cluster (1 TB total, 1 billion rows, 1 KB per row, one row per partition) was used to test three scenarios:
One node empty, the other two identical (node rebuild).
All nodes fully consistent (baseline repair speed).
99.9 % of rows consistent, 0.1 % inconsistent (typical periodic repair situation).
Results show that in the third scenario row‑level repair achieved roughly a six‑fold speedup over partition‑level repair. The primary reasons are:
No unnecessary data transfer – only ~4 % of the data moved compared with partition‑level repair.
Faster hashing – row‑level repair uses a lightweight 64‑bit xxhash instead of the 256‑bit SHA‑256 used for partition‑level.
Higher parallelism – row‑level repair can repair many token ranges concurrently (16 in Scylla 3.1, more in later versions), adapting to node memory and network latency.
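The parallelism point can be sketched with a thread pool: independent token ranges are repaired concurrently, bounded by a worker limit. Here `repair_one_range` is a hypothetical stand‑in for the per‑range row‑level repair, and 16 workers mirror the Scylla 3.1 figure.

```python
from concurrent.futures import ThreadPoolExecutor

def repair_one_range(token_range):
    # Stand-in for the real per-range repair (negotiate, hash, fetch, push).
    lo, hi = token_range
    return f"repaired [{lo}, {hi})"

ranges = [(i * 64, (i + 1) * 64) for i in range(16)]
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(repair_one_range, ranges))
print(len(results))  # → 16
```

In the real system the degree of parallelism adapts to available node memory and network latency rather than being fixed.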
Conclusion
Row‑level repair dramatically reduces the amount of data transferred, I/O operations, and overall repair time by operating at the granularity of individual rows rather than whole partitions. It leverages efficient hashing, set‑reconciliation, and increased parallelism, making it a generic, high‑performance solution for maintaining consistency in Cassandra‑compatible NoSQL databases.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.