When DeleteRange Becomes a Performance Killer in RocksDB
This article explains how overusing RocksDB's DeleteRange for bulk file deletions can cause severe get latency spikes and uncontrolled memory growth, analyzes the underlying range tombstone mechanisms, and shares a practical optimization that replaces DeleteRange with regular deletes.
1. Background
Our distributed file storage system (polefs) uses RocksDB for the metadata subsystem. The metadata table uses inode as primary key, with column families cf1, cf2, etc. The rows are schema‑less; some column names are unpredictable because they record client‑specific information, e.g., inode‑>clientId indicating which client opened the inode.
When deleting a file we only know a prefix, not the exact column names, so we first scan to collect all columns and then delete them. To accelerate massive file deletions we heavily use RocksDB DeleteRange, which deletes a range of columns within a single inode and is a high‑frequency operation.
2. RocksDB DeleteRange Overview
Range deletion is a common requirement in KV stores, e.g., deleting all keys with a given prefix. Before DeleteRange, the implementation required a costly Scan+Delete loop:
// Before DeleteRange: inefficient Scan+Delete
void delete_by_prefix_old(const std::string& prefix) {
    rocksdb::Iterator* it = db->NewIterator(read_options);
    for (it->Seek(prefix); it->Valid() && it->key().starts_with(prefix); it->Next()) {
        db->Delete(write_options, it->key()); // delete keys one by one
    }
    delete it;
    // time complexity O(N): cost grows linearly with the number of keys
}

With DeleteRange the same operation becomes extremely efficient:
// Using DeleteRange
void delete_by_prefix_new(const std::string& prefix) {
    std::string end = prefix + "\xff"; // exclusive end key of the range
    db->DeleteRange(write_options, db->DefaultColumnFamily(), prefix, end);
    // O(1): writes a single range tombstone, regardless of how many keys it covers
}

Performance comparison: deleting 100 000 rows requires 100 000 individual deletes with the traditional method, while DeleteRange needs only a single range delete.
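As a sanity check on the end-key trick above, here is a toy predicate (not part of RocksDB) showing that [prefix, prefix + "\xff") covers exactly the keys that begin with the prefix, under the assumption that no key byte in the covered range reaches 0xff:

```cpp
#include <string>

// Toy model of DeleteRange's [start, end) coverage using the
// prefix + "\xff" end-key trick. Assumes key bytes stay below 0xff;
// a key such as "inode:42:\xff..." would sort at or after the end
// key and fall outside the range.
bool in_delete_range(const std::string& key, const std::string& prefix) {
    std::string end = prefix + '\xff';  // exclusive end of the range
    return key >= prefix && key < end;  // DeleteRange deletes [start, end)
}
```

If keys may legitimately contain 0xff bytes, a safer end key is the prefix with its last byte incremented.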
3. Misusing DeleteRange Leads to Two Problems
Problem 1: Get latency spikes
In environments with massive, continuous DeleteRange operations, Get latency keeps rising even when the queried key resides in the MemTable or BlockCache. The reason is that each DeleteRange creates a range tombstone, and every Get must check all tombstones that may cover the key, turning the lookup into an O(n) operation in the number of un-compacted tombstones.
// Simplified RocksDB Get path with range tombstones
Status Get(const Slice& key) {
    // Check range tombstones in MemTables
    for (auto* memtable : memtable_list_) {
        if (memtable->range_tombstones_.Covers(key)) {
            return Status::NotFound(); // deleted by a range tombstone
        }
        Status s = memtable->Get(key, value, &found);
        if (found) return s;
    }
    // Check SST files' range tombstones, level by level
    for (auto& level : levels_) {
        for (auto& file : level.files) {
            if (file->range_tombstones.Covers(key)) {
                return Status::NotFound();
            }
            // normal SST lookup …
        }
    }
    return Status::NotFound(); // not found at any level
}

Problem 2: Uncontrolled memory growth
Memory usage keeps rising; BlockCache and WriteBufferManager cannot bound it. About 31 % of memory is held by iterators that process DeleteRange records.
Each DeleteRange adds a new range tombstone to the in‑memory structures (RangeTombstoneSet and VersionStorageInfo). These tombstones remain in memory until compaction, causing fragmentation and increased lookup cost.
class RangeTombstoneSet {
    std::vector<RangeTombstone> tombstones_; // each DeleteRange adds one entry
    // tombstones stay in memory until compaction removes them
};

class VersionStorageInfo {
    std::vector<FileMetaData*> level_files_;
    // each SST file carries a RangeTombstoneList that grows with every DeleteRange
};

A bad pattern is frequent small DeleteRange calls, for example:
// Poor practice: frequent small DeleteRange calls
class MetadataManager {
    void cleanup_expired_metadata() {
        while (true) {
            auto expired_ranges = find_expired_ranges();
            for (auto& range : expired_ranges) {
                // each expired range becomes its own tombstone
                db->DeleteRange(options_, cf_, range.start, range.end);
            }
            std::this_thread::sleep_for(std::chrono::seconds(60));
        }
    }
};

This leads to tombstone fragmentation, longer query paths, and memory accumulation that compaction may not release in time.
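The fragmentation cost can be sketched with a toy model (this is an illustration, not RocksDB's actual data structure): each small DeleteRange leaves one tombstone, and a point lookup must examine tombstones until one covers the key, so lookup cost grows linearly with the number of un-compacted tombstones:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy tombstone set: every delete_range call appends one tombstone,
// and covers() scans them linearly, mirroring why many small
// DeleteRange calls inflate read cost before compaction runs.
struct Tombstone { std::string start, end; }; // covers [start, end)

struct ToyTombstoneSet {
    std::vector<Tombstone> tombstones;
    std::size_t checks = 0; // tombstones examined across lookups

    void delete_range(std::string s, std::string e) {
        tombstones.push_back({std::move(s), std::move(e)});
    }
    bool covers(const std::string& key) {
        for (const auto& t : tombstones) { // O(number of tombstones)
            ++checks;
            if (key >= t.start && key < t.end) return true;
        }
        return false;
    }
};
```

After 1 000 small range deletes, a lookup for a key outside every range must examine all 1 000 tombstones before falling through to the normal read path.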
4. Optimization
We solved the issue by eliminating DeleteRange entirely. After refactoring the deletion path, every range delete is expressed as ordinary Delete calls, which preserves deletion performance while avoiding the range-tombstone side effects described above.
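The replacement pattern can be sketched as: scan the prefix once to collect the exact keys, then issue ordinary point deletes in one batch. The sketch below uses std::map as a stand-in for the column family so it is self-contained; a real implementation would iterate with rocksdb::Iterator and stage the deletes in a rocksdb::WriteBatch:

```cpp
#include <map>
#include <string>
#include <vector>

// Sketch of the DeleteRange replacement: collect the keys under a
// prefix, then delete them individually. std::map stands in for the
// column family; swap in rocksdb::Iterator + rocksdb::WriteBatch
// (with Delete() entries) for the real thing.
using Store = std::map<std::string, std::string>;

std::size_t delete_by_prefix(Store& store, const std::string& prefix) {
    std::vector<std::string> batch; // staged point deletes
    for (auto it = store.lower_bound(prefix);
         it != store.end() &&
         it->first.compare(0, prefix.size(), prefix) == 0;
         ++it) {
        batch.push_back(it->first);
    }
    for (const auto& key : batch) {
        store.erase(key); // one point delete per key ("WriteBatch" commit)
    }
    return batch.size(); // number of keys deleted
}
```

Each point delete produces an ordinary tombstone that compaction handles cheaply, instead of a range tombstone that every subsequent read must consult.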
5. Conclusion
DeleteRange is a powerful tool but should be used cautiously; it is unsuitable for high-frequency scenarios. Prefer not to use it at all, or use it sparingly; if it is unavoidable, limit its frequency and trigger compaction regularly to clean up accumulated range tombstone fragments.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
