
How URocksDB Solves RocksDB’s Availability and Performance Challenges in UCloud

This article explains how UCloud’s UKV‑Meta leverages a customized URocksDB storage engine, addresses RocksDB’s durability and read‑performance drawbacks with a compute‑storage separation architecture, and introduces hotspot splitting, hot‑standby replication, and seamless data migration to achieve high availability and scalability in cloud object storage.


Introduction

In the previous article, "UCloud Object Storage US3 Metadata Refactoring (Part 1)", we introduced the new UKV‑Meta metadata service built on UKV, UCloud’s self‑developed distributed KV system with compute‑storage separation. This article describes UKV and its related components.

URocksDB

I. Overview

URocksDB is UKV’s storage engine, a customized key‑value engine derived from the open‑source RocksDB.

II. RocksDB

LevelDB, created by Google, is a simple LSM‑Tree‑based KV store. RocksDB, developed by Facebook on top of LevelDB, adds features such as Column Families. RocksDB is widely used in projects such as TiKV, MyRocks, and CockroachDB.

Drawbacks of RocksDB

RocksDB ensures durability via a Write‑Ahead Log (WAL). Data is first written to the WAL, then to an in‑memory Memtable, which flushes to SST files when thresholds are reached. On crash, RocksDB replays the WAL to recover.
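For context, this write path maps directly onto RocksDB’s standard C++ API: whether a write waits for the WAL to be fsynced is a per‑write option, and the Memtable size that governs flush frequency (and hence WAL replay time) is a tunable. A minimal sketch, with illustrative paths and sizes rather than UKV’s actual settings:

    #include <cassert>
    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      // A larger Memtable means fewer flushes but a longer WAL replay after a crash.
      options.write_buffer_size = 64 << 20;  // 64 MB Memtable (illustrative)

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_demo", &db);
      assert(s.ok());

      // sync = true forces an fsync of the WAL before the write is acknowledged;
      // durability then rests entirely on the disk holding that single WAL.
      rocksdb::WriteOptions wopts;
      wopts.sync = true;
      s = db->Put(wopts, "key1", "value1");
      assert(s.ok());

      delete db;
      return 0;
    }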

Two major problems arise:

WAL replay is linear; with large Memtables the recovery time after a crash can be long.

If the disk fails and the WAL is corrupted, data loss can occur even with RAID.

Consequently, RocksDB’s availability is a significant concern, and its LSM architecture also impacts read performance.

The Common Solution: Share‑Nothing

To guarantee consistency, most solutions place a consensus protocol above RocksDB, replicating each piece of data to multiple RocksDB instances on different machines. TiKV exemplifies this: each group runs three RocksDB replicas in a Share‑Nothing mode, so the loss of a single instance can be recovered through log replay or data migration.

However, this approach has drawbacks:

Frequent compactions cause write amplification, which burns CPU and, multiplied across replicas, shortens SSD lifespan.

Coupled compute and storage leads to resource waste when either side becomes a bottleneck.

Scaling storage requires massive data migration.

Compute‑Storage Separation

Data is stored in a distributed file system (Manul) that ensures consistency and scalability, while compute nodes only handle reads and writes.

URocksDB Hot‑Standby

RocksDB supports a read‑only mode that provides a snapshot but cannot read the latest data. URocksDB implements a Secondary Instance (RSI) mode that reads the latest writes from the underlying distributed storage in real time, keeping the standby node’s memory in sync with the primary and enabling fast failover.

RSI Data Synchronization

Because the storage is distributed, a standby node can see the same data as the primary. It periodically tail‑reads and replays the primary’s WAL and Manifest files into its memory to stay up‑to‑date.
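Upstream RocksDB exposes a similar pattern through its secondary‑instance API, in which a read‑only follower catches up by replaying the primary’s WAL and MANIFEST. The sketch below shows that upstream mechanism on shared storage; the paths are illustrative, and URocksDB’s RSI runs this kind of catch‑up periodically against the shared distributed storage rather than on demand.

    #include <cassert>
    #include <string>
    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    int main() {
      rocksdb::Options options;
      options.max_open_files = -1;  // keep all table files open, as secondary mode expects

      // The secondary shares the primary's data directory (on shared storage)
      // but keeps its own info logs and bookkeeping under secondary_path.
      rocksdb::DB* secondary = nullptr;
      rocksdb::Status s = rocksdb::DB::OpenAsSecondary(
          options, "/shared/primary_db", "/local/secondary_state", &secondary);
      assert(s.ok());

      // Tail the primary's WAL and MANIFEST so the in-memory state
      // (Memtable, VersionSet) tracks the primary's latest writes.
      s = secondary->TryCatchUpWithPrimary();
      assert(s.ok());

      std::string value;
      s = secondary->Get(rocksdb::ReadOptions(), "key1", &value);

      delete secondary;
      return 0;
    }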

RSI Disaster Recovery

When a primary node restarts, it must replay all of its WAL to rebuild the Memtable, which can be time‑consuming when the Memtable is large. The standby node, having continuously tail‑read the latest WAL, already holds an up‑to‑date Memtable and VersionSet, so it can take over as the new primary almost immediately, reducing failover time to under 100 ms.

UKV

UKV is UCloud’s self‑developed distributed KV store with compute‑storage separation. Its storage layer is the distributed system Manul, offering automatic data balancing, heterogeneous media support, and erasure coding. UKV provides cluster management, fast backup, and a metadata‑optimized data structure for US3.

UKV runs URocksDB on its compute nodes and leverages Manul for hot‑standby, rapid hotspot splitting, and other features.

Hotspot Splitting

UKV employs range sharding, which can create hotspots. To address this, UKV implements hotspot splitting, automatically detecting hot nodes and moving data to idle nodes.
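Hotspot detection itself can be driven by simple per‑range traffic statistics: find a range whose recent QPS exceeds a threshold and choose a split key near the middle of its observed traffic. The sketch below illustrates that idea only; the RangeStats fields, threshold, and median tracking are assumptions for illustration, not UKV’s actual policy.

    #include <cstdint>
    #include <map>
    #include <optional>
    #include <string>
    #include <utility>

    // Per-range traffic sample from the last monitoring window (illustrative fields).
    struct RangeStats {
      std::string start_key;
      std::string end_key;
      uint64_t qps = 0;
      std::string median_accessed_key;  // maintained by a request sampler (assumed)
    };

    // Return (range start, split key) for the hottest range, if any exceeds the threshold.
    std::optional<std::pair<std::string, std::string>> PickSplit(
        const std::map<std::string, RangeStats>& ranges, uint64_t hot_qps_threshold) {
      const RangeStats* hottest = nullptr;
      for (const auto& entry : ranges) {
        const RangeStats& stats = entry.second;
        if (stats.qps >= hot_qps_threshold &&
            (hottest == nullptr || stats.qps > hottest->qps)) {
          hottest = &stats;
        }
      }
      if (hottest == nullptr) return std::nullopt;
      // Splitting at the median accessed key leaves roughly half the traffic on each side.
      return std::make_pair(hottest->start_key, hottest->median_accessed_key);
    }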

Design Principles

Service availability – services must not be down for long periods.

Data integrity – metadata must be accurate.

Failure rollback – unexpected states must be recoverable.

Automation – automatically detect hot nodes and select idle targets.

Splitting Process

The split consists of four stages:

SPLITTING_BEGIN: ConfigServer initiates a SplitRequest.

HARDLINK_BEGIN: a UKVRequest carrying the range information updates the hard‑link status.

TARGET_OPEN: ConfigServer sends an OpenRequest to the target, which opens the data without starting compaction.

OPEN_COMPACTION_FILTER: ConfigServer uses heartbeats to enable compaction filters on both source and target.

By combining compute‑storage separation, hard links, and RocksDB’s compaction filter mechanism, the split avoids physically moving data: the target simply inherits the files and filters out keys that fall outside its range during subsequent compactions.
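The final stage relies on RocksDB’s standard CompactionFilter hook: once the target has opened the hard‑linked files, keys outside its new range are discarded lazily as compactions rewrite those files. Below is a minimal sketch of such a filter; the class name and range fields are illustrative, not URocksDB’s actual implementation.

    #include <string>
    #include <rocksdb/compaction_filter.h>
    #include <rocksdb/slice.h>

    // Drops keys outside [range_start, range_end) during compaction, so a node
    // that inherited hard-linked SST files gradually sheds data it no longer owns.
    class RangeCompactionFilter : public rocksdb::CompactionFilter {
     public:
      RangeCompactionFilter(std::string start, std::string end)
          : start_(std::move(start)), end_(std::move(end)) {}

      bool Filter(int /*level*/, const rocksdb::Slice& key,
                  const rocksdb::Slice& /*value*/, std::string* /*new_value*/,
                  bool* /*value_changed*/) const override {
        // Return true to drop the key: it falls outside this node's range.
        return key.compare(start_) < 0 || key.compare(end_) >= 0;
      }

      const char* Name() const override { return "RangeCompactionFilter"; }

     private:
      std::string start_;
      std::string end_;
    };

    // Usage: options.compaction_filter = new RangeCompactionFilter("a", "m");
    // (the filter must outlive the DB; a factory is typical in production code)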

Splitting Timing

With per‑node data capped at 256 GB, an online split completes within 300 ms, making the process invisible to users.

Data Migration

Because the old list service stores keys in a directory‑first structure while the new one uses plain lexicographic order, a compatibility proxy was developed for the list service, avoiding changes to upper‑level business logic.

Migration services safely move metadata from the old architecture to UKV‑Meta without affecting user workloads, ensuring consistency and providing real‑time verification.

Conclusion

High hardware costs drive cloud providers to demand finer‑grained CPU and storage utilization. Coupled compute‑storage architectures cause data reshuffling and high network and CPU overhead during scaling. By separating compute and storage, resources can be scaled independently without data migration costs, allowing each to be optimized for its own characteristics.

Written by

UCloud Tech

UCloud is a leading neutral cloud provider in China, developing its own IaaS, PaaS, AI service platform, and big data exchange platform, and delivering comprehensive industry solutions for public, private, hybrid, and dedicated clouds.
