RDB: Cloud Music's Customized Algorithm Feature KV Storage System Based on RocksDB
To meet Cloud Music's massive algorithm-feature KV storage needs, the team built RDB, a RocksDB-based engine within the Tair framework. RDB adds bulk loading, dual-version imports, key-value separation, in-place sequence appends, and ProtoBuf field updates, cutting storage cost, write amplification, and latency while scaling to billions of records and millions of QPS.
Business Background
Cloud Music's recommendation and search business requires storing large-scale algorithm feature data in key-value format for online read/write services. These features, derived from Spark or Flink tasks on the big data platform (including song features and user features), are characterized by large data volumes with daily full or real-time incremental updates and high query performance requirements. Previously, these features were stored in either Redis/Tair (memory-based) or MyRocks/HBase (disk-based) systems.
To reduce the cost of integrating multiple storage systems and to tailor development to the characteristics of algorithm-feature KV data, the team introduced the RocksDB engine under the Tair distributed storage framework, supporting larger-scale algorithm-feature KV scenarios at lower cost. The memory-based storage (memcache engine) is called MDB, while the disk-based storage (RocksDB engine) is called RDB.
RDB Introduction
Tair, as a distributed storage framework, consists of ConfigServer and DataServer. DataServer nodes store actual data, with KV data divided into buckets based on hash values. Each bucket can have multiple replicas across different DataServer nodes, with routing determined by the routing table built by ConfigServer.
RocksDB is an open-source KV storage engine based on LSM (Log Structured Merge) structure, composed of multiple SST files organized in levels. Each SST file contains sorted KV data with metadata. Data is written to level0, and when it reaches a threshold, it compacts to level1 and beyond.
In Tair's RocksDB implementation, the key format is: bucket_id + area_id + original_key. The bucket_id facilitates data migration by bucket, while area_id distinguishes different business tables. For large tables, data is stored in separate column families to avoid key overlap. The value format includes metadata (modification time, expiration time) plus the original value, enabling expiration-based data deletion through custom CompactionFilter.
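The key and value layouts described above can be sketched as follows. This is a minimal Python sketch: the field widths (2-byte ids, 4-byte timestamps) and helper names are illustrative assumptions, not Tair's actual wire format.

```python
import struct
import time

def encode_key(bucket_id: int, area_id: int, original_key: bytes) -> bytes:
    """Prefix the user key with bucket_id and area_id (big-endian, so keys
    in the same bucket and area sort together in the LSM tree, which also
    makes per-bucket migration a contiguous range scan)."""
    return struct.pack(">HH", bucket_id, area_id) + original_key

def encode_value(original_value: bytes, expire_at: int = 0) -> bytes:
    """Prepend metadata (modification time, expiration time) to the value."""
    return struct.pack(">II", int(time.time()), expire_at) + original_value

def is_expired(encoded_value: bytes, now: int) -> bool:
    """The check a custom CompactionFilter would run to drop expired KVs
    during compaction (expire_at == 0 means no expiration)."""
    _mtime, expire_at = struct.unpack_from(">II", encoded_value)
    return expire_at != 0 and expire_at < now
```

Because `bucket_id` leads the key, all keys of one bucket form a contiguous key range, which is what makes bucket-granularity migration cheap.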
Bulkload for Batch Data Import
Algorithm feature data is often computed offline on the big data platform daily, with table scales frequently exceeding 100GB and 100 million records. The basic RDB could only write via put interface, requiring many concurrent tasks and causing significant write amplification during RocksDB compaction.
To address this, the team implemented a HBase-like bulkload mechanism: first sort and convert data to SST format using Spark, then load SST files into RDB clusters via RocksDB's ingest mechanism. This approach improves efficiency in two ways: batch file loading instead of individual put operations, and pre-sorted data reducing internal compaction.
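The two-phase flow can be illustrated with a small sketch, using an in-memory dict in place of the RDB cluster and plain sorted lists in place of SST files; the function names are hypothetical.

```python
def build_sorted_segments(records, segment_size):
    """Phase 1 (the Spark side in the real pipeline): globally sort the KV
    records and cut them into sorted, non-overlapping segments, mimicking
    what SstFileWriter would produce."""
    sorted_records = sorted(records)  # global sort by encoded key
    return [sorted_records[i:i + segment_size]
            for i in range(0, len(sorted_records), segment_size)]

def ingest_segments(db, segments):
    """Phase 2 (the RDB side): load whole segments at once, analogous to
    RocksDB's IngestExternalFile, instead of issuing one put per record."""
    for segment in segments:
        db.update(segment)  # bulk insert, no per-key put
```

Because segments are already sorted and non-overlapping, the real ingest can place the files at a low LSM level directly, which is where the compaction savings come from.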
A performance comparison showed bulkload delivering roughly a 3x improvement over put in IO pressure, read latency (RT), and compaction volume. Test scenario: on top of 3.8TB of full data (7.6TB with 2 replicas), 2.1 billion incremental records totaling 300GB (600GB with 2 replicas) were imported in about 100 minutes while serving 12k read QPS.
Dual-Version Data Import
Building on bulkload, the dual-version mechanism further reduces disk IO for full-coverage data updates. Two versions (area_ids) correspond to two column families in RocksDB; the import version and the read version are staggered and switched alternately. Stale data in the standby version is cleared before each import, completely avoiding compaction during the import.
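The alternating-version scheme can be sketched as follows; this is a toy in-memory model (dicts stand in for the two column families, and the class name is hypothetical).

```python
class DualVersionStore:
    """Two column families act as two versions: reads serve from the
    active one while a full import fills the standby one, after which
    the roles are swapped."""
    def __init__(self):
        self.versions = [{}, {}]  # two column families (two area_ids)
        self.active = 0           # index of the version serving reads

    def read(self, key):
        return self.versions[self.active].get(key)

    def full_import(self, records):
        standby = 1 - self.active
        # Clearing the standby version first means the ingest that follows
        # never overlaps live data, so no compaction work is triggered.
        self.versions[standby].clear()
        self.versions[standby].update(records)
        self.active = standby  # switch readers to the fresh version
```

In the real system the switch is a routing/metadata change, so readers never observe a half-imported dataset.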
Key-Value Separation Storage
RocksDB's compaction causes write amplification, especially severe for long values. The team implemented KV separation based on WiscKey research: values are stored separately in blob files, with the LSM tree only storing position indexes (fileno + offset + size).
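The index-plus-blob layout can be sketched as below. A single in-memory bytearray stands in for on-disk blob files, a dict stands in for the LSM tree, and GC of dead values is omitted; names are illustrative.

```python
class BlobStore:
    """KV separation: the LSM tree keeps only a small (file_no, offset,
    size) pointer per key, while the value body lives in an append-only
    blob file. Compaction then moves pointers, not long values."""
    def __init__(self):
        self.blob_files = {0: bytearray()}  # file_no -> blob file contents
        self.index = {}                     # key -> (file_no, offset, size)
        self.current = 0                    # blob file currently appended to

    def put(self, key, value: bytes):
        blob = self.blob_files[self.current]
        offset = len(blob)
        blob.extend(value)  # value body goes to the blob file
        self.index[key] = (self.current, offset, len(value))  # LSM entry

    def get(self, key):
        file_no, offset, size = self.index[key]
        return bytes(self.blob_files[file_no][offset:offset + size])
```

Overwritten values leave dead bytes behind in the blob file, which is why the real implementation needs the GC mechanism mentioned below.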
The implementation uses TiDB's open-source KV separation plugin, with minimal intrusion into RocksDB code and a GC mechanism to reclaim invalid data. Test results showed that for long values, KV separation improved performance in both random-write and bulkload scenarios: random write saw a 90% reduction in read RT, and bulkload over 50%. KV separation pays off for values longer than roughly 0.5KB~0.7KB; the online deployment sets the threshold at 1KB.
Test case with an average value length of 5.3KB and 800GB of full data (160 million records): without KV separation, average read RT was 1.02ms; with KV separation, it dropped to 0.44ms, a 57% reduction.
Sequence Append
Building on KV separation, the team implemented in-place value updates in blob files. For sequence-type values (like user history), updates append short sequences to long sequences. The original approach required read-append-write, causing excessive data IO.
The solution pre-allocates reserved space in blob files for each value (similar to STL vector's memory allocation). Sequence append writes directly to the value's tail in the blob file. If reserved space is insufficient, it falls back to read-append-write.
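A minimal sketch of the reserved-space idea is below, with a single bytearray standing in for the blob file. The growth factor of 2 and the class name are assumptions for illustration, not the production configuration.

```python
class AppendableBlob:
    """Each value gets reserved slack in the blob file (like a vector's
    capacity) so short appends can be written in place at the tail."""
    GROWTH = 2  # assumed over-allocation: capacity = len(value) * GROWTH

    def __init__(self):
        self.blob = bytearray()
        self.index = {}  # key -> [offset, length, capacity]

    def put(self, key, value: bytes):
        cap = len(value) * self.GROWTH
        offset = len(self.blob)
        self.blob.extend(value + b"\x00" * (cap - len(value)))  # pad slack
        self.index[key] = [offset, len(value), cap]

    def append(self, key, tail: bytes):
        offset, length, cap = self.index[key]
        if length + len(tail) <= cap:
            # Fast path: write directly into the reserved tail space.
            self.blob[offset + length:offset + length + len(tail)] = tail
            self.index[key][1] = length + len(tail)
        else:
            # Fallback: read-append-write; the old copy becomes garbage
            # for GC to reclaim later.
            self.put(key, self.get(key) + tail)

    def get(self, key):
        offset, length, _cap = self.index[key]
        return bytes(self.blob[offset:offset + length])
```

The fast path turns a sequence append from one read plus one full rewrite into a single short tail write, which is where the IO savings in the next paragraph come from.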
Production results: in a real algorithm-feature scenario, the daily update volume dropped from the TB level to the GB level, and the update time fell from 10 hours to 1 hour.
ProtoBuf Field Updates
Following the success of sequence append, the team extended the mechanism to more general partial-update interfaces (add, incr, etc.). Algorithm feature values are stored in ProtoBuf (PB) format, and the team had previously developed PDB, a memory-based storage engine supporting PB field-level updates, on top of MDB.
The solution reuses PDB's PB update logic and modifies RocksDB code to enable in-place value modification after KV separation, avoiding the excessive disk IO of frequent compaction. The implementation is complete and undergoing testing.
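The shape of a field-level partial update can be sketched as below. A plain dict stands in for a decoded ProtoBuf message, and the op names are illustrative, not the actual PDB protocol.

```python
def apply_field_update(message: dict, op: str, field: str, value):
    """Field-level partial update in the spirit of the PDB protocol:
    the client sends an operation on one field instead of reading,
    rewriting, and resending the whole PB value."""
    if op == "set":
        message[field] = value
    elif op == "incr":
        # Increment a numeric field, treating a missing field as 0.
        message[field] = message.get(field, 0) + value
    elif op == "add":
        # Append an element to a repeated (list) field.
        message.setdefault(field, []).append(value)
    else:
        raise ValueError(f"unknown op: {op}")
    return message
```

Combined with in-place blob updates, such an operation touches only the affected field's bytes rather than rewriting the entire value.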
Summary
After over a year of development, RDB has been customized with new features based on algorithm feature data storage characteristics. The online RDB cluster has reached significant scale: hundreds of billions of data records, tens of TB of data, and peak QPS of millions per second.
The RDB self-developed features use a modified RocksDB (with KV separation) as the kernel, with customized application scenarios including offline feature bulkload, real-time feature snapshot, and PB field update protocols.
Limitations include: RDB uses Tair's hash-based partitioning by key, so it does not support range scans as well as range partitioning would, and the currently supported data structures and operation interfaces are relatively simple. Future plans include supporting more functions, such as statistical queries (sum/avg/max) over time-series windows, and building a complete ML feature storage service integrated with the internal Feature Store platform.
NetEase Cloud Music Tech Team