How RedKV Achieves Billion‑QPS KV Storage with Multi‑Cloud Elastic Scaling
RedKV is Xiaohongshu's self‑developed NVMe‑SSD based distributed NoSQL KV store that combines Gossip and Shard control, delivering petabyte‑scale storage, near‑100 million QPS, multi‑cloud elasticity, low‑cost Redis compatibility, and advanced features such as token‑bucket rate limiting, online compression, backup‑read tail‑latency mitigation, and robust backup‑recovery mechanisms.
Background and Motivation
RedKV is a self‑developed distributed NoSQL KV storage system built on NVMe SSDs for Xiaohongshu, designed to satisfy real‑time write‑to‑disk requirements across the company. RedKV 1.0 uses a Gossip‑based node‑management protocol and already serves close to 100 million QPS with petabyte‑scale data. RedKV 2.0 adopts a centralized shard‑management architecture, supporting multi‑cloud, multi‑replica elastic scaling, cross‑region disaster recovery, and second‑level service switching.
Business Requirements
High‑QPS and low‑latency reads for feature‑data storage (tens of GB/s bandwidth, strict P99 latency).
Large‑scale image caching with low read latency and tolerance for occasional data loss.
Model‑data storage demanding sub‑10 ms P99 latency for tens of TB.
Deduplication services requiring P99 < 10 ms and P999 < 20 ms.
Risk‑control data with millions of QPS and P999 < 30 ms.
Design Goals
Support ultra‑high QPS and sub‑10 ms tail latency.
Reduce resource cost compared with Redis and HBase (≈50 % lower than Redis, >30 % lower than HBase).
Maintain compatibility with major Redis data types (string, hash, zset).
Provide seamless data exchange with Hive for offline analytics.
Three‑Layer Architecture
The system is divided into three layers: a client access layer compatible with the Redis protocol, a stateless proxy layer that handles millions of QPS, and a storage layer that offers reliable read/write services.
Client Access Layer
Provides Redis‑compatible endpoints and SDKs for various languages, allowing existing Redis clients to connect without code changes.
Proxy Layer
The proxy is a stateless CorvusPlus process that is compatible with legacy Redis clients. It offers multi‑threading, I/O multiplexing, and port reuse. Compared with open‑source equivalents, CorvusPlus adds self‑protection, observability, and online configurability. Key features include:
Token‑bucket rate limiting across connections, bandwidth, and QPS dimensions.
Online LZ4 compression of write traffic and decompression on reads, reducing bandwidth and storage usage by >40 % in cache‑heavy scenarios.
Optimized thread model that processes requests per‑key on dedicated worker threads, eliminating lock contention.
Backup‑read mechanism that issues duplicate reads when P99 latency exceeds P95, cutting P999 latency from ~35 ms to ~4 ms.
Big‑key detection using request‑size metrics and HeavyKeeper‑based hot‑key tracking.
Storage Layer
RedKV stores data in RocksDB instances deployed as multiple processes per node. Data is sharded by hash‑based slots to avoid hotspot keys. Supported data types are string, hash, and zset. The internal key format separates meta information (MetaKey/MetaValue) from actual data (DataKey/DataValue), enabling flexible slot‑based migration during scaling.
Gossip vs. Shard Optimizations
RedKV 1.0 improves view convergence time by three steps:
Probe time optimization: When a node has not received a pong for >2 s, it immediately triggers a ping to the suspected failed node, reducing detection time from ~30 s to ~2 s.
PFAIL time optimization: The default node‑timeout (≈15 s) is reduced to 3 s, with a 24‑hour grace period to filter out transient network jitter.
FAIL promotion acceleration: A designated seed node receives all PFAIL messages, allowing the cluster to mark a node as FAIL as soon as any half of the nodes report PFAIL.
RedKV Server Thread Model
Multiple I/O threads listen on a single port; each thread balances incoming connections randomly. Requests for the same key/slot are dispatched to the same worker thread, avoiding per‑key locking. Workers re‑encode data and write to the local RocksDB engine.
Data Replication and Bulk Load
RedKV implements a one‑way replication based on an extended Redis protocol, eliminating third‑party sync components. For multi‑cloud deployments, a central‑controlled replication strategy uses checkpoint‑based data transfer, supporting many‑to‑many shard deployments without extra cleanup jobs.
Bulk loading from Hive is achieved via a custom UDTF that produces SST files on S3, followed by a side‑car ingest process on each RedKV node.
hmset {person}_1 name John quantity 20 price 200.23</code><code>hmset {person}_2 name Henry quantity 30 price 3000.45For massive tables the key can be sharded into 16 prefixes, e.g.:
hmset {person:1}_1 name John quantity 20 price 200.23</code><code>hmset {person:1}_2 name Henry quantity 30 price 3000.45</code><code>...</code><code>hmset {person:16}_100000 name Tom quantity 43 price 234.56Backup and Recovery
RedKV adopts a snapshot‑based disaster‑recovery strategy. A standby cluster stores a configurable number of snapshots. During low‑traffic periods the primary cluster creates snapshots, compresses them, and transfers them via rsync to the standby. In case of failure, the standby can be switched to within seconds using the Proxy’s service‑registration mechanism.
Cross‑Cloud Multi‑Active Design
RedKV’s multi‑active solution deploys a sidecar replicator on the same machine as the main service, providing low network overhead, minimal intrusion, and independent upgrade paths. The design is applicable to Redis, graph databases, and other log‑type storage systems.
Practical Case: zprofile Service Migration
Previously, user and note data were stored in HBase with dual‑cluster deployment, incurring ~70 ms P99 latency at million‑QPS scale and high storage cost. After migrating to RedKV 1.0, the zprofile platform achieved a 36 % cost reduction and a ~5× improvement in P99 latency while handling comparable QPS.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
