Designing a High‑Performance Distributed KV Store for B‑Station
This article details the background, architecture, core features, and operational practices of a custom high‑reliability, high‑throughput key‑value storage system that combines Raft replication, flexible partitioning, binlog support, bulk loading, and multi‑active deployment for B‑Station's diverse data workloads.
Background
In B‑Station, various data models ranging from complex relational data (accounts, videos) to simple key‑value pairs coexist. Early solutions used MySQL for persistence and Redis for caching, which introduced cache‑consistency challenges (solved via Canal) and high development overhead due to per‑service scripts. The need emerged for a persistent, high‑performance KV store positioned between Redis and MySQL, with strong consistency, reliability, and scalability.
Architecture Overview
The system consists of three core components:
Metaserver : Manages cluster metadata, monitors node health, handles failover, and performs load balancing.
Node : Stores KV data; each node holds a replica of a shard. Replicas achieve consistency via the Raft protocol, and a primary node serves reads/writes. Clients may optionally read from followers to improve availability.
Client : Provides two access methods—proxy (C++ SDK) and native SDK. The SDK obtains shard metadata from Metaserver, routes requests to the appropriate Node, and implements retry and backoff logic.
Cluster Topology
The topology includes Pools, Zones, Nodes, Tables, Shards, and Replicas. Pools group multiple zones for resource isolation. Zones are fault‑isolated network domains (e.g., a data center). Nodes are physical hosts storing data. Tables map to business tables. Shards are logical partitions of a table, and Replicas are Raft‑backed copies of a shard, distributed across zones to ensure fault tolerance. Each replica runs an engine (RocksDB or SparrowDB, the latter optimized for large values).
Key Features
1. Partition Splitting
Both range and hash partitioning are supported. Hash partitioning avoids hotspots but loses global ordering; range partitioning preserves order but can cause write hotspots. Splitting is triggered when shard size grows, with hash mode simply doubling the shard count. The split process involves Metaserver, Node, and Client coordination, updating shard states from splitting to normal after data migration.
2. Binlog Support
Raft logs are repurposed as KV binlogs, enabling real‑time event streaming and cold backup to object storage for long‑term replay. Learner nodes continuously pull binlogs, filter out Raft configuration changes, and forward data downstream.
3. Multi‑Active Deployment
Learner‑driven replication synchronizes writes across data‑center clusters. Read‑only multi‑active allows one region to serve reads while another writes; write‑multi‑active permits concurrent writes in both regions with unit‑level data partitioning to avoid conflicts, using binlog tagging to prevent loops.
4. Bulk Load
Large offline datasets are converted into SST files and uploaded to object storage. Nodes ingest these files directly, bypassing per‑record writes, reducing write amplification and speeding up data availability. Offline compaction can also be performed on SSTs before loading.
5. KV Storage Separation (SparrowDB)
Inspired by Bitcask, SparrowDB separates index (stored in RocksDB) from value data (append‑only files). This reduces write amplification during compaction, as only indexes are rewritten. Small values are inlined to avoid extra I/O, while large values remain in separate files.
6. Load Balancing
Replica count balancing ensures even disk usage, while primary‑replica balancing distributes Raft leaders across nodes to avoid CPU and network hotspots. Metaserver monitors load and migrates replicas accordingly.
7. Failure Detection & Recovery
Metaserver sends heartbeats to Nodes; failed Nodes are marked unhealthy and their replicas are re‑replicated to healthy Nodes. Heartbeat forwarding mitigates false positives due to network partitions. Recovery involves snapshot copying of Raft state, with rate‑limited I/O to protect online traffic.
Practical Experience
RocksDB Optimizations
TTL‑based data expiration using compaction_filter and periodic compaction.
Mitigating scan slowdown from deleted keys via CompactOnDeletionCollector and tuned deletion_trigger.
Controlling write amplification by disabling the collector for specific workloads and adjusting table‑level compaction policies.
Auto‑tuned I/O rate limiting for compaction using NewGenericRateLimiter with auto_tuned and appropriate RateLimiter::Mode.
Disabling WAL for RocksDB writes because Raft guarantees durability, reducing disk I/O.
Raft Enhancements
Dynamic replica reduction to a single copy for rapid recovery when multiple replicas fail.
Aggregated RaftLog commits (e.g., batch every 5 ms or 4 KB) to improve throughput.
Asynchronous flush of RaftLog for workloads tolerant of occasional data loss, avoiding fsync‑induced latency spikes.
Future Directions
Transparent multi‑tier storage with automatic hot‑cold data separation.
Integration of SPDK and PMEM for accelerated I/O.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
