Databases 22 min read

Designing a High‑Performance Distributed KV Store for B‑Station

This article details the background, architecture, core features, and operational practices of a custom high‑reliability, high‑throughput key‑value storage system that combines Raft replication, flexible partitioning, binlog support, bulk loading, and multi‑active deployment for B‑Station's diverse data workloads.

dbaplus Community
dbaplus Community
dbaplus Community
Designing a High‑Performance Distributed KV Store for B‑Station

Background

In B‑Station, various data models ranging from complex relational data (accounts, videos) to simple key‑value pairs coexist. Early solutions used MySQL for persistence and Redis for caching, which introduced cache‑consistency challenges (solved via Canal) and high development overhead due to per‑service scripts. The need emerged for a persistent, high‑performance KV store positioned between Redis and MySQL, with strong consistency, reliability, and scalability.

Architecture Overview

The system consists of three core components:

Metaserver : Manages cluster metadata, monitors node health, handles failover, and performs load balancing.

Node : Stores KV data; each node holds a replica of a shard. Replicas achieve consistency via the Raft protocol, and a primary node serves reads/writes. Clients may optionally read from followers to improve availability.

Client : Provides two access methods—proxy (C++ SDK) and native SDK. The SDK obtains shard metadata from Metaserver, routes requests to the appropriate Node, and implements retry and backoff logic.

Cluster Topology

The topology includes Pools, Zones, Nodes, Tables, Shards, and Replicas. Pools group multiple zones for resource isolation. Zones are fault‑isolated network domains (e.g., a data center). Nodes are physical hosts storing data. Tables map to business tables. Shards are logical partitions of a table, and Replicas are Raft‑backed copies of a shard, distributed across zones to ensure fault tolerance. Each replica runs an engine (RocksDB or SparrowDB, the latter optimized for large values).

Key Features

1. Partition Splitting

Both range and hash partitioning are supported. Hash partitioning avoids hotspots but loses global ordering; range partitioning preserves order but can cause write hotspots. Splitting is triggered when shard size grows, with hash mode simply doubling the shard count. The split process involves Metaserver, Node, and Client coordination, updating shard states from splitting to normal after data migration.

2. Binlog Support

Raft logs are repurposed as KV binlogs, enabling real‑time event streaming and cold backup to object storage for long‑term replay. Learner nodes continuously pull binlogs, filter out Raft configuration changes, and forward data downstream.

3. Multi‑Active Deployment

Learner‑driven replication synchronizes writes across data‑center clusters. Read‑only multi‑active allows one region to serve reads while another writes; write‑multi‑active permits concurrent writes in both regions with unit‑level data partitioning to avoid conflicts, using binlog tagging to prevent loops.

4. Bulk Load

Large offline datasets are converted into SST files and uploaded to object storage. Nodes ingest these files directly, bypassing per‑record writes, reducing write amplification and speeding up data availability. Offline compaction can also be performed on SSTs before loading.

5. KV Storage Separation (SparrowDB)

Inspired by Bitcask, SparrowDB separates index (stored in RocksDB) from value data (append‑only files). This reduces write amplification during compaction, as only indexes are rewritten. Small values are inlined to avoid extra I/O, while large values remain in separate files.

6. Load Balancing

Replica count balancing ensures even disk usage, while primary‑replica balancing distributes Raft leaders across nodes to avoid CPU and network hotspots. Metaserver monitors load and migrates replicas accordingly.

7. Failure Detection & Recovery

Metaserver sends heartbeats to Nodes; failed Nodes are marked unhealthy and their replicas are re‑replicated to healthy Nodes. Heartbeat forwarding mitigates false positives due to network partitions. Recovery involves snapshot copying of Raft state, with rate‑limited I/O to protect online traffic.

Practical Experience

RocksDB Optimizations

TTL‑based data expiration using compaction_filter and periodic compaction.

Mitigating scan slowdown from deleted keys via CompactOnDeletionCollector and tuned deletion_trigger.

Controlling write amplification by disabling the collector for specific workloads and adjusting table‑level compaction policies.

Auto‑tuned I/O rate limiting for compaction using NewGenericRateLimiter with auto_tuned and appropriate RateLimiter::Mode.

Disabling WAL for RocksDB writes because Raft guarantees durability, reducing disk I/O.

Raft Enhancements

Dynamic replica reduction to a single copy for rapid recovery when multiple replicas fail.

Aggregated RaftLog commits (e.g., batch every 5 ms or 4 KB) to improve throughput.

Asynchronous flush of RaftLog for workloads tolerant of occasional data loss, avoiding fsync‑induced latency spikes.

Future Directions

Transparent multi‑tier storage with automatic hot‑cold data separation.

Integration of SPDK and PMEM for accelerated I/O.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

load balancingBinlogbulk loadfailure recoveryPartitioningdistributed KV storeRaft replication
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.