
Evolution, Design, and Implementation of Bilibili's Distributed KV Storage System

This article details how Bilibili's KV storage system evolved from early solutions like Redis and MySQL to a highly scalable, high‑availability distributed architecture, describing its overall design, node components, data splitting, multi‑active disaster recovery, typical use cases, and operational challenges.

DataFunTalk

The talk, presented by senior Bilibili engineer Lin Tanghui, covers the rapid growth of Bilibili's business and the resulting need for a storage system that can absorb exponential traffic peaks while providing high availability through multi‑active replication.

Storage Evolution

Early KV solutions (Redis/Memcached, Redis+MySQL, HBase) could not keep pace with growth: a single MySQL instance could not store more than about 10 TB, Redis Cluster suffered from Gossip‑based communication overhead as the cluster grew, and HBase incurred high tail latency and memory cost. The team therefore defined requirements for a new system: 100× horizontal scalability, low latency, high QPS, fault tolerance, low cost, and zero data loss.

Design Implementation

1. Overall Architecture – Consists of three parts: Client (SDK access), Metaserver (metadata of tables, shards, and node placement), and Node (stores replicas). Nodes use Raft for replica synchronization, ensuring high availability.
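The read path implied by this architecture can be sketched as follows: the SDK fetches a table's shard routing from the Metaserver once, caches it, and then talks to the shard's node directly. This is a minimal illustrative model; the class and method names are assumptions, not Bilibili's actual SDK API.

```python
import hashlib

class Metaserver:
    """Holds the routing table: which shard lives on which node."""
    def __init__(self, routing):
        # routing: list of (shard_id, node_addr) tuples
        self.routing = routing

    def get_routing(self, table):
        return self.routing

class Client:
    """SDK-side client: caches routing, then routes keys to nodes itself."""
    def __init__(self, metaserver, table):
        self.routing = metaserver.get_routing(table)  # cached in the SDK

    def locate(self, key):
        # hash split: shard index = hash(key) mod shard count
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        shard_id, node = self.routing[h % len(self.routing)]
        return shard_id, node

ms = Metaserver([(0, "node-a:9000"), (1, "node-b:9000"), (2, "node-c:9000")])
c = Client(ms, "user_profile")
shard, node = c.locate("uid:12345")
```

Because routing is cached client-side, the Metaserver stays off the hot path; it is consulted again only when the routing version changes (e.g., after a split).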

2. Cluster Topology – Resources are organized into Pools (online/offline), Zones (fault isolation), Nodes (contain multiple disks and replicas), and Shards (range or hash split).
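The Pool → Zone → Node → Shard hierarchy above can be modeled as nested containers; the field names below are illustrative assumptions, not Bilibili's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Shard:
    shard_id: int
    key_range: tuple  # (start, end) for range split; unused for hash split

@dataclass
class Node:
    addr: str
    disks: int                      # a node holds multiple disks
    shards: list = field(default_factory=list)

@dataclass
class Zone:                         # fault-isolation domain
    name: str
    nodes: list = field(default_factory=list)

@dataclass
class Pool:                         # resource pool, e.g. "online" or "offline"
    name: str
    zones: list = field(default_factory=list)

pool = Pool("online", [Zone("az1", [Node("node-a:9000", disks=12)])])
```

Placing a shard's replicas in different Zones is what gives the fault isolation mentioned above: losing one Zone leaves the other replicas intact.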

3. Metaserver – Manages resources, metadata distribution, health checks, load monitoring, load balancing, split management, Raft leader election, and persists metadata with RocksDB.

4. Node – Contains background threads (Binlog management, garbage collection, health checks, compaction), RPC interface with quota‑based throttling, and an abstract engine layer to handle different workloads (e.g., large‑value optimization).
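The abstract engine layer can be sketched as a common interface with per-workload implementations. The large-value engine below, which keeps big values in a separate log and stores only their location in the index, is a hypothetical illustration of the large-value optimization, not the actual implementation.

```python
from abc import ABC, abstractmethod

class Engine(ABC):
    """Common interface every storage engine on a Node must implement."""
    @abstractmethod
    def put(self, key, value): ...
    @abstractmethod
    def get(self, key): ...

class LargeValueEngine(Engine):
    """Values above a threshold go to an append-only blob log; the index
    keeps only (offset, size), reducing LSM write amplification."""
    def __init__(self, threshold=4096):
        self.index, self.log, self.threshold = {}, bytearray(), threshold

    def put(self, key, value):
        if len(value) >= self.threshold:
            off = len(self.log)
            self.log += value
            self.index[key] = ("blob", off, len(value))   # pointer only
        else:
            self.index[key] = ("inline", value)           # small: keep inline

    def get(self, key):
        rec = self.index.get(key)
        if rec is None:
            return None
        if rec[0] == "inline":
            return rec[1]
        _, off, size = rec
        return bytes(self.log[off:off + size])

eng = LargeValueEngine(threshold=4)
eng.put("big", b"hello")    # >= 4 bytes: stored in the blob log
eng.put("small", b"hi")     # < 4 bytes: stored inline
```

Swapping engines behind the same interface is what lets one Node serve different workloads without changes to the RPC or replication layers.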

5. Data Splitting & Balancing – Shards are split automatically (by range or hash) once they exceed a 24 GB capacity threshold, with transparent request redirection during the split and millisecond‑level rebalancing based on RocksDB checkpoints.
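The range-split behavior can be sketched with a toy shard: when it grows past a size threshold it splits at its midpoint key, and lookups for keys in the upper half are transparently redirected to the child shard. The 24 GB threshold is scaled down here for illustration; all names are assumptions.

```python
SPLIT_BYTES = 64  # stand-in for the real 24 GB threshold

class RangeShard:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi   # half-open key range [lo, hi)
        self.data = {}
        self.child = None           # set once a split has happened

    def size(self):
        return sum(len(v) for v in self.data.values())

    def put(self, key, value):
        if self.child is not None and key >= self.child.lo:
            return self.child.put(key, value)   # transparent redirection
        self.data[key] = value
        if self.size() > SPLIT_BYTES and self.child is None:
            self.split()

    def get(self, key):
        if self.child is not None and key >= self.child.lo:
            return self.child.get(key)          # transparent redirection
        return self.data.get(key)

    def split(self):
        # Split at the midpoint key; move the upper half to the child.
        keys = sorted(self.data)
        mid = keys[len(keys) // 2]
        self.child = RangeShard(mid, self.hi)
        for k in [k for k in self.data if k >= mid]:
            self.child.data[k] = self.data.pop(k)
        self.hi = mid

root = RangeShard("a", "z")
for i in range(10):
    root.put(f"k{i}", "v" * 10)   # 100 bytes total forces at least one split
```

In the real system the moved half would be handed to another node via a RocksDB checkpoint rather than kept as an in-process child, which is what makes rebalancing millisecond-level.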

6. Multi‑Active Disaster Recovery – Binlog‑based cross‑region replication (e.g., between cloud‑cube and Jiading data centers) enables automatic failover when an entire region fails.
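The binlog-based replication described above can be sketched minimally: the primary region appends every write to its binlog, and a replicator ships not-yet-applied entries to the standby region, which replays them. The region names mirror the example in the text; everything else is an illustrative assumption.

```python
class Region:
    def __init__(self, name):
        self.name = name
        self.kv = {}          # the KV store in this region
        self.binlog = []      # ordered log of writes
        self.applied = 0      # how many remote binlog entries we replayed

    def put(self, key, value):
        self.kv[key] = value
        self.binlog.append((key, value))

def replicate(src, dst):
    """Ship and replay the binlog entries the standby has not seen yet."""
    for key, value in src.binlog[dst.applied:]:
        dst.kv[key] = value
        dst.applied += 1

primary, standby = Region("cloud-cube"), Region("jiading")
primary.put("uid:1", "profile-a")
replicate(primary, standby)
# If the primary region fails, traffic fails over to the standby,
# which already holds the replicated data.
```

Tracking an applied offset per standby is what makes replication incremental and restartable after a network interruption.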

Scenarios & Problems

Typical use cases include user profiling for recommendation, dynamic feeds, object storage, and danmaku (on‑screen bullet comments). Optimizations such as Bulkload cut full‑load time from hours to minutes, and a custom fixed‑length list engine handles high‑volume history data.
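The idea behind Bulkload is to skip the normal one-key-at-a-time write path: data is pre-sorted offline and ingested in a single merge, much as RocksDB ingests externally built SST files. The toy store below contrasts the two paths; all names are illustrative assumptions.

```python
import bisect
from heapq import merge

class SortedStore:
    """Keeps keys sorted, like an LSM level."""
    def __init__(self):
        self.keys, self.vals = [], []

    def put(self, key, value):
        # Normal write path: one O(n) insertion per key.
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.vals[i] = value
        else:
            self.keys.insert(i, key)
            self.vals.insert(i, value)

    def bulkload(self, items):
        # Bulkload path: items are pre-sorted offline, so the whole batch
        # merges with existing data in a single linear pass.
        merged = list(merge(zip(self.keys, self.vals), items))
        self.keys = [k for k, _ in merged]
        self.vals = [v for _, v in merged]

store = SortedStore()
store.put("m", 1)                       # normal path
store.bulkload([("a", 2), ("z", 3)])    # pre-sorted batch, one merge
```

One linear merge instead of N random insertions is the same trade that turns hours of full-load writes into minutes.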

Challenges encountered:

Compaction delays for expired keys – solved with periodic compaction checks.

SCAN slowdown due to delete markers – mitigated by delete‑threshold‑triggered compaction and delayed deletion.

Large‑value write amplification – addressed by KV‑storage separation.

Raft replica loss – automatic downgrade to single‑replica mode and scripted recovery.

Log flushing overhead – batch flushing after reaching size or count thresholds improves throughput 2–3×.
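The batch-flushing fix can be sketched as a buffer that is flushed to disk only when a byte-size or entry-count threshold is reached, turning many small fsyncs into one large one. The thresholds below are illustrative, not the production values.

```python
class BatchedLog:
    def __init__(self, max_bytes=4096, max_entries=64):
        self.buf, self.buf_bytes = [], 0
        self.max_bytes, self.max_entries = max_bytes, max_entries
        self.flushes = 0      # how many physical writes we performed
        self.durable = []     # stand-in for data persisted on disk

    def append(self, entry: bytes):
        self.buf.append(entry)
        self.buf_bytes += len(entry)
        # Flush when either threshold is reached.
        if self.buf_bytes >= self.max_bytes or len(self.buf) >= self.max_entries:
            self.flush()

    def flush(self):
        if not self.buf:
            return
        self.durable.extend(self.buf)   # one write + one fsync per batch
        self.flushes += 1
        self.buf, self.buf_bytes = [], 0

log = BatchedLog(max_bytes=100, max_entries=8)
for i in range(20):
    log.append(b"x" * 10)   # 20 entries, but far fewer physical flushes
log.flush()                 # flush the final partial batch
```

Here 20 appends cost only 3 flushes (two full batches of 8 plus the final partial one), which is the mechanism behind the 2–3× throughput gain mentioned above.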

Summary & Reflections

Future directions include tighter KV‑cache integration, Sentinel mode for replica cost reduction, improved slow‑node detection, automated disk‑balance mechanisms, and performance tuning via SPDK/DPDK to further boost KV throughput.

Thank you for listening.

Tags: backend engineering, scalability, high availability, distributed storage, Bilibili, KV
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
