How Bilibili Scaled Its KV Store to Handle Billions of Requests
This article explains how Bilibili’s KV storage system evolved from early Redis/Memcache solutions to a custom distributed KV architecture, detailing design goals, architecture components, shard management, Raft replication, multi‑active disaster recovery, and the operational challenges solved to support massive traffic growth.
Storage Evolution
Bilibili initially used a mix of Redis/Memcache, Redis+MySQL, and HBase for different workloads. Rapid data growth exposed a problem with each:
MySQL: Could not store tables larger than 10 TB, and its single‑node limitation made it unsuitable for petabyte‑scale tables; TiDB was considered but did not fit the loosely‑related playback history data.
Redis Cluster: The Gossip protocol caused communication overhead and state inconsistency as the cluster grew, which capped scaling.
HBase: Suffered from high tail latency and expensive cache memory.
From these pain points, Bilibili defined five core requirements for a new KV store: 100× horizontal scalability, low latency at high QPS, high availability with self‑healing, lower cost than cache layers, and zero data loss.
Design and Implementation
The solution is a three‑tier architecture consisting of Clients, Nodes, and a Metaserver.
Client: SDK‑based entry point exposing simple put and get APIs, abstracting away distribution details.
Metaserver: Stores table metadata and shard‑to‑node mappings, and manages resource pools, zones, health checks, load balancing, and shard splitting, with Raft leader election and RocksDB‑backed persistence.
Node: Hosts replicas (Raft‑synchronized) and provides three sub‑components:
Background threads handle binlog management, S3 off‑loading of cold data, and data reclamation.
RPC layer includes metrics (QPS, latency) and quota‑based throttling for multi‑tenant isolation.
Abstract engine layer offers pluggable storage engines (e.g., large‑value engine) to adapt to different workload characteristics.
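The quota-based throttling in the RPC layer can be pictured with a small sketch. The article does not say which algorithm Bilibili uses; a token bucket is assumed here as one common way to enforce a per-tenant quota, and the names `TokenBucket`, `Throttler`, `set_quota`, and `admit` are hypothetical.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Refills `rate` tokens per second, holding at most `burst` tokens."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = burst
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Add the tokens earned since the last call, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class Throttler:
    """Hypothetical RPC-layer guard that keeps one bucket per tenant."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def set_quota(self, tenant: str, rate: float, burst: float) -> None:
        self.buckets[tenant] = TokenBucket(rate, burst, self.clock)

    def admit(self, tenant: str) -> bool:
        bucket = self.buckets.get(tenant)
        # Assumption: tenants without a configured quota are not throttled.
        return True if bucket is None else bucket.allow()
```

Because each tenant draws from its own bucket, a burst from one tenant exhausts only that tenant's tokens, which is the isolation property the RPC layer is after.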
Key concepts include:
Pool: Separate online and offline resource pools.
Zone: Availability zones for fault isolation.
Shard: Table data split by range or hash.
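As an illustration of the two sharding modes, the sketch below maps a key to a shard index by hash or by range. It is a minimal model under assumptions, not Bilibili's implementation: the CRC32 hash, the `HashTable` and `RangeTable` classes, and the `shard_for` method are all hypothetical.

```python
import bisect
import zlib
from dataclasses import dataclass
from typing import List


@dataclass
class HashTable:
    """Hash sharding: keys spread evenly, but range scans touch every shard."""
    shard_count: int

    def shard_for(self, key: bytes) -> int:
        # CRC32 is an assumed hash function, chosen because it is in the stdlib.
        return zlib.crc32(key) % self.shard_count


@dataclass
class RangeTable:
    """Range sharding: split_points[i] is the first key of shard i + 1."""
    split_points: List[bytes]  # must be sorted

    def shard_for(self, key: bytes) -> int:
        # bisect_right counts how many split points are <= key.
        return bisect.bisect_right(self.split_points, key)
```

With `RangeTable([b"g", b"n"])`, keys below `b"g"` land in shard 0, keys from `b"g"` up to but excluding `b"n"` in shard 1, and the rest in shard 2. A client would then look up the node for that shard index in the shard-to-node mapping held by the Metaserver.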
Metaserver periodically records node health, disk/CPU/memory usage, and triggers automatic rebalancing when thresholds are exceeded.
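A threshold-triggered rebalance of this kind might be planned roughly as follows. The threshold values, the `NodeReport` fields, and the policy of moving one shard to the node with the least disk usage are all assumptions made for illustration; the article gives no numbers or selection rules.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical thresholds, expressed as fractions of capacity.
THRESHOLDS: Dict[str, float] = {"disk": 0.80, "cpu": 0.85, "mem": 0.90}


@dataclass
class NodeReport:
    """One periodic health record for a node, as the Metaserver might store it."""
    node_id: str
    disk: float
    cpu: float
    mem: float
    shards: List[int] = field(default_factory=list)


def overloaded(report: NodeReport) -> bool:
    return any(getattr(report, name) > limit for name, limit in THRESHOLDS.items())


def plan_moves(reports: List[NodeReport]) -> List[Tuple[int, str, str]]:
    """Return (shard, source node, target node) moves; does not mutate the input."""
    targets = [r for r in reports if not overloaded(r)]
    moves = []
    for src in reports:
        if overloaded(src) and src.shards and targets:
            # Assumed policy: send one shard to the node with the most free disk.
            dst = min(targets, key=lambda r: r.disk)
            moves.append((src.shards[0], src.node_id, dst.node_id))
    return moves
```

A production planner would also need to respect pools and zones so that replicas of one shard do not end up in the same failure domain; that is left out here.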
Shard Splitting and Data Rebalancing
When a shard reaches its 24 GB limit, it is split (range or hash). The Metaserver keeps directing reads and writes to the original shard until the new shard reaches the normal state, so the split is transparent to clients. The split itself uses a RocksDB checkpoint (≈1 ms), followed by data cleanup driven by a compaction filter.
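The split flow can be modeled in a few lines for the hash case. This is a simplified sketch under several assumptions: a dictionary copy stands in for the RocksDB checkpoint, a filtering pass stands in for the compaction filter, the shard count doubles on split, and writes that arrive during the split are ignored. All names (`Shard`, `State`, `split`, `cleanup`, `route`) are hypothetical.

```python
import zlib
from enum import Enum
from typing import Dict, Optional


class State(Enum):
    SPLITTING = "splitting"
    NORMAL = "normal"


class Shard:
    def __init__(self, shard_id: int, data: Optional[Dict[bytes, bytes]] = None,
                 state: State = State.NORMAL) -> None:
        self.shard_id = shard_id
        self.data = dict(data or {})
        self.state = state


def owner(key: bytes, shard_count: int) -> int:
    return zlib.crc32(key) % shard_count


def split(parent: Shard, old_count: int) -> Shard:
    """Create the child from a snapshot of the parent (checkpoint stand-in).

    With modulo hashing, keys of shard p move to either p or p + old_count
    once the shard count doubles, so the child takes id p + old_count.
    """
    return Shard(parent.shard_id + old_count, parent.data, State.SPLITTING)


def cleanup(shard: Shard, new_count: int) -> None:
    """Compaction-filter stand-in: drop keys this shard no longer owns."""
    shard.data = {k: v for k, v in shard.data.items()
                  if owner(k, new_count) == shard.shard_id}


def route(key: bytes, parent: Shard, child: Shard, old_count: int) -> Shard:
    """Keep serving from the parent until the child is marked normal."""
    target = owner(key, old_count * 2)
    if target == child.shard_id and child.state is State.NORMAL:
        return child
    return parent
```

The order matters in this model: the parent is cleaned up only after the child is marked normal, so every key stays readable from the shard that `route` returns at each step.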
Multi‑Active Disaster Recovery
To survive whole‑datacenter failures, KV uses binlog replication across geographically separated sites. If one site fails, a proxy automatically routes traffic to the surviving site, ensuring continuous availability.
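A toy model of this setup, with binlog shipping between two sites and a proxy that fails over, might look like the sketch below. Everything here is an assumption for illustration: the `Site`, `Replicator`, and `Proxy` classes are hypothetical, replication is applied in log order with last write wins, and conflict resolution between concurrent writes at both sites is out of scope.

```python
from typing import Dict, List, Optional, Tuple


class Site:
    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True
        self.store: Dict[str, str] = {}
        self.binlog: List[Tuple[str, str]] = []  # ordered local writes

    def put(self, key: str, value: str) -> None:
        self.store[key] = value
        self.binlog.append((key, value))

    def get(self, key: str) -> Optional[str]:
        return self.store.get(key)


class Replicator:
    """Ships one site's binlog to a peer, remembering how far it has got."""

    def __init__(self, src: Site, dst: Site) -> None:
        self.src, self.dst, self.offset = src, dst, 0

    def sync(self) -> None:
        for key, value in self.src.binlog[self.offset:]:
            # Applied without re-logging so entries do not bounce back.
            self.dst.store[key] = value
        self.offset = len(self.src.binlog)


class Proxy:
    """Routes each request to the first healthy site in preference order."""

    def __init__(self, sites: List[Site]) -> None:
        self.sites = sites

    def _pick(self) -> Site:
        for site in self.sites:
            if site.healthy:
                return site
        raise RuntimeError("no healthy site available")

    def put(self, key: str, value: str) -> None:
        self._pick().put(key, value)

    def get(self, key: str) -> Optional[str]:
        return self._pick().get(key)
```

Note that in this model a write that has not yet been shipped when its site fails is invisible at the surviving site until the failed site comes back and replication catches up.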
Real‑World Scenarios and Issues
Typical workloads include user profiling for recommendation, dynamic feeds, object storage, and bullet‑screen comments. Specific challenges addressed:
Bulkload: An offline Hive job builds RocksDB images and uploads them to object storage; KV then pulls and loads them, cutting load time from several hours to under 10 minutes.
Fixed‑length List: Custom engine maintains a capped list, automatically evicting oldest entries when the length limit is exceeded.
Compaction & Deletion: Periodic compaction checks key expiration; a deletion‑threshold triggers forced compaction to avoid scan‑slowdowns caused by tombstones.
Write Amplification: Large‑value writes cause LSM write amplification; KV separates large values into a dedicated engine to mitigate the effect.
Raft Issues: In extreme cases, the system reduces the replica count to a single node to keep the service alive, then restores full replication once the failed nodes recover. Log flushing is batched by entry count or size, improving throughput by 2–3×.
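The batched log flushing mentioned in the last item can be sketched as a buffer that flushes when either an entry-count or a byte-size limit is reached. The limits, the `BatchedLog` class, and the `flush_fn` callback are hypothetical; the article does not give the actual thresholds.

```python
from typing import Callable, List


class BatchedLog:
    """Buffers log entries and flushes them together to cut down on disk syncs."""

    def __init__(self, flush_fn: Callable[[List[bytes]], None],
                 max_entries: int = 64, max_bytes: int = 1 << 20) -> None:
        self.flush_fn = flush_fn        # e.g. one write plus one fsync per batch
        self.max_entries = max_entries  # assumed limit
        self.max_bytes = max_bytes      # assumed limit
        self.pending: List[bytes] = []
        self.pending_bytes = 0

    def append(self, entry: bytes) -> None:
        self.pending.append(entry)
        self.pending_bytes += len(entry)
        # Flush as soon as either limit is reached.
        if (len(self.pending) >= self.max_entries
                or self.pending_bytes >= self.max_bytes):
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        self.flush_fn(self.pending)
        self.pending = []
        self.pending_bytes = 0
```

A real implementation would probably also flush on a timer so that entries do not wait indefinitely under light load; that part is omitted here.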
Takeaways and Future Work
Operationally, Bilibili plans to automate slow‑node detection and disk‑failure rebalancing, which currently rely on manual intervention. Performance tuning continues at the SPDK/DPDK level to further boost KV throughput.
dbaplus Community
