How Bilibili Scaled Its KV Storage to Handle Explosive Traffic: Design, Challenges, and Solutions
This article explains how Bilibili’s KV storage system evolved from early Redis/Memcache solutions to a highly available architecture designed for 100× capacity growth, using Raft replication, sharding, multi‑active disaster recovery, and performance optimizations to support massive traffic peaks and diverse business scenarios.
Storage Evolution
Early Bilibili KV storage used Redis/Memcache, Redis + MySQL, and HBase. Rapid data growth caused three main bottlenecks:
MySQL instances could not hold tables larger than 10 TB.
Redis Cluster’s gossip‑based scaling incurred high communication overhead and long state‑inconsistency windows.
HBase exhibited long‑tail latency and high memory cost for caching.
To overcome these limits the system defined five requirements:
Easy horizontal scaling : support 100× capacity increase.
High performance : low latency, high QPS.
High availability : fault‑tolerant, self‑healing.
Low cost : cheaper than pure caching solutions.
Strong reliability : no data loss.
Design and Implementation
Overall Architecture
The service consists of three layers:
Client : accessed via an SDK; exposes simple put / get APIs.
Metaserver : stores table metadata (shard placement) and resource information; performs health checking, load detection, load balancing, and shard splitting. Metaserver instances elect a leader among themselves via Raft, and metadata is persisted in RocksDB.
Node : hosts replicas of shards. Each shard is replicated using Raft to guarantee consistency and availability.
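The three layers can be illustrated with a minimal sketch of how an SDK might route a put or get. Everything here is an assumption made for illustration: the names MetaserverStub, ShardInfo, and KVClient are hypothetical, hash routing is only one of the two partitioning modes the article mentions, and plain dictionaries stand in for Nodes.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ShardInfo:
    shard_id: int
    leader: str          # node currently holding the Raft leader replica
    replicas: List[str]  # every node hosting a replica of this shard


class MetaserverStub:
    """Hypothetical stand-in for the Metaserver's shard-placement metadata."""

    def __init__(self, shards: List[ShardInfo]):
        self._shards = shards

    def shard_map(self) -> List[ShardInfo]:
        return list(self._shards)


class KVClient:
    """Hypothetical SDK: caches the shard map and routes each key by hash."""

    def __init__(self, meta: MetaserverStub, nodes: Dict[str, Dict[str, bytes]]):
        self.shards = meta.shard_map()
        self.nodes = nodes  # node name -> dict standing in for a storage Node

    def _route(self, key: str) -> ShardInfo:
        digest = hashlib.md5(key.encode()).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shards)
        return self.shards[index]

    def put(self, key: str, value: bytes) -> None:
        shard = self._route(key)
        # A real write goes to the leader and is replicated through Raft;
        # this sketch simply updates every replica directly.
        for node in shard.replicas:
            self.nodes[node][key] = value

    def get(self, key: str) -> Optional[bytes]:
        shard = self._route(key)
        return self.nodes[shard.leader].get(key)
```

The point of the sketch is that the client only needs the shard map from the Metaserver; the data path itself goes straight to the Node that leads the shard.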
Cluster Topology
Resources are organized hierarchically:
Pool : logical resource pool (online/offline) based on business isolation.
Zone : fault‑isolation unit; replicas of a shard are placed in different zones.
Node : storage server containing multiple disks; runs one or more replicas.
Shard : a range or hash partition of a table (default size 24 GB). Large tables are split into multiple shards.
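The rule that replicas of a shard land in different zones can be sketched as a small placement function. This is an illustrative assumption, not the article's algorithm: the Node fields, the use of replica count as a load signal, and the default replication factor of three are all hypothetical choices.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Node:
    name: str
    zone: str
    replica_count: int  # replicas already hosted; a crude load signal


def place_replicas(nodes: List[Node], replication_factor: int = 3) -> List[Node]:
    """Pick at most one node per zone, preferring lightly loaded nodes."""
    by_zone: Dict[str, List[Node]] = defaultdict(list)
    for node in nodes:
        by_zone[node.zone].append(node)
    if len(by_zone) < replication_factor:
        raise ValueError("not enough zones to isolate every replica")
    # Least-loaded candidate inside each zone.
    candidates = [min(zone_nodes, key=lambda n: n.replica_count)
                  for zone_nodes in by_zone.values()]
    # Across zones, prefer the emptiest candidates; name breaks ties.
    candidates.sort(key=lambda n: (n.replica_count, n.name))
    return candidates[:replication_factor]
```

Because each zone contributes at most one candidate, losing a whole zone removes at most one replica of any shard, which leaves a Raft majority intact for a three-replica group.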
Metaserver Functions
Resource management – tracks pools, zones, node counts.
Metadata distribution – records which shards reside on which nodes.
Health checking – monitors node status, disk health, and triggers self‑healing.
Load detection – collects CPU, memory, and disk usage metrics.
Load balancing – redistributes data when usage thresholds are exceeded.
Shard split management – creates new shards when existing ones exceed capacity.
Raft leader election – ensures continuity when a Metaserver fails.
RocksDB persistence – stores metadata reliably.
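As a rough illustration of load detection feeding load balancing, the sketch below plans a single shard move when a node's disk usage exceeds a threshold. The 0.8 threshold, the NodeLoad structure, and the "move the smallest shard" policy are assumptions; the article does not describe the actual policy, and zone constraints are ignored here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class NodeLoad:
    name: str
    capacity_gb: float
    shards: Dict[int, float] = field(default_factory=dict)  # shard id -> size in GB

    @property
    def usage(self) -> float:
        return sum(self.shards.values()) / self.capacity_gb


def plan_move(nodes: List[NodeLoad],
              threshold: float = 0.8) -> Optional[Tuple[int, str, str]]:
    """Return (shard_id, source, target) for one rebalancing move, or None."""
    source = max(nodes, key=lambda n: n.usage)
    if source.usage <= threshold:
        return None  # no node exceeds the threshold
    target = min(nodes, key=lambda n: n.usage)
    # Move the smallest shard on the hot node; it is the cheapest to relocate.
    shard_id = min(source.shards, key=lambda s: source.shards[s])
    size = source.shards[shard_id]
    if (sum(target.shards.values()) + size) / target.capacity_gb > threshold:
        return None  # the move would only shift the hotspot
    return shard_id, source.name, target.name
```

A real scheduler would repeat this until no node is above the threshold and would also weigh CPU and memory, which the article lists among the collected metrics.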
Node Components
Background threads – manage binlog (write‑ahead logging for crash recovery), cold‑data off‑load to object storage (e.g., S3), data reclamation, health checks, and RocksDB compaction.
RPC interface – exposes metrics (QPS, latency) and enforces per‑tenant quotas for multi‑tenant isolation.
Abstract engine layer – selects storage engines (e.g., large‑value engine) based on workload characteristics.
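Per‑tenant quota enforcement at the RPC layer could look like the following sketch. The article does not say which rate‑limiting algorithm is used; a token bucket is assumed here, and TokenBucket, TenantQuota, and the injectable clock are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/second up to `burst`."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = burst
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantQuota:
    """One bucket per tenant; unknown tenants are rejected."""

    def __init__(self, qps_limits: Dict[str, float],
                 clock: Callable[[], float] = time.monotonic):
        self.buckets = {tenant: TokenBucket(qps, qps, clock)
                        for tenant, qps in qps_limits.items()}

    def admit(self, tenant: str) -> bool:
        bucket = self.buckets.get(tenant)
        return bucket is not None and bucket.allow()
```

Keeping one bucket per tenant means a burst from one business line exhausts only its own tokens, which is the isolation property the RPC layer is described as providing.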
Data Splitting and Balancing
Shards have a default size limit of 24 GB. When a shard exceeds that limit, the system splits it, using either range or hash partitioning:
Create a RocksDB checkpoint (≈1 ms) to snapshot the current data.
Update shard metadata on the Metaserver.
Apply a Compaction Filter to reclaim obsolete data.
Distribute the new shards across nodes according to load‑balancing thresholds.
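The split itself completes in milliseconds because the checkpoint snapshots existing data files instead of copying keys one by one, as traditional master‑slave replication would; data that no longer belongs to a shard is reclaimed afterwards by compaction.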
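The first three split steps can be sketched for the range‑partitioned case. This is a simplified model under stated assumptions: a dictionary copy stands in for the RocksDB checkpoint (a real checkpoint links existing files rather than copying keys), a plain list stands in for the Metaserver's shard table, and compact() stands in for a compaction filter. Distribution of the new shards across nodes is not shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Shard:
    shard_id: int
    lo: str                 # inclusive lower bound of the key range
    hi: Optional[str]       # exclusive upper bound; None means unbounded
    data: Dict[str, bytes] = field(default_factory=dict)

    def owns(self, key: str) -> bool:
        return key >= self.lo and (self.hi is None or key < self.hi)

    def get(self, key: str) -> Optional[bytes]:
        # Keys outside the range are invisible even before being reclaimed.
        return self.data.get(key) if self.owns(key) else None

    def compact(self) -> None:
        # Stand-in for a compaction filter: drop keys this shard no longer owns.
        self.data = {k: v for k, v in self.data.items() if self.owns(k)}


def split_shard(table: List[Shard], parent: Shard, split_key: str,
                left_id: int, right_id: int) -> Tuple[Shard, Shard]:
    if not parent.owns(split_key) or split_key == parent.lo:
        raise ValueError("split key must fall strictly inside the parent range")
    # Step 1: snapshot the parent (dict copies stand in for the checkpoint).
    left = Shard(left_id, parent.lo, split_key, dict(parent.data))
    right = Shard(right_id, split_key, parent.hi, dict(parent.data))
    # Step 2: swap the parent for its two children in the shard table.
    i = table.index(parent)
    table[i:i + 1] = [left, right]
    return left, right
```

Right after the split each child still physically holds the whole parent data set, but reads are already correct because ownership is checked first. Calling compact() later mirrors step 3, where obsolete data is reclaimed in the background.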
Multi‑Active Disaster Recovery
Writes are replicated via binlog to geographically separated sites (e.g., Cloud‑Cube and Jiading). If a site becomes unavailable, a proxy redirects traffic to the surviving site, providing continuous read/write availability without data loss.
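The failover behaviour can be illustrated with a toy model of two sites behind a proxy. The class names and the site names in the usage note are hypothetical, and replication is applied synchronously here purely to keep the sketch short; real binlog shipping between sites would be asynchronous and subject to lag.

```python
from typing import Dict, List, Optional


class Site:
    """One deployment site with an in-memory store and a list of peers."""

    def __init__(self, name: str):
        self.name = name
        self.store: Dict[str, bytes] = {}
        self.healthy = True
        self.peers: List["Site"] = []

    def write(self, key: str, value: bytes) -> None:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        self.store[key] = value
        # Stand-in for binlog replication to the other sites.
        for peer in self.peers:
            if peer.healthy:
                peer.store[key] = value


class Proxy:
    """Sends traffic to the first healthy site in preference order."""

    def __init__(self, sites: List[Site]):
        self.sites = sites

    def _pick(self) -> Site:
        for site in self.sites:
            if site.healthy:
                return site
        raise RuntimeError("no site available")

    def put(self, key: str, value: bytes) -> None:
        self._pick().write(key, value)

    def get(self, key: str) -> Optional[bytes]:
        return self._pick().store.get(key)
```

With two sites named, say, "site-a" and "site-b" and each listed as the other's peer, marking site-a unhealthy makes the proxy serve both reads and writes from site-b without any client change.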
Typical Use Cases
User profiling & recommendation : the KV store holds billions of user‑profile rows. Bulkload imports RocksDB snapshots from object storage, reducing a 6–7 hour full import to under 10 minutes.
Fixed‑length list engine : a custom engine maintains a bounded list (e.g., recent playback history). When the length limit is reached, the oldest entries are evicted automatically.
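The eviction behaviour of the fixed‑length list engine can be mimicked in a few lines. This is only an in‑memory analogy using Python's bounded deque; FixedLengthList and its methods are hypothetical names, and the real engine persists its lists in the storage layer.

```python
from collections import deque
from typing import Deque, Dict, List


class FixedLengthList:
    """Per-key bounded list: appending past the limit evicts the oldest item."""

    def __init__(self, limit: int):
        if limit <= 0:
            raise ValueError("limit must be positive")
        self.limit = limit
        self._lists: Dict[str, Deque[str]] = {}

    def push(self, key: str, item: str) -> None:
        # deque(maxlen=...) drops the oldest element automatically when full.
        self._lists.setdefault(key, deque(maxlen=self.limit)).append(item)

    def recent(self, key: str) -> List[str]:
        """Return the stored items, newest first."""
        return list(reversed(self._lists.get(key, ())))
```

For a playback history capped at three entries, pushing five videos for one user leaves only the last three, so callers never have to trim the list themselves.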
Operational Challenges
Storage‑Engine Issues
Compaction latency for expired keys : expired entries may remain in lower LSM levels. A periodic compaction scan checks TTL and removes stale keys.
Delete‑induced SCAN slowdown : massive numbers of delete markers degrade scan performance. The system triggers compaction when the delete‑to‑total‑key ratio exceeds a configurable threshold and also employs delayed‑deletion metrics to avoid excessive write amplification.
Write amplification for large values : large‑value payloads cause high write amplification in LSM. A dedicated large‑value engine stores such payloads separately, reducing amplification.
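Two of the mitigations above, the delete‑ratio trigger and the TTL scan, reduce to simple predicates. The sketch below is an assumption‑laden illustration: FileStats, the 0.3 threshold, and the entry layout are hypothetical, and no real RocksDB API is called.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class FileStats:
    name: str
    num_entries: int    # total entries in the file, including delete markers
    num_deletions: int  # delete markers among those entries


def files_to_compact(files: List[FileStats],
                     ratio_threshold: float = 0.3) -> List[str]:
    """Select files whose share of delete markers exceeds the threshold."""
    return [f.name for f in files
            if f.num_entries > 0
            and f.num_deletions / f.num_entries > ratio_threshold]


Entry = Tuple[bytes, Optional[float]]  # (value, expiry timestamp or None)


def drop_expired(entries: Dict[str, Entry], now: float) -> Dict[str, Entry]:
    """Keep entries with no TTL or with an expiry still in the future."""
    return {key: (value, expires_at)
            for key, (value, expires_at) in entries.items()
            if expires_at is None or expires_at > now}
```

Making the ratio configurable lets operators trade scan latency against the extra write amplification that more frequent compaction causes, which is the tension the article describes.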
Raft‑Related Issues
Replica reduction during severe failures : a background script can temporarily downgrade a three‑replica group to a single replica, keeping the service alive while a recovery process rebuilds the missing replicas.
Log‑flushing aggregation : writes are buffered until a size or count threshold is reached, then flushed in batches. This improves throughput 2–3× for high‑QPS, small‑value workloads.
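Log‑flushing aggregation amounts to buffering entries and flushing once a count or size threshold is hit. The following is a minimal sketch under assumptions: BatchedLog, its default thresholds, and the flush callback are hypothetical, and a production version would also need a timer so that a quiet period cannot delay a pending write indefinitely.

```python
from typing import Callable, List


class BatchedLog:
    """Buffers log entries and flushes them in batches."""

    def __init__(self, flush_fn: Callable[[List[bytes]], None],
                 max_entries: int = 64, max_bytes: int = 64 * 1024):
        self.flush_fn = flush_fn
        self.max_entries = max_entries
        self.max_bytes = max_bytes
        self.buffer: List[bytes] = []
        self.buffered_bytes = 0

    def append(self, entry: bytes) -> None:
        self.buffer.append(entry)
        self.buffered_bytes += len(entry)
        if (len(self.buffer) >= self.max_entries
                or self.buffered_bytes >= self.max_bytes):
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        # One call (for example one disk sync) covers the whole batch.
        self.flush_fn(self.buffer)
        self.buffer = []
        self.buffered_bytes = 0
```

The gain comes from amortizing the fixed cost of each flush over many small entries, which is why the article reports the improvement for high‑QPS, small‑value workloads in particular.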
Future Directions
Integrate KV cache more tightly with application caches.
Support Sentinel‑mode to lower replica costs.
Improve slow‑node detection and automate disk‑balancing after failures.
Leverage SPDK/DPDK for low‑level network and storage optimizations to further increase KV process throughput.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
