Backend Development 15 min read

How Bilibili Scaled Its KV Storage to Handle Explosive Traffic: Design, Challenges, and Solutions

This article explains how Bilibili’s KV storage system evolved from early Redis/Memcache solutions to a highly available, 100‑times scalable architecture using Raft, sharding, multi‑active disaster recovery, and performance optimizations to support massive traffic peaks and diverse business scenarios.

ITPUB

Oct 9, 2022

How Bilibili Scaled Its KV Storage to Handle Explosive Traffic: Design, Challenges, and Solutions

Storage Evolution

Early Bilibili KV storage used Redis/Memcache, Redis + MySQL, and HBase. Rapid data growth caused three main bottlenecks:

MySQL instances could not hold tables larger than 10 TB.

Redis Cluster’s gossip‑based scaling incurred high communication overhead and long state‑inconsistency windows.

HBase exhibited long‑tail latency and high memory cost for caching.

To overcome these limits the system defined five requirements:

Easy horizontal scaling : support 100× capacity increase.

High performance : low latency, high QPS.

High availability : fault‑tolerant, self‑healing.

Low cost : cheaper than pure caching solutions.

Strong reliability : no data loss.

Design and Implementation

Overall Architecture

The service consists of three layers:

Client : accessed via an SDK; exposes simple put / get APIs.

Metaserver : stores table metadata (shard placement) and resource information; performs health checking, load detection, load balancing, shard splitting, and Raft leader election. Metadata is persisted in RocksDB.

Node : hosts replicas of shards. Each shard is replicated using Raft to guarantee consistency and availability.

Cluster Topology

Resources are organized hierarchically:

Pool : logical resource pool (online/offline) based on business isolation.

Zone : fault‑isolation unit; replicas of a shard are placed in different zones.

Node : storage server containing multiple disks; runs one or more replicas.

Shard : a range or hash partition of a table (default size 24 GB). Large tables are split into multiple shards.

Metaserver Functions

Resource management – tracks pools, zones, node counts.

Metadata distribution – records which shards reside on which nodes.

Health checking – monitors node status, disk health, and triggers self‑healing.

Load detection – collects CPU, memory, and disk usage metrics.

Load balancing – redistributes data when usage thresholds are exceeded.

Shard split management – creates new shards when existing ones exceed capacity.

Raft leader election – ensures continuity when a Metaserver fails.

RocksDB persistence – stores metadata reliably.

Node Components

Backend threads – manage binlog (write‑ahead logging for crash recovery), cold‑data off‑load to object storage (e.g., S3), data reclamation, health checks, and RocksDB compaction.

RPC interface – expose metrics (QPS, latency) and enforce per‑tenant quota for multi‑tenant isolation.

Abstract engine layer – selects storage engines (e.g., large‑value engine) based on workload characteristics.

Data Splitting and Balancing

Shards start at 24 GB. When a shard exceeds its limit, the system performs a split using either range or hash partitioning:

Create a RocksDB checkpoint (≈1 ms) to snapshot the current data.

Update shard metadata on the Metaserver.

Apply a Compaction Filter to reclaim obsolete data.

Distribute the new shards across nodes according to load‑balancing thresholds.

The entire process runs in milliseconds, avoiding the per‑key copy overhead of traditional master‑slave replication.

Multi‑Active Disaster Recovery

Writes are replicated via binlog to geographically separated sites (e.g., Cloud‑Cube and Jiading). If a site becomes unavailable, a proxy redirects traffic to the surviving site, providing continuous read/write availability without data loss.

Typical Use Cases

User profiling & recommendation : KV stores billions of user‑profile rows. Bulkload imports RocksDB snapshots from object storage, reducing a 6‑7 hour full import to under 10 minutes.

Fixed‑length list engine : a custom engine maintains a bounded list (e.g., recent playback history). When the length limit is reached, the oldest entries are evicted automatically.

Operational Challenges

Storage‑Engine Issues

Compaction latency for expired keys : expired entries may remain in lower LSM levels. A periodic compaction scan checks TTL and removes stale keys.

Delete‑induced SCAN slowdown : massive delete markers degrade scan performance. The system triggers compaction when delete‑to‑total‑key ratio exceeds a configurable threshold and also employs delayed‑deletion metrics to avoid excessive write amplification.

Write amplification for large values : large‑value payloads cause high write amplification in LSM. A dedicated large‑value engine stores such payloads separately, reducing amplification.

Raft‑Related Issues

Replica reduction during severe failures : a background script can temporarily downgrade a three‑replica group to a single replica, keeping the service alive while a recovery process rebuilds the missing replicas.

Log‑flushing aggregation : writes are buffered until a size or count threshold is reached, then flushed in batches. This improves throughput 2–3× for high‑QPS, small‑value workloads.

Future Directions

Integrate KV cache more tightly with application caches.

Support Sentinel‑mode to lower replica costs.

Improve slow‑node detection and automate disk‑balancing after failures.

Leverage SPDK/DPDK for low‑level network and storage optimizations to further increase KV process throughput.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

scalability KV storage Bilibili

Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.