Design and Implementation of a Multi‑Level Comment Storage System for Bilibili
This article presents a comprehensive design of Bilibili's comment service architecture, detailing the transition from TiDB to a multi‑level storage system based on Taishan KV, the data models, consistency mechanisms, retry and versioning strategies, and a hedging‑based degradation policy to ensure high availability under heavy traffic.
1. Background
The comment system is a core component of Bilibili's ecosystem, influencing user interaction, content recommendation, community culture, and overall platform stability. During hot events, comment traffic spikes dramatically, stressing the service and making high cache hit rates essential; cache misses lead to direct TiDB queries, risking service outages.
2. Architecture Design
The existing architecture relies on Redis for caching and TiDB for storage, using various sorted indexes (likes, time, hotness) stored in Redis Sorted Sets. When cache misses occur, TiDB queries become slow, consuming CPU and memory and degrading overall performance.
To avoid TiDB single‑point failures, a new multi‑level storage system built on Bilibili's self‑developed Taishan KV is introduced, converting structured indexes to unstructured storage and SQL queries to high‑performance NoSQL operations.
SELECT id FROM reply WHERE ... ORDER BY like_count DESC LIMIT m,n
3. Storage Design
Two abstract data models are defined: Index (sorted indexes) and KV (comment material). The table below compares TiDB and Taishan representations:
| Abstract Data Model | TiDB Model | Taishan Model | Description |
| --- | --- | --- | --- |
| Index | Secondary Index | Sorted Set | Sorted indexes such as likes or time order |
| KV | Primary Key & Row | Key‑Value | Metadata and content of a comment |
The Index+KV model enables efficient pagination and real‑time updates using Redis Sorted Sets for indexes and KV stores for comment data.
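As a rough illustration of the Index+KV read path, the sketch below pages through a sorted index and then fetches comment bodies from a key-value map. It is an in-memory stand-in only: `SortedIndex`, `list_replies`, and the sample data are hypothetical names invented for this example and do not reflect Taishan's or Redis's actual API.

```python
import bisect


class SortedIndex:
    """In-memory stand-in for a sorted-set index (reply_id -> score)."""

    def __init__(self):
        self._scores = {}   # reply_id -> current score
        self._order = []    # sorted list of (-score, reply_id): highest score first

    def upsert(self, reply_id, score):
        # Drop the old entry, if any, before inserting the new score.
        old = self._scores.get(reply_id)
        if old is not None:
            i = bisect.bisect_left(self._order, (-old, reply_id))
            del self._order[i]
        self._scores[reply_id] = score
        bisect.insort(self._order, (-score, reply_id))

    def page(self, offset, limit):
        # Plays the role of ORDER BY score DESC LIMIT offset, limit.
        return [rid for _, rid in self._order[offset:offset + limit]]


def list_replies(index, kv, offset, limit):
    """Read a page of IDs from the index, then fetch comment bodies from KV."""
    ids = index.page(offset, limit)
    return [kv[rid] for rid in ids if rid in kv]


# Hypothetical sample data: three comments and a like-count index.
kv = {
    1: {"id": 1, "content": "first"},
    2: {"id": 2, "content": "second"},
    3: {"id": 3, "content": "third"},
}
likes = SortedIndex()
likes.upsert(1, 5)
likes.upsert(2, 9)
likes.upsert(3, 7)
likes.upsert(1, 12)   # new likes arrive: reply 1 moves to the top

top_two = list_replies(likes, kv, 0, 2)
```

Splitting the read into an index lookup followed by a batch of key lookups is what lets the index be updated in real time without touching the comment bodies.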
4. Data Consistency
No ready‑made tool exists for synchronizing TiDB's structured data into Taishan's unstructured format, so the migration risks data loss, write failures, write conflicts, out‑of‑order writes, and synchronization latency. To mitigate these issues, failed writes are placed in a retry queue, and a version number is attached to each record to enforce ordering.
UPDATE reply SET like_count=like_count+1, version=version+1 WHERE id = xxx
During CAS (compare‑and‑swap) operations, the version carried in the binlog is compared with the stored version; the update proceeds only if the incoming version is greater than or equal to the stored one, and stale data is discarded.
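The interaction between the retry queue and the version check can be sketched as follows. This is a single-threaded, in-memory approximation: `VersionedStore`, `apply_events`, and the `fail_once` parameter are hypothetical names introduced for the example, and a real store would need to perform the compare and the write atomically.

```python
from collections import deque


class VersionedStore:
    """In-memory stand-in for a KV store holding (version, value) per key."""

    def __init__(self):
        self._data = {}

    def cas_write(self, key, version, value):
        """Apply the write only if it is not older than the stored version."""
        current = self._data.get(key)
        if current is not None and version < current[0]:
            return False   # stale event: discard
        self._data[key] = (version, value)
        return True

    def get(self, key):
        return self._data.get(key)


def apply_events(store, events, fail_once=frozenset()):
    """Apply binlog-style events; simulated failed writes go to a retry queue."""
    retry_queue = deque()
    for key, version, value in events:
        if (key, version) in fail_once:
            # Simulate a transient write failure on the first attempt.
            retry_queue.append((key, version, value))
            continue
        store.cas_write(key, version, value)
    # Drain the retry queue. The version check keeps a late retry from
    # overwriting data written by a newer event.
    while retry_queue:
        key, version, value = retry_queue.popleft()
        store.cas_write(key, version, value)


store = VersionedStore()
events = [
    ("reply:1", 1, {"like_count": 1}),
    ("reply:1", 2, {"like_count": 2}),
    ("reply:1", 3, {"like_count": 3}),
]
# Version 2 fails at first and is retried only after version 3 has landed.
apply_events(store, events, fail_once={("reply:1", 2)})
final = store.get("reply:1")
```

In this run the retried version 2 write is rejected because version 3 is already stored, so the out-of-order retry cannot roll the like count back.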
5. Degradation Strategy
Given the strict availability requirements, a hedging policy is adopted: after a configurable timeout on the primary store (TiDB or Taishan), a delayed backup request is sent to the secondary store. This balances response time and resource consumption, outperforming simple serial or parallel fallback strategies.
| Degradation Strategy | Advantages | Disadvantages |
| --- | --- | --- |
| Serial | Simple | Long latency, may exceed upstream timeout |
| Parallel | Short latency | Consumes roughly double the request load |
In production, TiDB is set as primary for latency‑sensitive, lightweight queries, while Taishan serves as primary for heavy analytical workloads. An incident where TiKV nodes failed demonstrated seamless automatic downgrade to Taishan, keeping the comment service stable.
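The hedging policy described in this section can be approximated with a small `asyncio` sketch. It assumes the two stores are exposed as coroutine functions; `hedged_read`, `hedge_delay`, and the simulated stores are hypothetical names chosen for illustration, not the production implementation.

```python
import asyncio


async def hedged_read(primary, secondary, hedge_delay):
    """Query primary; if it has not answered within hedge_delay seconds
    (or has already failed), also query secondary and return the first success."""
    tasks = [asyncio.ensure_future(primary())]
    done, _ = await asyncio.wait(tasks, timeout=hedge_delay)
    if done:
        task = next(iter(done))
        if task.exception() is None:
            return task.result()   # primary answered in time: no backup request
        tasks = []                 # primary failed fast: fall through to secondary
    tasks.append(asyncio.ensure_future(secondary()))
    pending = set(tasks)
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return task.result()
        raise RuntimeError("both stores failed")
    finally:
        # Cancel whichever request is still in flight to free resources.
        for task in pending:
            task.cancel()


# Simulated stores (assumptions for the demo only).
async def fast_primary():
    await asyncio.sleep(0.01)
    return "from-primary"


async def slow_primary():
    await asyncio.sleep(0.5)       # simulates an overloaded primary store
    return "from-primary"


async def secondary():
    await asyncio.sleep(0.01)
    return "from-secondary"


async def demo():
    healthy = await hedged_read(fast_primary, secondary, hedge_delay=0.1)
    degraded = await hedged_read(slow_primary, secondary, hedge_delay=0.05)
    return healthy, degraded


result = asyncio.run(demo())
```

When the primary is healthy, only one request is issued, so the extra load of the parallel strategy is avoided; when it is slow, the caller waits roughly `hedge_delay` plus the secondary's latency rather than the full primary timeout of the serial strategy.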
6. Summary and Outlook
The comment service is vital for community engagement on Bilibili. Continuous improvements in storage reliability, consistency mechanisms, and degradation policies aim to deliver a smoother user experience and foster stronger community ties.
High Availability Architecture
Official account for High Availability Architecture.