How Bilibili Scales Its Like Service: Architecture, Storage, and Disaster Recovery
This article details Bilibili's thumb‑up system design, covering business capabilities, multi‑layer storage, traffic handling, disaster‑recovery strategies, and future plans to ensure a high‑traffic, reliable like service for videos, posts, comments, and more.
1. Introduction
Bilibili’s "thumb‑up" feature lets users like or dislike various entities such as videos, dynamic posts, columns, comments, and danmaku, forming a special bond between creators and fans.
2. Required System Capabilities
Business capabilities (example: article likes) include:
Like/unlike and dislike/undislike a specific item.
Query like status for a single item or a batch.
Retrieve total like count for an item.
Get a user's liked items list.
Get the list of users who liked an item.
Query a user's total received likes.
Platform capabilities focus on rapid onboarding (configuration‑level) and multi‑tenant data isolation in storage (cache and DB).
Disaster‑recovery capabilities address failures such as:
DB unavailability – fallback to cache.
Cache unavailability – fallback to DB.
Message‑queue outage – automatic downgrade via RPC (Railgun).
Data‑center failure – switch to another site.
Data‑sync delays (e.g., TiDB replication lag) causing count inconsistencies.
Other unknown issues like downstream service crashes or message backlogs.
3. Traffic and Storage Pressures
Global traffic pressure: read queries (like status and like count) exceed 300k QPS, and write operations (like/dislike) exceed 15k QPS. To reduce DB I/O, like‑count writes are aggregated in memory (e.g., in 10‑second windows) and then persisted as a single batched update.
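The windowed aggregation described above can be sketched as follows. This is a minimal illustration, not Bilibili's implementation: the class name LikeCountAggregator, the flush_fn callback, and the injectable clock are all assumptions made for the example. It accumulates net deltas per (business_id, message_id) and hands one batch to the sink once the window has elapsed.

```python
import threading
import time
from collections import defaultdict


class LikeCountAggregator:
    """Buffers like-count deltas in memory and flushes one batch per window.

    Hypothetical sketch: flush_fn stands in for a batched DB update.
    """

    def __init__(self, flush_fn, window_seconds=10.0, clock=time.monotonic):
        self._flush_fn = flush_fn
        self._window = window_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._pending = defaultdict(int)  # (business_id, message_id) -> net delta
        self._window_start = clock()

    def add(self, business_id, message_id, delta):
        """Record a +1/-1 change; flush if the current window has elapsed."""
        batch = None
        with self._lock:
            self._pending[(business_id, message_id)] += delta
            if self._clock() - self._window_start >= self._window:
                batch = self._drain()
        if batch:
            self._flush_fn(batch)

    def flush(self):
        """Force a flush of whatever is pending, e.g. on shutdown."""
        with self._lock:
            batch = self._drain()
        if batch:
            self._flush_fn(batch)

    def _drain(self):
        # Caller must hold the lock. Keys whose net delta is zero are dropped,
        # so a like followed by an unlike costs no DB write at all.
        batch = {k: v for k, v in self._pending.items() if v != 0}
        self._pending.clear()
        self._window_start = self._clock()
        return batch
```

In this sketch a flush is only triggered by the next add call after the window expires; a real service would more likely flush from a background timer so that quiet items are not left pending.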
Asynchronous processing ensures the database can handle writes at a reasonable rate. Before updating a like status, the service fetches the previous state to guarantee correctness.
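One way to see why the previous state matters: the change to the like and dislike counters depends on where the user is coming from, not only on the action. The sketch below assumes a three‑state model (none, like, dislike) and treats repeated actions as no‑ops; the names State, Action, and apply_action are illustrative and do not come from the original article.

```python
from enum import Enum


class State(Enum):
    NONE = 0
    LIKE = 1
    DISLIKE = 2


class Action(Enum):
    LIKE = "like"
    CANCEL_LIKE = "cancel_like"
    DISLIKE = "dislike"
    CANCEL_DISLIKE = "cancel_dislike"


def apply_action(previous, action):
    """Return (new_state, likes_delta, dislikes_delta) for one user action."""
    if action is Action.LIKE:
        new = State.LIKE
    elif action is Action.DISLIKE:
        new = State.DISLIKE
    elif action is Action.CANCEL_LIKE:
        # Cancelling a like that was never given changes nothing.
        new = State.NONE if previous is State.LIKE else previous
    elif action is Action.CANCEL_DISLIKE:
        new = State.NONE if previous is State.DISLIKE else previous
    else:
        raise ValueError(f"unknown action: {action!r}")

    # Counter deltas follow from the state change, so duplicates yield (0, 0).
    likes_delta = int(new is State.LIKE) - int(previous is State.LIKE)
    dislikes_delta = int(new is State.DISLIKE) - int(previous is State.DISLIKE)
    return new, likes_delta, dislikes_delta
```

For example, a like from a user who had previously disliked the item moves both counters at once (likes +1, dislikes -1), while a duplicate like moves neither. Without the previous state, the service could not tell these cases apart.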
Hotspot pressure: popular items generate DB and cache hotspots. A hotspot‑detection mechanism moves hot keys to local memory with a configurable TTL, as described in the internal article https://mp.weixin.qq.com/s/C8CI-1DDiQ4BC_LaMaeDBg.
Data volume pressure: the system stores over a hundred billion like records, prompting a shift toward KV‑style storage to balance cost and performance.
4. Overall System Architecture
The thumb‑up service is divided into five layers:
Traffic routing layer – decides which data‑center receives a request.
Business gateway layer – handles authentication, anti‑fraud filtering, etc.
Thumbup service (thumbup‑service) – provides unified RPC interfaces.
Asynchronous job layer (thumbup‑job) – processes background tasks.
Data layer – includes DB, KV store, and Redis cache.
[Figure: overall architecture of the thumb‑up service]
5. Three‑Tier Data Storage
DB layer (TiDB) stores two core tables:
Likes table – records each like event (user ID, entity ID, type, timestamp) with a composite index on user and entity.
Counts table – aggregates like/dislike totals per business ID and entity ID, indexed for fast queries.
TiDB’s distributed nature removes the need for manual sharding.
Cache layer (Redis) follows a Cache‑Aside pattern. Key designs include:
key-value = count:patten:{business_id}:{message_id} - {likes},{disLikes}
key-value = user:likes:patten:{mid}:{business_id} - member(messageID)-score(likeTimestamp)
The user‑like list is stored as a sorted set (ZSet) with timestamps as scores. To bound cache size, the list is trimmed to a fixed length on each insertion, with overflow reads falling back to the DB.
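The trimming behaviour can be illustrated with a small in‑memory stand‑in for the sorted set. In Redis the same effect is typically achieved with ZADD followed by ZREMRANGEBYRANK; the class below only models that behaviour so it can run without a Redis server. The name BoundedLikeList and the rule used to decide when a read must go to the DB are assumptions for the example.

```python
class BoundedLikeList:
    """In-memory model of a Redis sorted set capped at max_len members."""

    def __init__(self, max_len):
        self.max_len = max_len
        self._scores = {}  # message_id -> like timestamp (the ZSet score)

    def _ordered(self, reverse=False):
        # Order by score, then by member, similar to how a sorted set breaks ties.
        return sorted(self._scores, key=lambda m: (self._scores[m], m), reverse=reverse)

    def add(self, message_id, like_ts):
        # Models ZADD: insert the member or update its score.
        self._scores[message_id] = like_ts
        # Models ZREMRANGEBYRANK key 0 -(max_len + 1): drop the oldest overflow.
        overflow = len(self._scores) - self.max_len
        if overflow > 0:
            for member in self._ordered()[:overflow]:
                del self._scores[member]

    def latest(self, offset, limit):
        """Return a newest-first page, or None if the page reaches past the
        cached window and the caller should read from the DB instead."""
        if offset + limit > self.max_len:
            return None
        return self._ordered(reverse=True)[offset:offset + limit]
```

Returning None for pages beyond the cap keeps the cache honest: it never pretends to hold entries that were trimmed away, and the caller decides how to query the DB for deep pages.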
Local cache (in‑process memory) mitigates cache‑hotspot issues by tracking hot keys using a min‑heap within a configurable time window and caching them locally with an acceptable TTL.
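A rough sketch of this hot‑key mechanism is shown below. It counts accesses per key inside a fixed window, uses a size‑limited min‑heap to pick the most frequently accessed keys when the window closes, and only admits those keys to a local cache with a TTL. The class name HotKeyDetector and the parameters top_k, min_hits, and ttl_seconds are assumptions for illustration; the source does not specify the actual thresholds or data structures beyond the min‑heap and the time window.

```python
import heapq
import time
from collections import Counter


class HotKeyDetector:
    """Tracks per-key hits in a window and caches only the hottest keys locally."""

    def __init__(self, top_k=10, window_seconds=1.0, min_hits=100,
                 ttl_seconds=5.0, clock=time.monotonic):
        self._top_k = top_k
        self._window = window_seconds
        self._min_hits = min_hits
        self._ttl = ttl_seconds
        self._clock = clock
        self._counts = Counter()
        self._window_start = clock()
        self._hot = set()
        self._local = {}  # key -> (value, expires_at)

    def record(self, key):
        """Count one access; when the window closes, recompute the hot set."""
        now = self._clock()
        if now - self._window_start >= self._window:
            self._hot = self._top_keys()
            self._counts.clear()
            self._window_start = now
        self._counts[key] += 1

    def _top_keys(self):
        heap = []  # min-heap of (hits, key), never larger than top_k
        for key, hits in self._counts.items():
            if hits < self._min_hits:
                continue
            if len(heap) < self._top_k:
                heapq.heappush(heap, (hits, key))
            elif hits > heap[0][0]:
                # Evict the coldest of the current candidates.
                heapq.heapreplace(heap, (hits, key))
        return {key for _, key in heap}

    def is_hot(self, key):
        return key in self._hot

    def put_local(self, key, value):
        # Only keys flagged as hot are admitted to the local cache.
        if key in self._hot:
            self._local[key] = (value, self._clock() + self._ttl)

    def get_local(self, key):
        entry = self._local.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._local[key]
            return None
        return value
```

The short TTL is the trade‑off: a hot item's count may be a few seconds stale in local memory, in exchange for keeping the bulk of its reads off the shared Redis instance.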
KV migration (Taishan) reduces TiDB storage costs and provides an additional disaster‑recovery copy. Data is organized as:
1_{mid}_${business_id}_${type}_${message_id} => {origin_id}_{mtime}
2_{mid}_${business_id}_${type}_${mtime}_{message_id} => {origin_id}
3_{message_id}_${business_id}_${type}_${mtime}_${mid} => {origin_id}
6. Service Layer (thumbup‑service)
The service runs in two data‑centers with active‑passive DB proxy failover. When one DB fails, traffic is switched to the backup site. Two independent Redis clusters (one per site) are kept in sync via asynchronous jobs that consume TiDB binlog events.
Critical APIs (like status, count, list) have fallback data: if all caches fail, the KV store serves the request; if KV also fails, TiDB serves with rate‑limiting. All writes include retry mechanisms, and occasional inconsistencies are tolerated given the high‑volume nature of likes.
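The read fallback order described above (cache, then KV, then rate‑limited TiDB) might look roughly like the following. This is a sketch under assumptions: the getter callables, the Unavailable exception, and the token‑bucket limiter are hypothetical stand‑ins, and the source does not say which rate‑limiting algorithm is used.

```python
import time


class Unavailable(Exception):
    """Raised by a storage getter when its backend cannot serve the request."""


class TokenBucket:
    """Simple token-bucket limiter guarding the last-resort DB path."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self._rate = rate_per_sec
        self._burst = burst
        self._clock = clock
        self._tokens = float(burst)
        self._last = clock()

    def allow(self):
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self._tokens = min(self._burst, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


def read_with_fallback(key, cache_get, kv_get, db_get, db_limiter):
    """Try cache, then KV, then the DB (rate-limited). Returns (source, value)."""
    for source, getter in (("cache", cache_get), ("kv", kv_get)):
        try:
            return source, getter(key)
        except Unavailable:
            continue  # this tier is down, try the next one
    if not db_limiter.allow():
        # Shedding load here protects the DB from absorbing full cache traffic.
        raise Unavailable("db fallback rate limit exceeded")
    return "db", db_get(key)
```

The limiter matters because the DB is sized for a small fraction of read traffic; if both the cache and the KV store are down, serving some requests and rejecting the rest is preferable to overloading TiDB.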
7. Asynchronous Job Layer (thumbup‑job)
Jobs handle:
Persisting user actions (likes, dislikes, cancellations) to DB.
Refreshing caches for like status, lists, and counts.
Publishing asynchronous messages for downstream services.
Write‑back strategies are considered for future scaling: currently both DB and cache writes occur in the async flow; later, writes may target cache first, then persist asynchronously.
Binlog reliability is monitored; if delays or breaks occur, the service falls back to sending critical events directly from thumbup‑service, ensuring downstream consumers still receive updates.
8. Future Plans
Modularize the thumb‑up service into finer‑grained units.
Platform‑ize the service to support custom data‑partition isolation for different business lines.
Explore new business models derived from the like interaction, extending beyond simple engagement metrics.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.