How Bilibili Scaled Its Private Messaging System to Handle 10× Traffic
This article analyzes the current bottlenecks of Bilibili's private messaging service, explains the technical challenges of massive data volume and traffic spikes, and presents a comprehensive multi‑layer architecture upgrade—including cache strategies, BFF refactoring, database sharding, and consistency mechanisms—to ensure scalability and reliability.
1. Current Situation
The private messaging service is divided into four business lines—customer service, system notifications, interactive notifications, and private messages. Within private messages there are four sub‑types: single‑chat, batch B‑to‑C messages, group chat, and assistant bots. All sub‑types share the same storage and code paths, leading to tight coupling and scalability limits.
Private-chat conversion is below 10% measured by both page views (PV) and unique visitors (UV), indicating significant room for business-driven optimization.
2. Problems
Conversation Slow Queries
When conversation caches expire, MySQL becomes the sole data source. The conversation store is sharded by uid into 1,000 logical shards: uid % 1000 / 100 selects the database and uid % 100 selects the table within it (100 tables per database). Large-volume users can accumulate over 10 million rows in a single table, causing severe data skew; the resulting slow queries routinely take seconds and surface as empty conversation pages.
Adding indexes or pods provides only short-term relief: neither reduces single-request latency, and extra indexes also slow down writes.
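For concreteness, here is a minimal Go sketch of the uid-based routing implied by that sharding scheme. The function and naming are illustrative, not Bilibili's actual code.

```go
package main

import "fmt"

// shardFor maps a uid onto the scheme described above: uid % 1000
// selects one of 1,000 logical shards, uid % 1000 / 100 picks the
// database, and uid % 100 picks the table within it.
func shardFor(uid int64) (db, table int64) {
	shard := uid % 1000
	return shard / 100, shard % 100
}

func main() {
	db, table := shardFor(1234567)
	fmt.Printf("db_%02d.conversation_%02d\n", db, table) // db_05.conversation_67
}
```

Note that no amount of indexing changes this picture: every read for a given user still lands on exactly one table, so one heavy user concentrates all of their rows and all of their query load on a single shard.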
Private Message Content Storage Limits
Each message carries a unique msgkey, a Snowflake-style ID with an embedded timestamp. Content is partitioned quarterly at the database level and monthly at the table level, which still yields billions of rows per table and a projected 100 billion rows over 25 years. The MySQL instance currently peaks at 790 QPS, but write traffic during spikes can reach 20k QPS, far beyond the safe limit of roughly 3k QPS.
The table also stores attributes for group chat, single chat, and assistant bots, making schema evolution difficult.
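Because the msgkey embeds a timestamp, the write path can derive the quarterly database and monthly table directly from the ID. The sketch below assumes a conventional Snowflake layout (millisecond timestamp above 22 worker/sequence bits) and a made-up epoch; the real msgkey bit layout is not published, so treat both as assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed Snowflake layout: milliseconds since a service epoch in the
// high bits, 22 worker/sequence bits below. Epoch and bit widths are
// illustrative assumptions only.
const (
	epochMs        int64 = 1400000000000 // hypothetical epoch, 2014-05-13 UTC
	timestampShift       = 22
)

// partitionFor recovers the embedded timestamp from a msgkey and maps
// it to the quarterly database and monthly table described above.
func partitionFor(msgkey int64) (db, table string) {
	ms := (msgkey >> timestampShift) + epochMs
	t := time.UnixMilli(ms).UTC()
	quarter := (int(t.Month())-1)/3 + 1
	db = fmt.Sprintf("msg_%d_q%d", t.Year(), quarter)
	table = fmt.Sprintf("content_%d%02d", t.Year(), t.Month())
	return db, table
}

func main() {
	db, table := partitionFor(1234567890123456789)
	fmt.Println(db, table)
}
```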
Server‑Side Code Coupling
All four private‑message types share the same sending and delivery logic and storage, causing code complexity to grow with each new feature and creating resource contention between services.
3. Upgrade Path
Overall Architecture Design
A four‑layer architecture is proposed:
Access Layer: BFF and gateway for consumer-facing (C-side) traffic.
Business Layer: Handles complex queries for the various business scenarios.
Platform Layer: IM-style real-time delivery, scalable and ordered.
Delivery Layer: Persistent-connection (long-link) gateway plus push integration.
Client‑Side Cache Degradation
Critical UI paths (e.g., payment assistant, official account notifications) should retain cached data locally to avoid a blank screen even under extreme load.
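A minimal sketch of that fallback pattern follows, written in Go for consistency with the other examples even though real client code would be mobile or web; `loadRemote`, the `Conversation` type, and the cache layout are all hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Conversation is a trimmed-down stand-in for the real client model.
type Conversation struct {
	PeerID  int64
	Preview string
}

var local sync.Map // uid -> []Conversation, the on-device degradation cache

// loadRemote is a hypothetical network call; here it always fails, to
// simulate the extreme-load case described above.
func loadRemote(ctx context.Context, uid int64) ([]Conversation, error) {
	return nil, errors.New("backend overloaded")
}

// fetchConversations prefers fresh data but falls back to the cached
// copy so critical pages (payment assistant, official notifications)
// never render a blank screen.
func fetchConversations(ctx context.Context, uid int64) ([]Conversation, error) {
	convs, err := loadRemote(ctx, uid)
	if err == nil {
		local.Store(uid, convs) // refresh the degradation cache
		return convs, nil
	}
	if cached, ok := local.Load(uid); ok {
		return cached.([]Conversation), nil // stale but renderable
	}
	return nil, err
}

func main() {
	local.Store(int64(42), []Conversation{{PeerID: 1, Preview: "cached hello"}})
	convs, err := fetchConversations(context.Background(), 42)
	fmt.Println(convs, err)
}
```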
BFF Architecture Upgrade
The BFF absorbs the growing business logic and routes requests to five new microservices: single-chat, group-chat, system notifications, interactive notifications, and message settings. This decouples the old monolithic code and lets each service scale and fail independently.
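A sketch of the dispatch a BFF like this performs is below; the five service names are illustrative discovery keys, not Bilibili's real ones.

```go
package main

import "fmt"

// MsgType enumerates the five business domains the BFF fans out to.
type MsgType int

const (
	SingleChat MsgType = iota
	GroupChat
	SystemNotify
	InteractNotify
	MsgSettings
)

// downstream maps each type to its microservice; the keys here are
// made-up names standing in for real service discovery entries.
var downstream = map[MsgType]string{
	SingleChat:     "im.single-chat",
	GroupChat:      "im.group-chat",
	SystemNotify:   "im.system-notify",
	InteractNotify: "im.interact-notify",
	MsgSettings:    "im.msg-settings",
}

// route is the heart of the BFF in this model: pure dispatch with no
// business logic, so each domain can evolve independently.
func route(t MsgType) (string, error) {
	svc, ok := downstream[t]
	if !ok {
		return "", fmt.Errorf("unknown message type %d", t)
	}
	return svc, nil
}

func main() {
	svc, _ := route(GroupChat)
	fmt.Println("forwarding to", svc) // forwarding to im.group-chat
}
```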
Server Availability Enhancements
After splitting the four layers, focus shifts to the Business and Platform layers:
Cold-hot separation: Redis (hot data, short TTL) → Taishan (Bilibili's distributed KV store, bounded recent data) → MySQL (full data); see the tiered-read sketch after this list.
Read‑write splitting: >95% of complex queries served by read replicas.
Timeline model based on Snowflake IDs for ordered delivery.
Write diffusion for single‑chat, read diffusion for group chat.
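Here is a sketch of the cold-hot read path under those rules, with Taishan modeled behind the same hypothetical `Store` interface as Redis and MySQL; the TTL values and backfill policy are assumptions, not production settings.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Store abstracts one storage tier behind a hypothetical interface.
type Store interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Set(ctx context.Context, key string, val []byte, ttl time.Duration) error
}

// tieredGet walks hot -> warm -> cold and backfills the upper tiers
// on a hit, implementing the cold-hot separation described above.
func tieredGet(ctx context.Context, key string, redis, taishan, mysql Store) ([]byte, error) {
	if v, err := redis.Get(ctx, key); err == nil {
		return v, nil // hot path: recent, TTL-bounded data
	}
	if v, err := taishan.Get(ctx, key); err == nil {
		_ = redis.Set(ctx, key, v, 24*time.Hour) // re-warm the hot tier
		return v, nil
	}
	v, err := mysql.Get(ctx, key) // cold path: full data, slowest
	if err != nil {
		return nil, err
	}
	_ = taishan.Set(ctx, key, v, 0) // 0 = no TTL in this sketch
	_ = redis.Set(ctx, key, v, 24*time.Hour)
	return v, nil
}

// memStore is a toy in-memory tier so the sketch runs standalone.
type memStore map[string][]byte

func (m memStore) Get(_ context.Context, k string) ([]byte, error) {
	if v, ok := m[k]; ok {
		return v, nil
	}
	return nil, errors.New("miss")
}

func (m memStore) Set(_ context.Context, k string, v []byte, _ time.Duration) error {
	m[k] = v
	return nil
}

func main() {
	redis, taishan, mysql := memStore{}, memStore{}, memStore{"conv:42": []byte("full row")}
	v, _ := tieredGet(context.Background(), "conv:42", redis, taishan, mysql)
	fmt.Println(string(v), "| re-warmed:", string(redis["conv:42"]))
}
```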
Single‑Chat Optimizations
Active Cache Pre-warming: Asynchronously build conversation caches when a user's unread-count event is captured, prioritizing high-impact users such as large UP owners (creators with big followings).
Redis + Taishan Dual Persistence: Redis holds the last 24 hours of data; Taishan holds the latest 600 entries per user; MySQL is hit only when both miss.
Consistency Guarantees: A Redis Lua CAS keeps binlog processing monotonic on the millisecond modification timestamp (mtime); binlog entries carrying older timestamps are discarded, so out-of-order delivery cannot roll the cache backwards.
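A sketch of that mtime-based CAS using the go-redis client follows. Because Redis executes Lua scripts atomically, the compare and the write cannot interleave with another updater; the key and field names are illustrative.

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

// casByMtime updates the cached conversation only when the incoming
// binlog row is newer than what Redis already holds, so out-of-order
// binlog delivery cannot roll the cache backwards.
var casByMtime = redis.NewScript(`
local cur = tonumber(redis.call('HGET', KEYS[1], 'mtime') or '0')
local incoming = tonumber(ARGV[1])
if incoming <= cur then
  return 0  -- stale binlog entry: discard
end
redis.call('HSET', KEYS[1], 'mtime', ARGV[1], 'payload', ARGV[2])
redis.call('EXPIRE', KEYS[1], 86400)
return 1
`)

// applyBinlog reports whether the entry was applied (true) or
// discarded as stale (false).
func applyBinlog(ctx context.Context, rdb *redis.Client, key string, mtimeMs int64, payload string) (bool, error) {
	n, err := casByMtime.Run(ctx, rdb, []string{key}, mtimeMs, payload).Int()
	return n == 1, err
}

func main() {
	// Assumes a local Redis for demonstration purposes.
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	ok, err := applyBinlog(context.Background(), rdb, "conv:42", 1700000000123, `{"last_msg":"hi"}`)
	fmt.Println(ok, err)
}
```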
Batch Message Enhancements
Normal Channel: Shared quota for routine batch messages.
High-Priority Channel: Scale topic partitions, pod count, and proxy connections to raise throughput from 3.5k to 30k users per second for special events.
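One way to express the two-channel quota split is with independent rate limiters, sketched below using golang.org/x/time/rate. The 3.5k and 30k users-per-second figures come from the text; the limiter-based design and everything else are illustrative.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// Two delivery channels with independent quotas: routine batch sends
// share the normal limiter, while special events get the dedicated
// high-priority one backed by scaled-out partitions and pods.
var (
	normalQuota = rate.NewLimiter(3500, 3500)   // users/s, burst
	highQuota   = rate.NewLimiter(30000, 30000) // event-time capacity
)

// deliverBatch paces a batch send against its channel's quota.
func deliverBatch(ctx context.Context, users []int64, high bool) error {
	q := normalQuota
	if high {
		q = highQuota
	}
	for _, uid := range users {
		if err := q.Wait(ctx); err != nil { // blocks until quota allows
			return err
		}
		_ = uid // hand off to the per-user send pipeline here
	}
	return nil
}

func main() {
	_ = deliverBatch(context.Background(), []int64{1, 2, 3}, true)
	fmt.Println("batch handed off on the high-priority channel")
}
```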
4. Conclusion
Technical upgrades are incremental; set realistic goals, evaluate trade‑offs, and ensure smooth migration between old and new architectures. Continuous monitoring of performance metrics is essential to maintain stability while the legacy system is gradually phased out.