How Bilibili Scaled Its Private Messaging System to Handle 10× Traffic
This article analyzes Bilibili's private messaging architecture, identifies performance bottlenecks such as MySQL slow queries, storage limits, and tightly coupled services, and presents a multi‑layer redesign with caching, sharding, BFF separation, and consistency mechanisms to sustain tenfold traffic growth.
1. Current Situation
The private messaging service is divided into four business lines—customer service, system notifications, interactive notifications, and private messages—each with its own service layers. Private messages further include single‑chat, batch B2C messages, group chats, and fan‑assistant chats; these subtypes are not technically decoupled from one another, which causes scalability issues.
Key Concepts
Conversation List : List of chat partners shown on the homepage envelope icon.
Conversation Detail : Atomic state of a chat, including sender/receiver IDs, unread count, and ordering.
Conversation History : Time‑ordered list of messages for a chat; shared across participants in group chats.
Inbox : KV store mapping a send event to a unique message ID for sequential reads.
Message Content : The original payload and status of a message, keyed by a unique ID generated by a snowflake‑style allocator (see the allocator sketch after this list).
Timeline Model : Abstract model comprising the message body, read/write pointers, and producer/consumer modules for synchronization and indexing.
Read Diffusion : Pull‑based model where a group message is written once to the group inbox and read by all members.
Write Diffusion : Push‑based model where a single‑chat updates both sender and receiver conversation states.
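For concreteness, here is a minimal sketch of a snowflake‑style ID allocator in Go. The bit layout (41‑bit millisecond timestamp, 10‑bit worker ID, 12‑bit sequence) follows the classic Twitter scheme and is an assumption; the article does not disclose Bilibili's exact field widths.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Snowflake-style allocator: 41-bit millisecond timestamp, 10-bit
// worker ID, 12-bit per-millisecond sequence. IDs are time-ordered,
// which is what lets the timeline model sort and index messages by
// ID alone.
type Snowflake struct {
	mu       sync.Mutex
	epoch    int64 // custom epoch in ms
	workerID int64 // 0..1023
	lastMs   int64
	seq      int64 // 0..4095
}

func NewSnowflake(workerID int64) *Snowflake {
	return &Snowflake{epoch: 1577836800000, workerID: workerID} // epoch: 2020-01-01
}

func (s *Snowflake) NextID() int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	now := time.Now().UnixMilli()
	if now == s.lastMs {
		s.seq = (s.seq + 1) & 0xFFF
		if s.seq == 0 { // sequence exhausted: wait for the next millisecond
			for now <= s.lastMs {
				now = time.Now().UnixMilli()
			}
		}
	} else {
		s.seq = 0
	}
	s.lastMs = now
	return (now-s.epoch)<<22 | s.workerID<<12 | s.seq
}

func main() {
	sf := NewSnowflake(1)
	fmt.Println(sf.NextID(), sf.NextID()) // strictly increasing on one worker
}
```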
2. Technical Problems
2.1 Conversation Slow Queries – When the conversation cache expires, MySQL becomes the sole source. High QPS, limited connections, and large table sizes (up to 200 million rows for top users) cause seconds‑level latency and empty conversation pages. Adding more pods or indexes only provides short‑term relief.
2.2 Message Content Storage Ceiling – Each message has a unique msgkey that embeds a timestamp, which is used for time‑based sharding. Quarterly sharding combined with monthly tables still leaves billions of rows per table, and peak write QPS reaches 20 k, far exceeding the safe limit of roughly 3 k QPS per MySQL instance.
2.3 Service Coupling – Single‑chat, group chat, batch messages, and fan‑assistant share the same code paths for sending and delivery, increasing complexity and causing resource contention when traffic spikes.
3. Upgrade Path
3.1 Overall Architecture Redesign
The system is split into four layers: Access (BFF and gateway), Business (complex query services), Platform (IM‑style real‑time delivery), and Reach (long‑lived connections and push integration). This separates heavy list queries from real‑time messaging.
3.2 Local Cache Degradation
Critical UI paths (e.g., payment assistant, official notifications) cache data on the client to avoid blank pages during extreme load.
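The pattern is essentially cache‑aside with a stale fallback: serve fresh data when the backend answers, and serve the last good local copy instead of a blank page when it does not. A minimal sketch in Go (the real clients are mobile/web, and all names here are illustrative):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type cachedList struct {
	data    []string
	savedAt time.Time
}

type Client struct {
	local    *cachedList              // persisted on the device in practice
	fetch    func() ([]string, error) // the network call
	maxStale time.Duration            // how old a fallback copy may be
}

// ConversationList returns the list plus a flag telling the UI
// whether it is serving degraded (stale) data.
func (c *Client) ConversationList() ([]string, bool, error) {
	data, err := c.fetch()
	if err == nil {
		c.local = &cachedList{data: data, savedAt: time.Now()}
		return data, false, nil
	}
	// Backend degraded: fall back to the local copy if it is recent enough.
	if c.local != nil && time.Since(c.local.savedAt) < c.maxStale {
		return c.local.data, true, nil
	}
	return nil, false, errors.New("no data and no usable local cache")
}

func main() {
	calls := 0
	c := &Client{
		maxStale: 24 * time.Hour,
		fetch: func() ([]string, error) {
			calls++
			if calls == 1 {
				return []string{"payment assistant", "official notice"}, nil
			}
			return nil, errors.New("backend overloaded")
		},
	}
	fmt.Println(c.ConversationList()) // fresh data
	fmt.Println(c.ConversationList()) // stale fallback, no blank page
}
```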
3.3 BFF Layer Refactor
The BFF absorbs business logic, exposing five new services: single‑chat, group‑chat, system notification, interactive notification, and message settings. This improves the health and independence of each microservice.
3.4 Backend Availability Enhancements
Focus on Business and Platform layers:
Business Layer : Multi‑level cache (Redis for hot data; Taishan, Bilibili's distributed KV store, for a bounded set of recent details; MySQL for full data) and read‑write splitting, with 95% of complex queries routed to replicas (see the read‑path sketch after this list).
Platform Layer : Timeline model based on snowflake IDs, with read diffusion for groups and write diffusion for single chats (see the diffusion sketch after this list).
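The Business‑layer read path can be pictured as a waterfall over the three stores. A compilable sketch, with an illustrative interface rather than Bilibili's actual APIs:

```go
// Waterfall read for conversation details: Redis (hot) -> Taishan
// (bounded recent data) -> MySQL replica (full data).
package cache

import (
	"context"
	"errors"
)

var ErrMiss = errors.New("cache/storage miss")

type Store interface {
	Get(ctx context.Context, key string) ([]byte, error)
}

type MultiLevel struct {
	redis, taishan, mysqlReplica Store
}

func (m *MultiLevel) GetDetail(ctx context.Context, key string) ([]byte, error) {
	for _, s := range []Store{m.redis, m.taishan, m.mysqlReplica} {
		v, err := s.Get(ctx, key)
		if err == nil {
			return v, nil // backfilling the faster layers is omitted here
		}
		if !errors.Is(err, ErrMiss) {
			return nil, err // a real failure should not be masked as a miss
		}
	}
	return nil, ErrMiss
}
```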
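The two diffusion models differ only in where the write lands. A sketch under the same caveat (all types and method names are invented for illustration):

```go
// Write diffusion (single chat): one send updates both participants'
// conversation state. Read diffusion (group chat): one send is
// appended once to the shared group timeline, and each member derives
// unread counts from a personal read pointer.
package diffusion

type Msg struct {
	ID   int64 // snowflake-style, so ordering by ID is ordering by time
	From string
	Text string
}

type Timelines struct {
	perUser  map[string][]Msg // write diffusion: one copy per participant
	perGroup map[string][]Msg // read diffusion: one shared copy
	readPtr  map[string]int64 // member -> last read message ID
}

func NewTimelines() *Timelines {
	return &Timelines{
		perUser:  map[string][]Msg{},
		perGroup: map[string][]Msg{},
		readPtr:  map[string]int64{},
	}
}

func (t *Timelines) SendSingle(from, to string, m Msg) {
	t.perUser[from] = append(t.perUser[from], m) // sender's state
	t.perUser[to] = append(t.perUser[to], m)     // receiver's state
}

func (t *Timelines) SendGroup(group string, m Msg) {
	t.perGroup[group] = append(t.perGroup[group], m) // written exactly once
}

func (t *Timelines) UnreadInGroup(group, member string) int {
	n := 0
	for _, m := range t.perGroup[group] {
		if m.ID > t.readPtr[member] {
			n++
		}
	}
	return n
}
```

Write diffusion makes reads cheap at the cost of one write per participant, which is affordable for two parties but not for large groups; read diffusion inverts that trade‑off, which is why the platform uses one model per chat type.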
3.5 Single‑Chat Optimizations
Cache pre‑warming for high‑impact users (top uploaders, or "large UPs"), identified through offline analysis.
Monitoring of hot‑conversation counts to trigger automatic cache warm‑up (see the warm‑up sketch after this list).
Dual persistence with Taishan and MySQL; Taishan serves as the first fallback after a Redis miss.
Consistency ensured via Redis Lua scripts implementing compare‑and‑swap on binlog timestamps (mtime), as shown in the Lua sketch after this list.
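A minimal sketch of the warm‑up trigger: count reads per conversation within a window and preload the detail row into Redis once a threshold is crossed. The threshold, window, and names are assumptions:

```go
package warmup

import "sync"

// Warmer counts reads per conversation in the current window; when a
// conversation crosses the threshold it is preloaded into Redis.
type Warmer struct {
	mu        sync.Mutex
	counts    map[string]int
	threshold int
	preload   func(convID string) // e.g. copy the detail row into Redis
}

func NewWarmer(threshold int, preload func(string)) *Warmer {
	return &Warmer{counts: make(map[string]int), threshold: threshold, preload: preload}
}

func (w *Warmer) OnRead(convID string) {
	w.mu.Lock()
	w.counts[convID]++
	hot := w.counts[convID] == w.threshold // fires once per window
	w.mu.Unlock()
	if hot {
		go w.preload(convID) // warm the cache off the request path
	}
}

// ResetWindow is called periodically (e.g. every minute) to start a
// fresh counting window.
func (w *Warmer) ResetWindow() {
	w.mu.Lock()
	w.counts = make(map[string]int)
	w.mu.Unlock()
}
```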
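The Lua sketch below shows the compare‑and‑swap idea: binlog events can arrive out of order, so a cache update is applied only if its mtime is newer than what the cache already holds, and the check and write are atomic because they run inside one script. It uses the go‑redis client; the key layout and field names are assumptions:

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

// KEYS[1] = conversation hash key
// ARGV[1] = binlog mtime of this update, ARGV[2] = serialized detail
var casByMtime = redis.NewScript(`
local cur = redis.call('HGET', KEYS[1], 'mtime')
if cur and tonumber(cur) >= tonumber(ARGV[1]) then
  return 0  -- stale binlog event: keep the newer cached value
end
redis.call('HSET', KEYS[1], 'mtime', ARGV[1], 'detail', ARGV[2])
return 1
`)

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	ctx := context.Background()
	applied, err := casByMtime.Run(ctx, rdb, []string{"conv:1001:2002"},
		1718000000123, `{"unread":3}`).Int()
	fmt.Println(applied, err) // 1 = applied, 0 = rejected as stale
}
```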
3.6 Inbox Refactor
New inbox uses Redis + Taishan; Redis holds 24‑hour hot data, Taishan stores up to 600 entries per user (≈20 pages). Misses fall back to MySQL replicas.
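A compilable sketch of the tiered paging read, assuming a page size of 30 (which makes 600 entries ≈ 20 pages); the storage interfaces are illustrative:

```go
// Paging read over the refactored inbox: Redis covers the hot 24 h,
// Taishan covers the first ~600 entries per user, and only deeper
// history falls through to a MySQL replica, shielding the primary.
package inbox

import "context"

const (
	pageSize       = 30
	taishanMaxRows = 600 // ≈ 20 pages per user
)

type Entry struct{ MsgID int64 }

type Reader struct {
	fromRedis   func(ctx context.Context, uid int64, page int) ([]Entry, bool)
	fromTaishan func(ctx context.Context, uid int64, page int) ([]Entry, bool)
	fromReplica func(ctx context.Context, uid int64, page int) []Entry
}

func (r *Reader) Page(ctx context.Context, uid int64, page int) []Entry {
	if es, ok := r.fromRedis(ctx, uid, page); ok {
		return es // hot data from the last 24 hours
	}
	if page*pageSize < taishanMaxRows {
		if es, ok := r.fromTaishan(ctx, uid, page); ok {
			return es
		}
	}
	return r.fromReplica(ctx, uid, page) // cold history
}
```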
3.7 Message Content Refactor
Separate storage for single‑chat content, fully decoupled from group and fan‑assistant schemas.
Asynchronous MySQL writes to raise the write‑QPS ceiling (see the writer sketch after this list).
Monthly sharding with 100 tables per shard reduces per‑table size from billions to millions of rows (see the routing sketch after this list).
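The writer sketch: senders enqueue and return immediately, while a background goroutine drains the queue into batched inserts, smoothing the 20 k write peaks into steady bulk writes. Batch size and flush interval are illustrative tuning knobs:

```go
package asyncwrite

import (
	"context"
	"time"
)

type Row struct {
	MsgKey  int64
	Payload []byte
}

type Writer struct {
	ch    chan Row
	flush func(ctx context.Context, rows []Row) error // bulk INSERT
}

func NewWriter(flush func(context.Context, []Row) error) *Writer {
	w := &Writer{ch: make(chan Row, 65536), flush: flush}
	go w.loop()
	return w
}

// Enqueue is non-blocking: a full queue degrades explicitly instead
// of stalling the send path.
func (w *Writer) Enqueue(r Row) bool {
	select {
	case w.ch <- r:
		return true
	default:
		return false // caller can degrade or retry
	}
}

func (w *Writer) loop() {
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()
	batch := make([]Row, 0, 200)
	for {
		select {
		case r := <-w.ch:
			batch = append(batch, r)
			if len(batch) < 200 {
				continue // keep filling the batch
			}
		case <-ticker.C:
			if len(batch) == 0 {
				continue // nothing to flush this tick
			}
		}
		_ = w.flush(context.Background(), batch) // retries/DLQ omitted
		batch = batch[:0]
	}
}
```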
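The routing sketch: the msgkey's embedded timestamp selects the monthly shard, and a hash of the conversation spreads rows across the 100 tables. The bit layout is carried over from the allocator sketch in section 1 and remains an assumption:

```go
package sharding

import (
	"fmt"
	"time"
)

const epochMs = 1577836800000 // must match the ID allocator's epoch

// TableFor maps a message to its physical table: monthly shard from
// the msgkey's embedded timestamp, one of 100 tables from the talker.
func TableFor(msgKey, talkerID int64) string {
	ms := (msgKey >> 22) + epochMs // recover the embedded timestamp
	t := time.UnixMilli(ms).UTC()
	table := talkerID % 100 // 100 tables per monthly shard
	return fmt.Sprintf("msg_content_%04d%02d_t%02d", t.Year(), t.Month(), table)
}
```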
3.8 Batch Message Improvements
Two channels: normal batch traffic shares a common quota, while a high‑priority channel expands topic partitions, consumer pods, and connection limits, raising throughput from 3.5 k to 30 k messages per second (see the channel sketch below).
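The channel sketch below models the split as two token‑bucket limiters. The real mechanism scales message‑queue partitions and consumer pods rather than rate limits, so this is only an analogy for the quota separation; the numbers are the ones cited above:

```go
package batch

import "golang.org/x/time/rate"

// Channels separates normal batch traffic (shared quota) from the
// dedicated high-priority channel with a much higher ceiling.
type Channels struct {
	shared   *rate.Limiter // common quota for normal batch jobs
	priority *rate.Limiter // dedicated high-priority pipe
}

func NewChannels() *Channels {
	return &Channels{
		shared:   rate.NewLimiter(3500, 3500),   // ~3.5 k msg/s shared
		priority: rate.NewLimiter(30000, 30000), // ~30 k msg/s dedicated
	}
}

// Route returns the limiter a batch job should draw tokens from.
func (c *Channels) Route(highPriority bool) *rate.Limiter {
	if highPriority {
		return c.priority
	}
	return c.shared
}
```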
4. Conclusion
Technical upgrades must be incremental, balancing cost and feasibility. The redesign bridges the gap between the existing and ideal architectures, ensures smooth migration, and maintains service stability. Continuous monitoring and iterative optimization are essential for a robust IM system.
