Ensuring High Availability of Bilibili Live Gift System: Panel, Gift Feeding, and Multi‑Active Architecture
Bilibili targets 99.99% availability for its live-gift ecosystem through a combination of techniques: pre-loading panel data, circuit breakers with cached fallback responses, database sharding, Kubernetes auto-scaling, Redis hot-key caching with distributed rate limiting, timeout-safe and idempotent order/MQ processing, and multi-active cross-region deployment with region-aware service discovery.
According to Bilibili's Q4 2022 financial report, live-stream viewership peaked at 330 million during the New Year's gala. Gift (prop) feeding is a critical revenue driver, and this article explains how the live-gift ecosystem achieves 99.99% availability.
1. Gift Panel
The panel displays all available gifts when users click the gift icon in the live room. To avoid loading delays, the panel data is pre‑loaded as soon as a user enters the room.
Challenges:
Different tabs (privileged/custom) depend on user‑specific interfaces that may be unstable.
During large events, sudden spikes in room entry traffic can exceed interface TPS.
Solutions (circuit‑breaker + degradation):
If a privileged-gift interface does not respond within 50 ms, the system degrades to an empty response; if the failure rate exceeds 50%, the circuit breaker trips for a cooldown period.
When traffic surges, the gateway detects hot rooms and serves cached panel data directly from memory, ensuring the panel remains usable even if some users temporarily lose privileged gifts.
2. Gift Feeding (Sending Gifts to Anchors)
The gift panel offers various items (blind boxes, treasure chests, etc.) and supports combo actions. The underlying revenue‑center includes order, product, settlement, and other business systems.
Key challenges:
Database instability (timeouts, hardware failures).
Traffic spikes overwhelming order processing capacity.
Order timeout inconsistencies.
Message‑queue (MQ) problems.
Database Instability
Root causes are poor schema design and hardware issues. Mitigations include:
Cluster isolation: separate clusters per UID to limit impact.
Sharding: split orders into 10 databases by UID, then further partition by month.
Daily monitoring: alert on slow SQL, prioritize fixes, and use master‑slave failover for MySQL crashes.
Traffic Surge
For large events, full‑stack load testing is performed in advance. Real‑time solutions:
Kubernetes + HPA for automatic pod scaling.
Redis hot-key detection and fallback to in-memory cache, optionally using the singleflight pattern (golang.org/x/sync/singleflight) to collapse concurrent fetches of the same key.
Distributed rate limiting via a quota‑server.
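The singleflight idea mentioned above is worth unpacking: when a hot key misses the cache, only one goroutine should go to Redis/DB while the rest wait for its result. Here is a stdlib-only sketch of that pattern (production code would typically use golang.org/x/sync/singleflight instead):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// group collapses concurrent loads of the same hot key into one fetch.
type group struct {
	mu    sync.Mutex
	calls map[string]*call
}

type call struct {
	wg  sync.WaitGroup
	val string
}

func (g *group) Do(key string, fn func() string) string {
	g.mu.Lock()
	if g.calls == nil {
		g.calls = make(map[string]*call)
	}
	if c, ok := g.calls[key]; ok {
		g.mu.Unlock()
		c.wg.Wait() // another goroutine is already fetching this key
		return c.val
	}
	c := new(call)
	c.wg.Add(1)
	g.calls[key] = c
	g.mu.Unlock()

	c.val = fn() // only this goroutine hits the backend
	c.wg.Done()

	g.mu.Lock()
	delete(g.calls, key)
	g.mu.Unlock()
	return c.val
}

func main() {
	var g group
	var fetches int32
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // 100 concurrent readers of one hot room
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.Do("room:1024:panel", func() string {
				atomic.AddInt32(&fetches, 1)
				time.Sleep(10 * time.Millisecond) // simulate a slow backend
				return "panel-json"
			})
		}()
	}
	wg.Wait()
	fmt.Println(fetches) // far fewer than 100 backend fetches
}
```

Without this collapsing, a hot-key cache miss turns into a thundering herd against Redis or the database.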
Order Timeout
Two main issues:
Cross-service call timeout (default 250 ms): order DB operations routinely run longer than this, so the gift-order path uses a separately configured, longer timeout to avoid false failures.
Successful payment but a timed-out response: the system re-checks the order status after a 2 s delay; if the order is still unpaid, it returns failure. Mismatched cases are flagged for audit and manual refund verification.
Message Queue Issues
Potential problems: data delay, duplication, loss. Mitigations include:
Alerting on missing settlement data.
Idempotent handling via unique DB indexes or distributed locks.
Manual commit with dead‑letter queue for unrecoverable messages.
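The unique-index idempotency idea reduces to "first insert wins, duplicates are no-ops". A minimal Go sketch, using an in-memory set as a stand-in for a DB table with a UNIQUE index on the message ID:

```go
package main

import (
	"fmt"
	"sync"
)

// seen stands in for a table with a UNIQUE index on message ID:
// the first insert succeeds, duplicate inserts are rejected.
type seen struct {
	mu  sync.Mutex
	ids map[string]bool
}

func (s *seen) firstTime(msgID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.ids[msgID] {
		return false // duplicate delivery: skip settlement
	}
	s.ids[msgID] = true
	return true
}

func handleSettlement(s *seen, msgID string, settle func()) {
	if s.firstTime(msgID) {
		settle() // applied exactly once even if the MQ redelivers
	}
}

func main() {
	s := &seen{ids: make(map[string]bool)}
	count := 0
	for i := 0; i < 3; i++ { // simulate triple delivery of one message
		handleSettlement(s, "settle-42", func() { count++ })
	}
	fmt.Println(count) // 1
}
```

With a real database, the same effect comes for free: a duplicate insert violates the unique index, and the consumer treats that error as "already processed" and acks the message.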
3. Multi‑Active Deployment
Multi‑active ensures business continuity even if an entire data center fails. Bilibili currently operates same‑city multi‑active with cross‑region calls to Redis, DB, and MQ. The next phase prioritizes local‑region resources.
Redis Strategies
Master‑slave replication per data center.
Read‑write separation to reduce latency.
Active-active mode (bidirectional synchronization across centers) is not used, due to the complexity of keeping the copies consistent.
Bilibili’s approach treats Redis as a cache without cross‑region sync, using the internal Taishan KV store (Raft‑based) for features requiring strong consistency (distributed locks, etc.).
Database Strategy
Primary‑secondary with cross‑region sync. Weak consistency reads from local replicas; strong consistency (e.g., payments) reads from the primary, accepting higher latency.
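The routing decision described here is small enough to make explicit. A sketch with illustrative endpoint names (the real routing lives in the data-access layer):

```go
package main

import "fmt"

// chooseReplica routes a read: strong-consistency reads (e.g. payments)
// always go to the primary, possibly cross-region; weak-consistency reads
// use the local-region replica for lower latency.
func chooseReplica(strong bool, localReplica, primary string) string {
	if strong {
		return primary // accept higher latency for correctness
	}
	return localReplica
}

func main() {
	fmt.Println(chooseReplica(true, "db-replica.shanghai", "db-primary.beijing"))  // db-primary.beijing
	fmt.Println(chooseReplica(false, "db-replica.shanghai", "db-primary.beijing")) // db-replica.shanghai
}
```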
MQ Strategy
Same‑region production/consumption.
Cross‑region consumption via Kafka Mirror, tagging messages by region to allow consumers to filter.
Key considerations: message ordering across regions, hot‑standby recovery.
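The region-tagging scheme can be sketched as a consumer-side filter. The `Msg` shape and region codes are assumptions for the example; in practice the tag would travel in Kafka message headers:

```go
package main

import "fmt"

// Msg carries a producing-region tag so that, after Kafka Mirror copies
// topics across regions, each consumer can drop the mirrored duplicates.
type Msg struct {
	Region string
	Body   string
}

// consumeLocal keeps only messages produced in the consumer's own region,
// avoiding double-processing of mirrored copies.
func consumeLocal(region string, in []Msg) []string {
	var out []string
	for _, m := range in {
		if m.Region == region {
			out = append(out, m.Body)
		}
	}
	return out
}

func main() {
	stream := []Msg{{"sh", "gift-1"}, {"bj", "gift-2"}, {"sh", "gift-3"}}
	fmt.Println(consumeLocal("sh", stream)) // [gift-1 gift-3]
}
```

During failover, the filter can be relaxed so the surviving region picks up the mirrored backlog, which is where the ordering and hot-standby concerns above come in.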
Service Discovery
Uses Bilibili’s internal discovery framework to prefer local‑region services.
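The "prefer local region" policy amounts to a simple selection rule over the instances discovery returns. A sketch with illustrative types (the internal discovery API is not public):

```go
package main

import "fmt"

// Instance of a downstream service as returned by discovery.
type Instance struct {
	Addr   string
	Region string
}

// pickInstance prefers a same-region instance and falls back to any other,
// so cross-region hops happen only when the local region has no capacity.
func pickInstance(local string, all []Instance) (Instance, bool) {
	for _, in := range all {
		if in.Region == local {
			return in, true
		}
	}
	if len(all) > 0 {
		return all[0], true // cross-region fallback
	}
	return Instance{}, false
}

func main() {
	all := []Instance{{"10.0.1.5:9000", "bj"}, {"10.0.2.7:9000", "sh"}}
	in, _ := pickInstance("sh", all)
	fmt.Println(in.Addr) // 10.0.2.7:9000
}
```

A production version would also load-balance among same-region instances rather than always taking the first match.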
In summary, achieving four‑nines availability for the live‑gift system involves a combination of circuit breaking, rate limiting, multi‑level caching, Kubernetes auto‑scaling, sharding, master‑slave isolation, multi‑active deployment, and service discovery.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.