How Bilibili Achieves 99.99% Availability for Live Gift Systems
This article explains Bilibili's technical strategies—preloading, circuit breaking, sharding, multi‑active deployment, and Kubernetes auto‑scaling—that ensure the live‑gift panel, feeding flow, and supporting services maintain 99.99% uptime even during massive traffic spikes.
1. Gift Panel
The panel preloads all gift data when a user enters a live room to eliminate loading delays. Two main reliability problems arise:
Interface instability for user‑specific gift tabs can degrade the UI.
Massive traffic spikes (e.g., during large events) can exceed the TPS of the dependent interfaces.
Mitigation: apply a circuit‑breaker and downgrade strategy. If a privileged‑gift call exceeds its 50 ms latency budget, the service returns an empty payload rather than blocking the panel. If the interface's failure rate exceeds 50%, the circuit breaker trips for a configurable cooldown period. While downgraded, the panel serves in‑memory cached data, preserving basic functionality for all users.
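A minimal sketch of this breaker‑plus‑downgrade path in Go, assuming hypothetical helpers (fetchPrivilegedGifts, cachedPanel) and an illustrative cooldown; the real thresholds and window are configuration‑driven:

```go
package panel

import (
	"context"
	"sync"
	"time"
)

// Gift is a minimal stand-in for the real panel item type.
type Gift struct{ ID int64 }

const (
	latencyBudget = 50 * time.Millisecond // per-call budget from the text
	failureRate   = 0.5                   // trip threshold from the text
	cooldown      = 30 * time.Second      // illustrative; configurable in practice
	minSamples    = 20                    // avoid tripping on tiny sample sizes
)

// breaker tracks a failure rate and trips for a cooldown once the rate
// crosses the threshold. A rolling window is omitted for brevity.
type breaker struct {
	mu        sync.Mutex
	failures  int
	total     int
	openUntil time.Time
}

func (b *breaker) allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().After(b.openUntil)
}

func (b *breaker) record(failed bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.total++
	if failed {
		b.failures++
	}
	if b.total >= minSamples && float64(b.failures)/float64(b.total) > failureRate {
		b.openUntil = time.Now().Add(cooldown) // trip and cool down
		b.failures, b.total = 0, 0
	}
}

// PrivilegedGifts serves the user-specific tab: empty payload on slow or
// failed calls, in-memory cache while the breaker is open.
func (b *breaker) PrivilegedGifts(ctx context.Context, uid int64) []Gift {
	if !b.allow() {
		return cachedPanel(uid) // downgraded: serve cached data
	}
	ctx, cancel := context.WithTimeout(ctx, latencyBudget)
	defer cancel()
	gifts, err := fetchPrivilegedGifts(ctx, uid)
	if err != nil { // includes context.DeadlineExceeded past 50 ms
		b.record(true)
		return nil // empty payload, panel still renders
	}
	b.record(false)
	return gifts
}

// Hypothetical stand-ins for the upstream call and the in-memory cache.
func fetchPrivilegedGifts(ctx context.Context, uid int64) ([]Gift, error) { return nil, nil }
func cachedPanel(uid int64) []Gift                                        { return nil }
```

Serving the stale cached panel while the breaker is open trades freshness for availability, which is exactly the stated goal of keeping basic functionality for all users.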
2. Gift Feeding
The feeding subsystem processes various gift types (blind boxes, treasure chests, etc.) and integrates with the revenue middle platform (order, product, settlement). Room information is a critical foundation and must be highly available.
Key failure modes
Database instability: timeouts, hardware failures, or schema bottlenecks.
Traffic surges: spikes that exceed order‑processing capacity.
Order timeout: the order service may report a timeout even though payment succeeded.
Message‑queue anomalies: delayed, duplicated, or lost messages.
Database resilience
We isolate clusters and shard data by user ID:
10 logical MySQL shards selected by uid % 10, each further partitioned into monthly tables (see the routing sketch after this list).
Daily monitoring of slow SQL; alerts trigger automated remediation.
Master‑slave failover with automatic promotion.
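A sketch of that routing rule; the shard and table naming is assumed for illustration rather than Bilibili's actual scheme:

```go
package shard

import (
	"fmt"
	"time"
)

// Route maps a user ID and event time to a physical shard and table.
// 10 logical MySQL shards are selected by uid % 10; within each shard,
// orders are split into monthly tables. Naming is illustrative.
func Route(uid int64, t time.Time) (db, table string) {
	shard := uid % 10
	db = fmt.Sprintf("gift_order_%02d", shard)           // e.g. gift_order_05
	table = fmt.Sprintf("orders_%s", t.Format("200601")) // e.g. orders_202406
	return db, table
}
```

Keeping the rule a pure function makes it trivial to unit‑test, including behavior across the monthly table rollover.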
Handling traffic spikes
Kubernetes Horizontal Pod Autoscaler (HPA) automatically scales pods based on CPU/RAM and custom metrics. Additional safeguards:
Redis hotspot‑key detection; hot keys are migrated to in‑memory caches.
In‑memory cache for hot room data, served directly when the gateway detects a hotspot.
Use of Go's singleflight package (golang.org/x/sync/singleflight) to coalesce concurrent Redis reads, so that, e.g., 100 simultaneous requests in one pod collapse into a single fetch (see the sketch after this list).
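The coalescing pattern, sketched with golang.org/x/sync/singleflight and the go‑redis client; the key format is an assumption:

```go
package hotroom

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
	"golang.org/x/sync/singleflight"
)

var (
	rdb   = redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	group singleflight.Group
)

// RoomInfo coalesces concurrent lookups for the same room: while one
// Redis GET is in flight, later callers for the same key wait on it and
// share its result instead of issuing their own reads.
func RoomInfo(ctx context.Context, roomID int64) (string, error) {
	key := fmt.Sprintf("room:%d", roomID) // key naming is illustrative
	v, err, _ := group.Do(key, func() (interface{}, error) {
		return rdb.Get(ctx, key).Result()
	})
	if err != nil {
		return "", err
	}
	return v.(string), nil
}
```

singleflight deduplicates within a single process, so each pod holds at most one in‑flight read per hot key; combined with the in‑memory hot‑room cache, this keeps a stampede from amplifying into Redis.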
Order timeout mitigation
Cross‑process timeout checks are detached from the main request path. After a timeout, a delayed background job re‑checks the order status. If payment succeeded but the order was not recorded, an alert is raised for manual reconciliation.
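A sketch of that recheck in Go, with hypothetical queryPayment, orderRecorded, and alert helpers; in production the delay would more plausibly be a delayed MQ message than an in‑process timer, so it survives pod restarts:

```go
package order

import (
	"context"
	"log"
	"time"
)

// RecheckAfterTimeout runs off the request path. When the order call timed
// out we cannot tell whether payment succeeded, so a delayed job re-reads
// the authoritative payment status and reconciles.
func RecheckAfterTimeout(orderID string) {
	time.AfterFunc(30*time.Second, func() { // delay is illustrative
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()

		paid, err := queryPayment(ctx, orderID)
		if err != nil {
			log.Printf("recheck of %s failed: %v", orderID, err)
			alert(orderID)
			return
		}
		if paid && !orderRecorded(ctx, orderID) {
			alert(orderID) // paid but unrecorded: manual reconciliation
		}
	})
}

// Hypothetical stand-ins for the payment lookup, order lookup, and pager.
func queryPayment(ctx context.Context, orderID string) (bool, error) { return false, nil }
func orderRecorded(ctx context.Context, orderID string) bool         { return true }
func alert(orderID string)                                           {}
```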
Message‑queue reliability
We employ the following patterns:
Alert on delayed data streams.
Idempotent processing via unique database indexes or distributed locks (a unique‑index sketch follows this list).
Avoid manual commit that could block the queue; use automatic commit with dead‑letter handling for messages that exceed retry limits.
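A sketch of the unique‑index variant using database/sql and go‑sql‑driver/mysql error codes; the processed_msg table is an assumption:

```go
package consume

import (
	"context"
	"database/sql"
	"errors"

	"github.com/go-sql-driver/mysql"
)

// processOnce relies on a UNIQUE index on processed_msg.msg_id: the first
// insert for a message ID wins, and a duplicate-key error (1062) means the
// message was already handled, so redelivery becomes a no-op.
func processOnce(ctx context.Context, db *sql.DB, msgID string, handle func() error) error {
	_, err := db.ExecContext(ctx,
		`INSERT INTO processed_msg (msg_id) VALUES (?)`, msgID)
	if err != nil {
		var me *mysql.MySQLError
		if errors.As(err, &me) && me.Number == 1062 { // ER_DUP_ENTRY
			return nil // duplicate delivery: skip silently
		}
		return err
	}
	return handle()
}
```

In practice the marker insert and the business write would share one transaction, so a crash between them can neither double‑apply nor drop the effect.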
3. Multi‑Active Deployment
Multi‑active architecture ensures service continuity across data centers, even when an entire site fails. Bilibili currently operates a same‑city multi‑active setup.
Redis strategy
Typical Redis multi‑active options include master‑slave replication, read‑write separation, and active‑active synchronization. Bilibili treats Redis purely as a cache and does not synchronize data across sites. Persistent and strongly consistent key‑value storage is provided by the internally developed Taishan KV store, which is Raft‑based and supports both eventual and strong consistency.
Database strategy
Primary‑secondary replication spans data centers. Reads are classified by consistency requirements (a routing sketch follows the list):
Weak consistency: read from the local replica (≈2 ms latency) for non‑critical data.
Strong consistency: read from the primary, accepting the higher cross‑site latency, for critical operations such as payments.
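A sketch of that read split, assuming two pre‑opened handles; the names are illustrative:

```go
package db

import "database/sql"

// Consistency selects which database a read may use.
type Consistency int

const (
	Weak   Consistency = iota // local replica: ~2 ms, may lag the primary
	Strong                    // primary, possibly cross-site: higher latency
)

// Router holds the two handles a service keeps open.
type Router struct {
	Primary *sql.DB // primary, which may sit in the other data center
	Replica *sql.DB // replica in the local data center
}

// Reader returns the handle matching the caller's consistency need:
// payments and other critical reads go to the primary, everything
// else stays local.
func (r *Router) Reader(c Consistency) *sql.DB {
	if c == Strong {
		return r.Primary
	}
	return r.Replica
}
```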
Message‑queue strategy
Within a data center, producers and consumers are colocated. For cross‑site consumption, Kafka MirrorMaker replicates topics to the other data centers. Each message carries a room tag; local consumers filter out messages that belong to other sites.
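A sketch of the consumer‑side filter with segmentio/kafka-go, assuming the room tag travels as a Kafka message header (the header name and site ID are illustrative):

```go
package mq

import "github.com/segmentio/kafka-go"

const localSite = "dc-a" // illustrative site identifier

// isLocal reports whether a mirrored message belongs to this data center,
// based on a room tag carried as a Kafka header. The header name
// "room-site" is an assumption for illustration.
func isLocal(msg kafka.Message) bool {
	for _, h := range msg.Headers {
		if h.Key == "room-site" {
			return string(h.Value) == localSite
		}
	}
	return false // untagged mirrored messages are skipped conservatively
}
```

Each site consumes the full mirrored topic but acts only on messages tagged for its own rooms, so any given gift event is processed once globally.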
Service discovery
Bilibili’s internal discovery system prefers services in the same data center. Documentation:
https://github.com/bilibili/discovery/blob/master/doc/intro.md
Summary of high‑availability measures
To achieve the 99.99% availability target for the live‑gift ecosystem, the following techniques are combined:
Circuit‑breaker and graceful downgrade for unstable user‑specific interfaces.
Rate limiting and hotspot detection.
Multi‑level caching (memory, Redis, Taishan KV).
Kubernetes HPA for automatic scaling.
Database sharding and master‑slave isolation.
Same‑city multi‑active deployment for Redis, DB, and MQ.
Internal service discovery to route traffic to local instances.