How Bilibili Achieves 99.99% Availability for Live Gift Systems
This article explains Bilibili's technical strategies—preloading, circuit breaking, sharding, multi‑active deployment, and Kubernetes auto‑scaling—that ensure the live‑gift panel, feeding flow, and supporting services maintain 99.99% uptime even during massive traffic spikes.
1. Gift Panel
The panel preloads all gift data when a user enters a live room to eliminate loading delays. Two main reliability problems arise:
Interface instability for user‑specific gift tabs can degrade the UI.
Massive traffic spikes (e.g., during large events) can exceed the TPS of the dependent interfaces.
Mitigation: apply a circuit‑breaker and downgrade strategy. If a privileged‑gift call exceeds its 50 ms latency budget, the service returns an empty payload rather than blocking the panel. If the interface's failure rate exceeds 50%, the circuit breaker trips for a configurable cooldown period. While downgraded, the panel serves in‑memory cached data, preserving basic functionality for all users.
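A minimal sketch of this breaker‑plus‑downgrade path in Go, assuming hypothetical helpers (fetchPrivilegedGifts, cachedPanel) and an illustrative cooldown; the real thresholds and window are configuration‑driven:

```go
package panel

import (
	"context"
	"sync"
	"time"
)

// Gift is a minimal stand-in for the real panel item type.
type Gift struct{ ID int64 }

const (
	latencyBudget = 50 * time.Millisecond // per-call budget from the text
	failureRate   = 0.5                   // trip threshold from the text
	cooldown      = 30 * time.Second      // illustrative; configurable in practice
	minSamples    = 20                    // avoid tripping on tiny sample sizes
)

// breaker tracks a failure rate and trips for a cooldown once the rate
// crosses the threshold. A rolling window is omitted for brevity.
type breaker struct {
	mu        sync.Mutex
	failures  int
	total     int
	openUntil time.Time
}

func (b *breaker) allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().After(b.openUntil)
}

func (b *breaker) record(failed bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.total++
	if failed {
		b.failures++
	}
	if b.total >= minSamples && float64(b.failures)/float64(b.total) > failureRate {
		b.openUntil = time.Now().Add(cooldown) // trip and cool down
		b.failures, b.total = 0, 0
	}
}

// PrivilegedGifts serves the user-specific tab: empty payload on slow or
// failed calls, in-memory cache while the breaker is open.
func (b *breaker) PrivilegedGifts(ctx context.Context, uid int64) []Gift {
	if !b.allow() {
		return cachedPanel(uid) // downgraded: serve cached data
	}
	ctx, cancel := context.WithTimeout(ctx, latencyBudget)
	defer cancel()
	gifts, err := fetchPrivilegedGifts(ctx, uid)
	if err != nil { // includes context.DeadlineExceeded past 50 ms
		b.record(true)
		return nil // empty payload, panel still renders
	}
	b.record(false)
	return gifts
}

// Hypothetical stand-ins for the upstream call and the in-memory cache.
func fetchPrivilegedGifts(ctx context.Context, uid int64) ([]Gift, error) { return nil, nil }
func cachedPanel(uid int64) []Gift                                        { return nil }
```

Serving the stale cached panel while the breaker is open trades freshness for availability, which is exactly the stated goal of keeping basic functionality for all users.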
2. Gift Feeding
The feeding subsystem processes various gift types (blind boxes, treasure chests, etc.) and integrates with the revenue middle platform (order, product, settlement). Room information is a critical foundation and must be highly available.
Key failure modes
Database instability: timeouts, hardware failures, or schema bottlenecks.
Traffic surges: spikes that exceed order‑processing capacity.
Order timeout: the order service may report a timeout even though payment succeeded.
Message‑queue anomalies: delayed, duplicated, or lost messages.
Database resilience
We isolate clusters and shard data by user ID:
10 logical MySQL shards selected by uid % 10, each further partitioned into monthly tables (see the routing sketch after this list).
Daily monitoring of slow SQL; alerts trigger automated remediation.
Master‑slave failover with automatic promotion.
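A sketch of that routing rule; the shard and table naming is assumed for illustration rather than Bilibili's actual scheme:

```go
package shard

import (
	"fmt"
	"time"
)

// Route maps a user ID and event time to a physical shard and table.
// 10 logical MySQL shards are selected by uid % 10; within each shard,
// orders are split into monthly tables. Naming is illustrative.
func Route(uid int64, t time.Time) (db, table string) {
	shard := uid % 10
	db = fmt.Sprintf("gift_order_%02d", shard)           // e.g. gift_order_05
	table = fmt.Sprintf("orders_%s", t.Format("200601")) // e.g. orders_202406
	return db, table
}
```

Keeping the rule a pure function makes it trivial to unit‑test, including behavior across the monthly table rollover.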
Handling traffic spikes
Kubernetes Horizontal Pod Autoscaler (HPA) automatically scales pods based on CPU/RAM and custom metrics. Additional safeguards:
Redis hotspot‑key detection; hot keys are migrated to in‑memory caches.
In‑memory cache for hot room data, served directly when the gateway detects a hotspot.
Use of Go's singleflight package (golang.org/x/sync/singleflight) to coalesce concurrent Redis reads, so that, e.g., 100 simultaneous requests in one pod collapse into a single fetch (see the sketch after this list).
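The coalescing pattern, sketched with golang.org/x/sync/singleflight and the go‑redis client; the key format is an assumption:

```go
package hotroom

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
	"golang.org/x/sync/singleflight"
)

var (
	rdb   = redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	group singleflight.Group
)

// RoomInfo coalesces concurrent lookups for the same room: while one
// Redis GET is in flight, later callers for the same key wait on it and
// share its result instead of issuing their own reads.
func RoomInfo(ctx context.Context, roomID int64) (string, error) {
	key := fmt.Sprintf("room:%d", roomID) // key naming is illustrative
	v, err, _ := group.Do(key, func() (interface{}, error) {
		return rdb.Get(ctx, key).Result()
	})
	if err != nil {
		return "", err
	}
	return v.(string), nil
}
```

singleflight deduplicates within a single process, so each pod holds at most one in‑flight read per hot key; combined with the in‑memory hot‑room cache, this keeps a stampede from amplifying into Redis.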
Order timeout mitigation
Cross‑process timeout checks are detached from the main request path. After a timeout, a delayed background job re‑checks the order status. If payment succeeded but the order was not recorded, an alert is raised for manual reconciliation.
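A sketch of that recheck in Go, with hypothetical queryPayment, orderRecorded, and alert helpers; in production the delay would more plausibly be a delayed MQ message than an in‑process timer, so it survives pod restarts:

```go
package order

import (
	"context"
	"log"
	"time"
)

// RecheckAfterTimeout runs off the request path. When the order call timed
// out we cannot tell whether payment succeeded, so a delayed job re-reads
// the authoritative payment status and reconciles.
func RecheckAfterTimeout(orderID string) {
	time.AfterFunc(30*time.Second, func() { // delay is illustrative
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()

		paid, err := queryPayment(ctx, orderID)
		if err != nil {
			log.Printf("recheck of %s failed: %v", orderID, err)
			alert(orderID)
			return
		}
		if paid && !orderRecorded(ctx, orderID) {
			alert(orderID) // paid but unrecorded: manual reconciliation
		}
	})
}

// Hypothetical stand-ins for the payment lookup, order lookup, and pager.
func queryPayment(ctx context.Context, orderID string) (bool, error) { return false, nil }
func orderRecorded(ctx context.Context, orderID string) bool         { return true }
func alert(orderID string)                                           {}
```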
Message‑queue reliability
We employ the following patterns:
Alert on delayed data streams.
Idempotent processing via unique database indexes or distributed locks (a unique‑index sketch follows this list).
Avoid manual commit that could block the queue; use automatic commit with dead‑letter handling for messages that exceed retry limits.
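A sketch of the unique‑index variant using database/sql and go‑sql‑driver/mysql error codes; the processed_msg table is an assumption:

```go
package consume

import (
	"context"
	"database/sql"
	"errors"

	"github.com/go-sql-driver/mysql"
)

// processOnce relies on a UNIQUE index on processed_msg.msg_id: the first
// insert for a message ID wins, and a duplicate-key error (1062) means the
// message was already handled, so redelivery becomes a no-op.
func processOnce(ctx context.Context, db *sql.DB, msgID string, handle func() error) error {
	_, err := db.ExecContext(ctx,
		`INSERT INTO processed_msg (msg_id) VALUES (?)`, msgID)
	if err != nil {
		var me *mysql.MySQLError
		if errors.As(err, &me) && me.Number == 1062 { // ER_DUP_ENTRY
			return nil // duplicate delivery: skip silently
		}
		return err
	}
	return handle()
}
```

In practice the marker insert and the business write would share one transaction, so a crash between them can neither double‑apply nor drop the effect.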
3. Multi‑Active Deployment
Multi‑active architecture ensures service continuity across data centers, even when an entire site fails. Bilibili currently operates a same‑city multi‑active setup.
Redis strategy
Typical Redis multi‑active options include master‑slave replication, read‑write separation, and active‑active synchronization. Bilibili treats Redis purely as a cache and does not synchronize data across sites. Persistent and strongly consistent key‑value storage is provided by the internally developed Taishan KV store, which is Raft‑based and supports both eventual and strong consistency.
Database strategy
Primary‑secondary replication spans data centers. Reads are classified by consistency requirements (a routing sketch follows the list):
Weak consistency: read from the local replica (≈2 ms latency) for non‑critical data.
Strong consistency: read from the primary, accepting the higher cross‑site latency, for critical operations such as payments.
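A sketch of that read split, assuming two pre‑opened handles; the names are illustrative:

```go
package db

import "database/sql"

// Consistency selects which database a read may use.
type Consistency int

const (
	Weak   Consistency = iota // local replica: ~2 ms, may lag the primary
	Strong                    // primary, possibly cross-site: higher latency
)

// Router holds the two handles a service keeps open.
type Router struct {
	Primary *sql.DB // primary, which may sit in the other data center
	Replica *sql.DB // replica in the local data center
}

// Reader returns the handle matching the caller's consistency need:
// payments and other critical reads go to the primary, everything
// else stays local.
func (r *Router) Reader(c Consistency) *sql.DB {
	if c == Strong {
		return r.Primary
	}
	return r.Replica
}
```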
Message‑queue strategy
Within a data center, producers and consumers are colocated. For cross‑site consumption, Kafka MirrorMaker replicates topics to the other data centers. Each message carries a room tag; local consumers filter out messages that belong to other sites.
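A sketch of the consumer‑side filter with segmentio/kafka-go, assuming the room tag travels as a Kafka message header (the header name and site ID are illustrative):

```go
package mq

import "github.com/segmentio/kafka-go"

const localSite = "dc-a" // illustrative site identifier

// isLocal reports whether a mirrored message belongs to this data center,
// based on a room tag carried as a Kafka header. The header name
// "room-site" is an assumption for illustration.
func isLocal(msg kafka.Message) bool {
	for _, h := range msg.Headers {
		if h.Key == "room-site" {
			return string(h.Value) == localSite
		}
	}
	return false // untagged mirrored messages are skipped conservatively
}
```

Each site consumes the full mirrored topic but acts only on messages tagged for its own rooms, so any given gift event is processed once globally.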
Service discovery
Bilibili’s internal discovery system prefers services in the same data center. Documentation:
https://github.com/bilibili/discovery/blob/master/doc/intro.md
Summary of high‑availability measures
To achieve the 99.99% availability target for the live‑gift ecosystem, the following techniques are combined:
Circuit‑breaker and graceful downgrade for unstable user‑specific interfaces.
Rate limiting and hotspot detection.
Multi‑level caching (memory, Redis, Taishan KV).
Kubernetes HPA for automatic scaling.
Database sharding and master‑slave isolation.
Same‑city multi‑active deployment for Redis, DB, and MQ.
Internal service discovery to route traffic to local instances.