Ensuring High Availability of Bilibili Live Gift System: Panel, Gift Feeding, and Multi‑Active Architecture
Bilibili targets 99.99% availability for its live-gift ecosystem through a combination of techniques: pre-loading panel data, circuit breakers with cached fallback responses, database sharding, Kubernetes auto-scaling, Redis hot-key caching with distributed rate limiting, timeout-safe and idempotent order/MQ processing, and multi-active cross-region deployment with region-aware service discovery.
According to Bilibili's Q4 2022 financial report, live-stream viewership peaked at 330 million during the New Year's gala. Gift (prop) feeding is a critical revenue driver, and this article explains how the live-gift ecosystem achieves 99.99% availability.
1. Gift Panel
The panel displays all available gifts when users click the gift icon in the live room. To avoid loading delays, the panel data is pre‑loaded as soon as a user enters the room.
Challenges:
Different tabs (privileged/custom) depend on user‑specific interfaces that may be unstable.
During large events, sudden spikes in room entry traffic can exceed interface TPS.
Solutions (circuit‑breaker + degradation):
If a privileged-gift interface does not respond within 50 ms, the system degrades to an empty response; if the failure rate exceeds 50%, the circuit breaker trips for a cooldown period.
When traffic surges, the gateway detects hot rooms and serves cached panel data directly from memory, ensuring the panel remains usable even if some users temporarily lose privileged gifts.
2. Gift Feeding (Sending Gifts to Anchors)
The gift panel offers various items (blind boxes, treasure chests, etc.) and supports combo actions. The underlying revenue‑center includes order, product, settlement, and other business systems.
Key challenges:
Database instability (timeouts, hardware failures).
Traffic spikes overwhelming order processing capacity.
Order timeout inconsistencies.
Message‑queue (MQ) problems.
Database Instability
Root causes are poor schema design and hardware issues. Mitigations include:
Cluster isolation: separate clusters per UID to limit impact.
Sharding: split orders into 10 databases by UID, then further partition by month.
Daily monitoring: alert on slow SQL, prioritize fixes, and use master‑slave failover for MySQL crashes.
Traffic Surge
For large events, full‑stack load testing is performed in advance. Real‑time solutions:
Kubernetes + HPA for automatic pod scaling.
Redis hot-key detection and fallback to in-memory cache, optionally using the singleflight pattern (golang.org/x/sync/singleflight) to collapse concurrent fetches of the same key.
Distributed rate limiting via a quota‑server.
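The singleflight idea mentioned above is worth unpacking: when a hot key misses the cache, only one goroutine should go to Redis/DB while the rest wait for its result. Here is a stdlib-only sketch of that pattern (production code would typically use golang.org/x/sync/singleflight instead):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// group collapses concurrent loads of the same hot key into one fetch.
type group struct {
	mu    sync.Mutex
	calls map[string]*call
}

type call struct {
	wg  sync.WaitGroup
	val string
}

func (g *group) Do(key string, fn func() string) string {
	g.mu.Lock()
	if g.calls == nil {
		g.calls = make(map[string]*call)
	}
	if c, ok := g.calls[key]; ok {
		g.mu.Unlock()
		c.wg.Wait() // another goroutine is already fetching this key
		return c.val
	}
	c := new(call)
	c.wg.Add(1)
	g.calls[key] = c
	g.mu.Unlock()

	c.val = fn() // only this goroutine hits the backend
	c.wg.Done()

	g.mu.Lock()
	delete(g.calls, key)
	g.mu.Unlock()
	return c.val
}

func main() {
	var g group
	var fetches int32
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // 100 concurrent readers of one hot room
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.Do("room:1024:panel", func() string {
				atomic.AddInt32(&fetches, 1)
				time.Sleep(10 * time.Millisecond) // simulate a slow backend
				return "panel-json"
			})
		}()
	}
	wg.Wait()
	fmt.Println(fetches) // far fewer than 100 backend fetches
}
```

Without this collapsing, a hot-key cache miss turns into a thundering herd against Redis or the database.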
Order Timeout
Two main issues:
Cross-service call timeout (default 250 ms): order DB operations routinely run longer than this, so the gift-order path uses a separately configured, longer timeout to avoid false failures.
Successful payment but a timed-out response: the system re-checks the order status after a 2 s delay; if the order is still unpaid, it returns failure. Mismatched cases are flagged for audit and manual refund verification.
Message Queue Issues
Potential problems: data delay, duplication, loss. Mitigations include:
Alerting on missing settlement data.
Idempotent handling via unique DB indexes or distributed locks.
Manual commit with dead‑letter queue for unrecoverable messages.
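The unique-index idempotency idea reduces to "first insert wins, duplicates are no-ops". A minimal Go sketch, using an in-memory set as a stand-in for a DB table with a UNIQUE index on the message ID:

```go
package main

import (
	"fmt"
	"sync"
)

// seen stands in for a table with a UNIQUE index on message ID:
// the first insert succeeds, duplicate inserts are rejected.
type seen struct {
	mu  sync.Mutex
	ids map[string]bool
}

func (s *seen) firstTime(msgID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.ids[msgID] {
		return false // duplicate delivery: skip settlement
	}
	s.ids[msgID] = true
	return true
}

func handleSettlement(s *seen, msgID string, settle func()) {
	if s.firstTime(msgID) {
		settle() // applied exactly once even if the MQ redelivers
	}
}

func main() {
	s := &seen{ids: make(map[string]bool)}
	count := 0
	for i := 0; i < 3; i++ { // simulate triple delivery of one message
		handleSettlement(s, "settle-42", func() { count++ })
	}
	fmt.Println(count) // 1
}
```

With a real database, the same effect comes for free: a duplicate insert violates the unique index, and the consumer treats that error as "already processed" and acks the message.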
3. Multi‑Active Deployment
Multi‑active ensures business continuity even if an entire data center fails. Bilibili currently operates same‑city multi‑active with cross‑region calls to Redis, DB, and MQ. The next phase prioritizes local‑region resources.
Redis Strategies
Master‑slave replication per data center.
Read‑write separation to reduce latency.
Active-active mode (bidirectional synchronization across centers) is not used, due to the complexity of keeping the copies consistent.
Bilibili’s approach treats Redis as a cache without cross‑region sync, using the internal Taishan KV store (Raft‑based) for features requiring strong consistency (distributed locks, etc.).
Database Strategy
Primary‑secondary with cross‑region sync. Weak consistency reads from local replicas; strong consistency (e.g., payments) reads from the primary, accepting higher latency.
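The routing decision described here is small enough to make explicit. A sketch with illustrative endpoint names (the real routing lives in the data-access layer):

```go
package main

import "fmt"

// chooseReplica routes a read: strong-consistency reads (e.g. payments)
// always go to the primary, possibly cross-region; weak-consistency reads
// use the local-region replica for lower latency.
func chooseReplica(strong bool, localReplica, primary string) string {
	if strong {
		return primary // accept higher latency for correctness
	}
	return localReplica
}

func main() {
	fmt.Println(chooseReplica(true, "db-replica.shanghai", "db-primary.beijing"))  // db-primary.beijing
	fmt.Println(chooseReplica(false, "db-replica.shanghai", "db-primary.beijing")) // db-replica.shanghai
}
```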
MQ Strategy
Same‑region production/consumption.
Cross‑region consumption via Kafka Mirror, tagging messages by region to allow consumers to filter.
Key considerations: message ordering across regions, hot‑standby recovery.
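The region-tagging scheme can be sketched as a consumer-side filter. The `Msg` shape and region codes are assumptions for the example; in practice the tag would travel in Kafka message headers:

```go
package main

import "fmt"

// Msg carries a producing-region tag so that, after Kafka Mirror copies
// topics across regions, each consumer can drop the mirrored duplicates.
type Msg struct {
	Region string
	Body   string
}

// consumeLocal keeps only messages produced in the consumer's own region,
// avoiding double-processing of mirrored copies.
func consumeLocal(region string, in []Msg) []string {
	var out []string
	for _, m := range in {
		if m.Region == region {
			out = append(out, m.Body)
		}
	}
	return out
}

func main() {
	stream := []Msg{{"sh", "gift-1"}, {"bj", "gift-2"}, {"sh", "gift-3"}}
	fmt.Println(consumeLocal("sh", stream)) // [gift-1 gift-3]
}
```

During failover, the filter can be relaxed so the surviving region picks up the mirrored backlog, which is where the ordering and hot-standby concerns above come in.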
Service Discovery
Uses Bilibili’s internal discovery framework to prefer local‑region services.
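The "prefer local region" policy amounts to a simple selection rule over the instances discovery returns. A sketch with illustrative types (the internal discovery API is not public):

```go
package main

import "fmt"

// Instance of a downstream service as returned by discovery.
type Instance struct {
	Addr   string
	Region string
}

// pickInstance prefers a same-region instance and falls back to any other,
// so cross-region hops happen only when the local region has no capacity.
func pickInstance(local string, all []Instance) (Instance, bool) {
	for _, in := range all {
		if in.Region == local {
			return in, true
		}
	}
	if len(all) > 0 {
		return all[0], true // cross-region fallback
	}
	return Instance{}, false
}

func main() {
	all := []Instance{{"10.0.1.5:9000", "bj"}, {"10.0.2.7:9000", "sh"}}
	in, _ := pickInstance("sh", all)
	fmt.Println(in.Addr) // 10.0.2.7:9000
}
```

A production version would also load-balance among same-region instances rather than always taking the first match.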
In summary, achieving four‑nines availability for the live‑gift system involves a combination of circuit breaking, rate limiting, multi‑level caching, Kubernetes auto‑scaling, sharding, master‑slave isolation, multi‑active deployment, and service discovery.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.