
Technical Assurance for High‑Write Live‑Streaming Gift Scenarios

The technical‑assurance team secured Bilibili’s high‑write live‑stream gift system by expanding capacity, isolating hot keys, refactoring pipelines, adding asynchronous writes, employing horizontal scaling and full‑link load testing, converting uncertain dependencies into graceful fallbacks, and deploying dual‑active, chaos‑engineered disaster‑resilience architecture aligned with business usage patterns.

Bilibili Tech

In the final stage of the S12 finals, the technical assurance team reflected on how to guarantee the stability of the high‑write, revenue‑driving gift‑sending feature in Bilibili live streams. This article focuses on one question: how do you provide technical assurance for high‑write scenarios?

Scenario Overview

The gift‑sending flow is similar to an e‑commerce transaction: users spend real money and expect an instant visual effect in return, and any failed delivery must trigger a refund. This chain requires high stability and consistency. The business can be summarized as three pillars: a gift‑centric transaction system, a consumption‑based identity metric system, and an activity‑driven value‑added system.

Both synchronous and asynchronous paths are write‑heavy, with amplification factors ranging from a few to dozens, making latency, consistency, and real‑time guarantees critical.

Technical Assurance Starting Point

Capacity improvement is the first baseline: anticipate traffic peaks (especially during large events) and ensure the system can handle them. The team emphasizes two methods: (1) estimate peak load and provision resources with a safety margin, applying throttling if exceeded; (2) introduce asynchronous write pipelines to trade latency for higher throughput.
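The second method, trading latency for throughput, can be sketched as a buffered write pipeline: the request path only enqueues, while a background worker persists records in bulk. This is an illustrative sketch, not Bilibili's actual pipeline; `AsyncWriter` and its batch size are hypothetical names.

```python
# Sketch of an asynchronous write pipeline (illustrative names).
# Requests are acknowledged immediately; a background worker persists
# them in batches, trading a little latency for front-end throughput.
import queue
import threading

class AsyncWriter:
    def __init__(self, flush_batch=100):
        self.buffer = queue.Queue()
        self.flush_batch = flush_batch
        self.persisted = []          # stands in for the real database
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def write(self, record):
        # Fast path: enqueue and return immediately.
        self.buffer.put(record)

    def _drain(self):
        batch = []
        while not self._stop.is_set() or not self.buffer.empty():
            try:
                batch.append(self.buffer.get(timeout=0.05))
            except queue.Empty:
                pass
            if batch and (len(batch) >= self.flush_batch or self.buffer.empty()):
                self.persisted.extend(batch)   # one bulk write instead of many
                batch = []

    def close(self):
        self._stop.set()
        self._worker.join()

writer = AsyncWriter()
for i in range(250):
    writer.write({"gift_id": i})
writer.close()
print(len(writer.persisted))  # 250
```

In production the `persisted` list would be a database or message queue, and the batch flush would be a single bulk insert.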

Core Variables

The primary variable is traffic pressure during peak moments, which can trigger cascading failures across downstream services. The goal is to increase capacity to meet this pressure.

Revisiting Murphy’s Law

Beyond capacity, the team identifies potential failure points: network switches, physical machines, load balancers, dependent services, and unexpected incidents. Since these cannot be fully predicted, the strategy combines pre‑emptive mitigation (strengthening/weakening dependencies) and rapid recovery mechanisms.

Methods of Technical Assurance

1. Capacity Enhancement

• Resource review – selective migration of MySQL instances to ensure each master runs on a dedicated physical machine.

• Database load reduction – cut unnecessary read/write operations and use caching to offload disk I/O and CPU.

• Complexity governance – refactor the gift‑transaction pipeline to eliminate write amplification and tight coupling, introducing single‑responsibility wallet and settlement services, and abstracting accounting via a clearing layer.

• Hot‑key mitigation – an SDK automatically detects hot Redis keys and redirects them to in‑memory storage.

• Horizontal scalability – a generic data‑refresh component decouples application scaling from database load, allowing independent scaling of services.

• Full‑link load testing – isolated traffic mirroring for cache, DB, and message queues prevents data pollution while validating end‑to‑end performance under realistic load.
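The hot‑key mitigation above can be illustrated with a small sketch: keys whose request rate crosses a threshold within a window are pinned to process‑local memory instead of hitting the shared cache. This is a toy version under assumed parameters, not Bilibili's actual SDK.

```python
# Illustrative hot-key detector (hypothetical API, not the real SDK):
# keys exceeding `threshold` hits per window are briefly served from
# process-local memory, shielding Redis from a single hot key.
import time
from collections import defaultdict

class HotKeyCache:
    def __init__(self, backend_get, threshold=1000, window=1.0, local_ttl=0.5):
        self.backend_get = backend_get    # fallback loader, e.g. a Redis GET
        self.threshold = threshold        # hits per window that mark a key hot
        self.window = window
        self.local_ttl = local_ttl
        self.counts = defaultdict(int)
        self.window_start = time.monotonic()
        self.local = {}                   # key -> (value, expires_at)

    def get(self, key):
        now = time.monotonic()
        if now - self.window_start > self.window:   # rotate the counting window
            self.counts.clear()
            self.window_start = now
        self.counts[key] += 1

        entry = self.local.get(key)
        if entry and entry[1] > now:                # fresh local copy: no Redis hit
            return entry[0]

        value = self.backend_get(key)               # normal shared-cache path
        if self.counts[key] >= self.threshold:      # hot: pin locally for a bit
            self.local[key] = (value, now + self.local_ttl)
        return value

calls = []
cache = HotKeyCache(backend_get=lambda k: calls.append(k) or f"v:{k}", threshold=5)
for _ in range(10):
    cache.get("gift:top1")
print(len(calls))  # 5: from the fifth hit onward the key is served locally
```

The short local TTL bounds staleness; the real SDK would also need invalidation and per-instance memory limits.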

2. Turning Uncertainty into Certainty

Key secondary variables are slow queries and DB connection counts. The team lowered slow‑query thresholds, routed read traffic through a DB proxy, and enforced connection limits.
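Enforcing a connection limit amounts to capping concurrency and shedding excess load quickly. A minimal sketch, using a semaphore as a stand‑in for what the article's DB proxy enforces (`BoundedPool` is a hypothetical name):

```python
# Minimal sketch of an enforced connection cap (illustrative; the
# article's actual enforcement lives in a DB proxy, not app code).
import threading

class BoundedPool:
    def __init__(self, max_conns=10, acquire_timeout=0.05):
        self._sem = threading.Semaphore(max_conns)
        self._timeout = acquire_timeout

    def acquire(self):
        # Fail fast instead of queueing unbounded waiters on the database.
        if not self._sem.acquire(timeout=self._timeout):
            raise RuntimeError("connection limit reached, request shed")
        return object()   # stands in for a real connection handle

    def release(self, conn):
        self._sem.release()

pool = BoundedPool(max_conns=2)
c1, c2 = pool.acquire(), pool.acquire()
try:
    pool.acquire()          # a third concurrent caller is rejected quickly
except RuntimeError as e:
    print(e)                # connection limit reached, request shed
pool.release(c1)
pool.release(c2)
```

Failing fast here is deliberate: a bounded, quickly rejected request is far cheaper than a pile-up of waiters holding the database's connection slots.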

Strong vs. weak dependencies are distinguished; weak dependencies receive short timeouts and can be degraded, while strong dependencies trigger fallback strategies such as traffic shifting via Bilibili’s Invoker multi‑active platform.
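The weak‑dependency treatment can be sketched as a short timeout plus a degraded default, so a slow non‑critical service cannot stall the gift‑sending path. The function names and the badge example below are illustrative; the strong‑dependency fallback via Invoker is a separate mechanism.

```python
# Sketch of weak-dependency handling: short timeout, degraded default.
# `fetch_fan_badge` is a hypothetical weak dependency of gift sending.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def fetch_fan_badge(user_id):
    time.sleep(0.5)                  # simulate a degraded downstream service
    return {"badge": "gold"}

def call_weak(fn, *args, timeout=0.1, fallback=None):
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        return fallback              # degrade instead of failing the request
    finally:
        pool.shutdown(wait=False)    # don't block on the slow call

result = call_weak(fetch_fan_badge, 42, fallback={"badge": None})
print(result)  # {'badge': None}
```

A strong dependency would instead propagate the failure upward and rely on traffic shifting rather than a silent default.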

Data consistency is achieved through near‑real‑time reconciliation and daily batch reconciliation, providing eventual consistency for gifts, orders, and wallets.
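A reconciliation pass is conceptually a diff between two ledgers. A toy sketch, assuming orders and wallet deductions are keyed by order ID (the data shapes here are hypothetical):

```python
# Toy reconciliation pass (illustrative): compare gift orders against
# wallet deductions and flag mismatches for repair, giving the
# eventual consistency the article describes.
def reconcile(orders, wallet_entries):
    """orders / wallet_entries: dicts of order_id -> amount."""
    mismatches = []
    for order_id, amount in orders.items():
        debited = wallet_entries.get(order_id)
        if debited != amount:
            mismatches.append((order_id, amount, debited))
    return mismatches

orders = {"o1": 100, "o2": 250, "o3": 30}
wallet = {"o1": 100, "o2": 250}          # o3 was never debited
print(reconcile(orders, wallet))  # [('o3', 30, None)]
```

The near-real-time job would run this over a sliding window of recent orders, with the daily batch sweeping the full day's ledger as a backstop.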

Chaos engineering injects faults at critical points to verify that automatic refunds and fallback mechanisms work as expected.

3. Disaster Resilience

Infrastructure failures (e.g., data‑center power loss, SLB glitches) are mitigated through redundancy and geographic dispersion. A dual‑active (同城双活, same‑city dual‑active) architecture provides two primary traffic paths sharing a replicated DB cluster; failover is automated, but for strong‑consistency services like wallets, manual decision‑making is required.

Operational readiness includes traffic‑switch drills and detailed analysis of potential data loss during master‑slave switchover.

Conclusion

The quality of technical assurance depends on two factors: (1) continuous investment in infrastructure (dual‑active sites, full‑link load testing, chaos engineering, traffic‑control platforms) and (2) business‑driven implementation that aligns technical safeguards with real‑world usage patterns. The article pays tribute to the team members who made the S12 guarantee possible.

Live Streaming · Chaos Engineering · SRE · Capacity Planning · Database Scaling · High Write · Technical Assurance
Written by

Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.
