Stability Engineering Practices for Large-Scale Live Streaming: Bilibili's S11 World Championship Case Study
To deliver a flawless live broadcast of the 2021 League of Legends S11 World Championship to over 100 million viewers, Bilibili mobilized hundreds of engineers for four months. The effort established strict engineering standards, modeled dozens of user scenarios, estimated traffic, and ran layered stress and chaos tests. Combined with automated and manual degradation, detailed SOPs, rate-limiting safeguards, and on-site monitoring, these practices kept the system stable throughout the event.
On November 7, 2021, the League of Legends S11 World Championship final between Chinese team EDG and Korean team DK attracted over 100 million viewers. Bilibili, the exclusive live‑streaming rights holder, had to ensure a smooth broadcast despite traffic far exceeding expectations.
To guarantee stability, Bilibili undertook a four-month effort involving hundreds of engineers. The project spanned web, mobile, backend, operations, streaming, and procurement, engaging more than eight internal teams across over ten major work streams.
2.1 Overall Planning – The work began in July and continued through the final. Sixteen pre‑defined standards were established, such as stateless data services, a minimum of two instances per service cluster, and clear ownership and SLA for every dependency.
2.2 Scenario Definition – Forty‑plus user scenarios were identified, each described with name, priority (P0‑PX), target platform (iOS, Android, Web, PC), detailed 5W1H description, and interaction diagram.
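A scenario record like the one described above can be modeled as a small structured type. This is a minimal sketch assuming a plausible schema; the field names and the example values are illustrative, not Bilibili's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """One user scenario record (illustrative schema, not Bilibili's)."""
    name: str
    priority: str                                  # "P0" (core) .. "P3" (non-core)
    platforms: list = field(default_factory=list)  # e.g. ["iOS", "Android", "Web", "PC"]
    description: str = ""                          # 5W1H: who, what, when, where, why, how

# hypothetical example entry for the "enter live room" scenario
enter_room = Scenario(
    name="enter live room",
    priority="P0",
    platforms=["iOS", "Android", "Web", "PC"],
    description="Viewer opens the event room as the match starts",
)
```

Keeping scenarios as structured data lets priority tiers and platform coverage be queried mechanically when planning tests.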
2.3 Traffic Estimation – For each scenario, expected QPS was calculated using historical traffic models or business‑to‑traffic conversion formulas (e.g., GMV/DAU). An example diagram shows the services and interfaces involved in the “enter live room” scenario.
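A business-to-traffic conversion of the kind mentioned above can be sketched as follows. The formula and its parameters (peak concentration, window size, requests per user) are assumptions for illustration; the source does not give Bilibili's exact model.

```python
def estimate_peak_qps(dau, requests_per_user, peak_concentration=0.2, peak_window_s=3600):
    """Rough DAU-to-QPS conversion (illustrative, not Bilibili's exact formula):
    assume `peak_concentration` of the day's requests land inside a window of
    `peak_window_s` seconds, and spread them evenly across that window."""
    total_requests = dau * requests_per_user
    return total_requests * peak_concentration / peak_window_s

# hypothetical numbers: 100M viewers, ~3 enter-room calls each,
# 20% of traffic concentrated in the hour around the final
qps = estimate_peak_qps(dau=100_000_000, requests_per_user=3)
```

The result feeds directly into the scaled stress-test targets described in the next section.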
2.4 Stress Testing – Three rounds of load testing were performed: (1) baseline without scaling to find bottlenecks, (2) scaled testing based on projected S11 traffic, and (3) validation and regression. Pre‑test steps included defining test interfaces, enabling rate limits, notifying owners of dependent services (MySQL, TiDB, Redis, MQ), and consolidating multi‑scenario tests. During testing, pressure was applied gradually, monitoring all metrics, and stopping on errors, latency spikes, or resource saturation. Detailed observations were recorded for later analysis.
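The "apply pressure gradually, stop on errors or latency spikes" procedure can be sketched as a step-load loop. The thresholds and the load-driver interface here are examples, not the tooling Bilibili actually used.

```python
def ramp_load_test(send_batch, step_qps=100, max_qps=1000, step_seconds=1,
                   max_error_rate=0.01, max_p99_ms=200):
    """Step the offered load up gradually and stop at the first step where
    errors or tail latency exceed the abort thresholds (illustrative sketch)."""
    stats = {}
    for qps in range(step_qps, max_qps + step_qps, step_qps):
        stats = send_batch(qps, step_seconds)   # caller-supplied load driver
        if stats["error_rate"] > max_error_rate or stats["p99_ms"] > max_p99_ms:
            return {"bottleneck_qps": qps, "stats": stats}  # record for analysis
    return {"bottleneck_qps": None, "stats": stats}         # passed the full ramp

# toy driver standing in for a real load generator: latency degrades past 600 QPS
def fake_driver(qps, seconds):
    return {"error_rate": 0.0, "p99_ms": 50 if qps <= 600 else 500}

result = ramp_load_test(fake_driver)
```

A real run would replace `fake_driver` with a load generator and capture the full metric set for the post-test analysis the section describes.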
2.5 Chaos Engineering – Fault‑injection experiments targeted common failure modes: database unavailability, cache cluster issues, RPC service failures, network jitter, and node crashes. The goal was to verify end‑to‑end high‑availability, including monitoring, alerting, diagnosis, and recovery procedures.
2.6 Degradation Strategies – Both automatic (code‑level error handling, circuit breakers) and manual (feature toggles via configuration center) degradation methods were defined. Services were classified into core (P0) and non‑core (P1‑P3) tiers, following the 20/80 principle, to focus protection on the most critical components.
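The automatic side of this, a circuit breaker, can be sketched minimally as follows. The threshold, fallback value, and class shape are illustrative assumptions, not Bilibili's implementation.

```python
class CircuitBreaker:
    """Minimal automatic-degradation sketch: after `threshold` consecutive
    failures the circuit opens and the fallback answer is served instead of
    calling the dependency (illustrative, no half-open recovery state)."""
    def __init__(self, fallback, threshold=3):
        self.fallback = fallback
        self.threshold = threshold
        self.failures = 0

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.threshold:   # circuit open: degrade immediately
            return self.fallback
        try:
            result = fn(*args, **kwargs)
            self.failures = 0                 # a healthy call closes the circuit
            return result
        except Exception:
            self.failures += 1
            return self.fallback

# hypothetical non-core (P1) ranking widget degrades to an empty list
breaker = CircuitBreaker(fallback=[], threshold=3)

def failing_ranking_service():
    raise TimeoutError("ranking service overloaded")
```

Manual degradation is the same idea driven by a configuration-center toggle instead of an error counter.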
2.7 SOP / Pre‑plan – For each scenario, a Standard Operating Procedure was drafted covering symptom identification, response steps, responsible personnel, and execution details, enabling rapid, decision‑driven incident handling.
2.8 Rate Limiting – When sudden traffic spikes or resource anomalies threatened stability, rate‑limiting was applied as a last‑resort mitigation. Policies included SLB limits, Ekango limits, and WAF per‑IP throttling, with clear criteria for activation and deactivation.
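A per-client throttle of the kind applied at the WAF layer is commonly built on a token bucket. The sketch below is a generic single-process version for illustration; the production limits described above ran at the SLB, gateway, and WAF layers.

```python
import time

class TokenBucket:
    """Simple token-bucket limiter (illustrative sketch, not Bilibili's
    gateway implementation): refill `rate` tokens per second up to
    `capacity`; each allowed request spends one token."""
    def __init__(self, rate, capacity):
        self.rate = rate                  # tokens added per second
        self.capacity = capacity          # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True                   # request admitted
        return False                      # request throttled
```

Per-IP throttling would keep one bucket per client key; activation and deactivation criteria map to installing or removing the bucket's limits.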
2.9 On‑site Assurance – Dedicated on‑call staff monitored dashboards, logs, and alerts, recorded incident handling, and reported progress in real time.
The case study concludes that the systematic application of planning, scenario modeling, traffic estimation, stress testing, chaos engineering, degradation, SOPs, and rate limiting ensured a successful live broadcast. Future work will expand on these practices in a dedicated follow‑up article.
Bilibili Tech