How Bilibili Guaranteed Seamless Live Streaming for the League of Legends S12 Finals

Bilibili’s S12 technical guarantee team coordinated dozens of engineering groups, performed resource estimation, built a shared resource pool, applied chaos engineering, high‑availability architecture, and systematic performance testing to ensure the League of Legends World Championship livestream remained stable and responsive under peak traffic.

dbaplus Community

Nov 28, 2022

Background

The League of Legends World Championship (S12) generated a peak concurrent viewership of over 310 million on Bilibili, creating a severe stability challenge for the live‑streaming infrastructure.

Project Initiation

In late July, a dedicated S12 technical-guarantee project was launched with the goal of a "tea-time guarantee": the event should run without manual intervention. Approximately 300 engineers from business development, SRE, infrastructure, DBA and big-data teams participated.

Resource Planning

Resource needs were divided into pre‑event, during‑event and post‑event phases and estimated using historical S11 data. The gap calculation used the following formulas:

delta = (a - b) / b + 1  // a: last year peak, b: last year daily peak
required = c * delta / 0.4 - d  // c: this year daily peak, d: current capacity

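The 0.4 divisor corresponds to the 40 % figure: total capacity (required + d) is sized so that the projected event peak consumes about 40 % of it, with the remainder left as headroom. Estimates also considered business PCU (peak concurrent users) targets, upcoming feature releases and capacity constraints.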
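A minimal worked sketch of the two formulas above, in Python. The function names and all input numbers are made up for illustration (imagine them as CPU cores); treating 0.4 as a target utilization is a reading of the formula, not something the source spells out.

```python
def growth_factor(last_year_event_peak: float, last_year_daily_peak: float) -> float:
    """delta = (a - b) / b + 1, which is algebraically the ratio a / b."""
    a, b = last_year_event_peak, last_year_daily_peak
    return (a - b) / b + 1


def required_extra_capacity(
    this_year_daily_peak: float,
    delta: float,
    current_capacity: float,
    target_utilization: float = 0.4,  # the 0.4 divisor from the formula
) -> float:
    """required = c * delta / 0.4 - d."""
    projected_event_peak = this_year_daily_peak * delta
    return projected_event_peak / target_utilization - current_capacity


# Hypothetical inputs, not figures from the article.
delta = growth_factor(last_year_event_peak=3000, last_year_daily_peak=1000)
gap = required_extra_capacity(
    this_year_daily_peak=1200, delta=delta, current_capacity=5000
)
print(f"delta={delta:.2f}, extra capacity needed={gap:.0f}")
```

With these inputs the event peak is three times the daily peak, so a daily peak of 1200 projects to 3600, which at 40 % utilization calls for roughly 9000 units in total, or about 4000 more than the assumed 5000 already in place.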

Resource‑Pool Governance

Previously live‑stream resources were isolated, requiring manual relabeling and migration after each broadcast. The new approach merged live‑stream resources into a shared pool, standardized host configuration and removed CPUSET bindings that caused CPU throttling. A code fix introduced a random sleep (10‑500 ms) before worker timers to avoid timer collisions:

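protected function funcWorkerStart() {
    // Seed per worker (time + PID) so each process draws a different delay.
    mt_srand((int) ((microtime(true) + posix_getpid()) * 10000));
    // usleep() takes microseconds: sleep 10-500 ms to stagger worker timers.
    usleep(mt_rand(10000, 500000));
    // ...
    // Reload configuration every 1000 ms, now at a per-worker offset.
    swoole_timer_tick(1000, $config_reload_cb);
}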

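Linux kernel upgrades, together with the automaxprocs library for Go services (which sets GOMAXPROCS to match the container's CPU quota instead of the host's core count), reduced cgroup leaks and improved CPU isolation.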
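To illustrate why matching the runtime's thread count to the container quota matters, here is a small Python sketch of the underlying arithmetic. It is a conceptual approximation, not the automaxprocs source; the function name and the rounding and minimum choices are assumptions, and the quota/period parameters follow the cgroup v1 CFS naming (cpu.cfs_quota_us, cpu.cfs_period_us).

```python
import math


def cpu_quota_to_procs(
    cfs_quota_us: int,
    cfs_period_us: int,
    host_cpus: int,
    min_procs: int = 1,
) -> int:
    """Derive a parallelism limit from a CFS CPU quota.

    A non-positive quota (cgroup v1 reports -1) means no limit is set,
    so the host's CPU count is used unchanged.
    """
    if cfs_quota_us <= 0 or cfs_period_us <= 0:
        return host_cpus
    # quota / period is the number of CPUs' worth of time per period.
    limit = math.floor(cfs_quota_us / cfs_period_us)
    # Never go below min_procs or above what the host actually has.
    return max(min_procs, min(limit, host_cpus))


# A container limited to 4 CPUs on a 64-core host.
print(cpu_quota_to_procs(400_000, 100_000, host_cpus=64))
```

Without such an adjustment, a runtime that sees all 64 host cores may schedule far more runnable threads than the quota allows, which burns through the quota early in each period and shows up as CPU throttling.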

Business‑Scenario Breakdown

More than 30 features were classified into three priority levels (P0, P1, P2) based on DAU, revenue and downstream dependencies. A scenario map visualized service call graphs to aid risk identification and optimization.
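One way to picture a scenario map is as a call graph that can be walked from each entry point. The sketch below is only an illustration: the service names and graph are hypothetical, and the source does not describe how Bilibili's scenario map is stored or queried.

```python
from collections import deque

# Hypothetical call graph: caller -> services it calls directly.
CALL_GRAPH = {
    "room-entry": ["room-info", "danmaku-history"],
    "room-info": ["user-profile", "redis-room-cache"],
    "danmaku-history": ["redis-danmaku-cache"],
    "gift-send": ["wallet", "room-info"],
    "wallet": ["mysql-wallet"],
}


def downstream(graph: dict, entry: str) -> list:
    """Return every service reachable from entry, in breadth-first order."""
    seen = {entry}
    order = []
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


# Dependencies shared by two high-priority scenarios are natural
# candidates for extra capacity, rate limits and chaos experiments.
shared = set(downstream(CALL_GRAPH, "room-entry")) & set(
    downstream(CALL_GRAPH, "gift-send")
)
print(sorted(shared))
```

Listing shared downstream dependencies this way makes it easier to see which components would take several scenarios down at once.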

High‑Availability Architecture

Eliminate single points of failure by deploying multiple instances for applications, job schedulers and resource‑pool hosts.

Adaptive traffic protection with client‑side throttling and server‑side rate limiting to smooth post‑broadcast traffic spikes.

Fine‑grained chaos engineering platform (beyond ChaosBlade) targeting specific interfaces and users.

Active‑active city‑level failover that automatically cuts over traffic within five minutes during data‑center outages.

Gateway migration from an Envoy-based C++ gateway to a containerized Go gateway that supports Horizontal Pod Autoscaler (HPA) scaling and granular rate-limit/downgrade controls. The gateway source is available at https://github.com/go-kratos/gateway.
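As an illustration of the server-side rate limiting mentioned in the list above, here is a minimal token-bucket sketch in Python. The token bucket is a common generic technique chosen for the example; the source does not say which algorithm Bilibili's gateway uses, and the class and parameter names are hypothetical.

```python
import time


class TokenBucket:
    """Allow up to `rate` requests per second with bursts of `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should reject or degrade the request


bucket = TokenBucket(rate=100, capacity=200)
if not bucket.allow():
    print("429 Too Many Requests")
```

A rejected request can be answered with a cheap fallback response, which is where rate limiting and downgrade controls tend to work together.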

Performance Testing

Load tests ran against core scenarios (room entry, gifting, chat, homepage) in production during off-peak windows. Monitoring covered CPU, Redis, MySQL, TiDB and message-queue metrics. Test steps:

Gradual ramp‑up starting with short 1‑minute bursts.

Continuous monitoring; abort if any metric exceeds predefined thresholds.

Record QPS versus resource pressure for post‑analysis.
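The three steps above can be sketched as a small control loop. This is a hedged illustration: `run_stage`, the metric names and the threshold values are hypothetical stand-ins for a real load generator and monitoring system.

```python
def ramp_up(stages, run_stage, thresholds):
    """Run load stages in order and stop at the first threshold breach.

    stages:     target QPS values, smallest first
    run_stage:  callable(qps) -> dict of observed metrics for a short burst
    thresholds: dict of metric name -> maximum allowed value
    Returns (records, breached), where records pairs each QPS with its
    metrics for post-analysis and breached lists the metrics over limit.
    """
    records = []
    for qps in stages:
        metrics = run_stage(qps)
        records.append((qps, metrics))
        breached = [
            name for name, limit in thresholds.items()
            if metrics.get(name, 0) > limit
        ]
        if breached:
            return records, breached  # abort: do not push load further
    return records, []


def fake_stage(qps):
    # Stand-in for a real 1-minute burst: pretend CPU scales with QPS.
    return {"cpu_percent": qps / 100, "redis_latency_ms": 1.0}


records, breached = ramp_up(
    stages=[1000, 2000, 4000, 8000],
    run_stage=fake_stage,
    thresholds={"cpu_percent": 60, "redis_latency_ms": 5},
)
print(len(records), breached)
```

Keeping the recorded (QPS, metrics) pairs makes it possible to plot load against resource pressure afterwards and to derive limits from the last stage that stayed within bounds.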

Write traffic was isolated via a “Mirror” layer that redirected writes away from production databases.

Guarantee Plans

Business‑level: scenario‑map‑driven mitigations, chaos‑engineered safeguards and limits derived from performance tests.

Technical‑level: gateway rate‑limit, per‑zone/caller service quota, gateway downgrade, HPA and hybrid‑cloud spillover.
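For the HPA item above, the core scaling rule documented for the Kubernetes Horizontal Pod Autoscaler is desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric). The Python sketch below applies that rule with min/max bounds; it deliberately omits the tolerance band and stabilization windows of the real controller, and the example numbers are made up.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Scale replicas in proportion to how far the metric is from target."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds of the autoscaler.
    return max(min_replicas, min(desired, max_replicas))


# 10 pods at 80 % average CPU with a 40 % target.
print(desired_replicas(10, current_metric=80, target_metric=40,
                       min_replicas=2, max_replicas=50))
```

The upper bound matters during an event: once it is reached, further load has to be absorbed by rate limiting, downgrade or spillover to additional capacity.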

Quality Control

For the duration of the event, a strict change-review process (the S12 strong-control upgrade) was enforced. An in-house alert-collaboration platform provided unified routing, subscription, aggregation and status tracking.

On‑Site Operations

A command‑center staffed by a commander and on‑call engineers from business, development and infrastructure teams handled priority assessment, emergency response and technical decision‑making.

Conclusion & Outlook

Through kernel upgrades, CPUSET removal, resource‑pool consolidation, containerized gateway migration, active‑active deployment, HPA, chaos engineering and full‑link load testing, Bilibili delivered a flawless S12 live‑streaming experience without forced throttling, downgrade or circuit‑breaker events. Future work includes deeper cloud‑native adoption, automated resource scheduling and broader full‑link testing coverage.
