
Technical Assurance for Bilibili S12 Live Streaming Event: Architecture, Resource Management, and High Availability

To ensure “tea‑time” reliability for Bilibili’s 2022 S12 League of Legends championship, a cross‑functional technical‑assurance project introduced shared resource pools, CPUSET removal, multi‑instance HA architecture, adaptive throttling, chaos‑engineered fault injection, a new Golang gateway, extensive load testing, and coordinated on‑site duty, delivering uninterrupted live streaming without forced throttling.

Bilibili Tech

The 2022 S12 World Championship of League of Legends attracted massive viewership, and Bilibili served as the official live streaming platform. To guarantee system stability under peak traffic, a dedicated S12 technical assurance project was launched in late July, aiming for a "tea‑time" level of reliability.

Phases: The work was divided into three stages — pre‑event, during the event, and post‑event — as illustrated in the accompanying diagram.

Project Initiation: Approximately 300 engineers from business development, SRE, infrastructure, DBA, and big‑data teams aligned on goals and established a unified roadmap.

Resource Estimation: Using historical S11 capacity data, a growth factor (delta) was calculated and the additional capacity required was derived with the formula gap = c * delta / 0.4 - d, where c is the projected daily peak load, d is the currently available capacity, and dividing by 0.4 sizes for a 40% target utilization so that headroom is preserved.
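The estimation formula can be sketched directly in code. This is a minimal illustration of the stated formula only; the function name and the sample numbers are hypothetical, not taken from the original planning sheets.

```go
package main

import (
	"fmt"
	"math"
)

// capacityGap estimates the additional capacity to procure.
// c: projected daily peak load; delta: growth factor derived from
// S11 historical data; d: currently available capacity. Dividing by
// 0.4 sizes the fleet so steady-state utilization stays at or below 40%.
func capacityGap(c, delta, d float64) float64 {
	required := c * delta / 0.4      // capacity needed to hold the 40% threshold
	return math.Max(required-d, 0) // a negative gap means no extra capacity is needed
}

func main() {
	// Hypothetical numbers, for illustration only.
	fmt.Printf("gap = %.0f units\n", capacityGap(1000, 1.2, 2000))
}
```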

Resource Pool Governance: Previously, live‑stream resources were isolated, requiring manual label changes and causing bottlenecks. A shared resource pool was introduced, with standardized host configurations to enable pooling and reduce waste.

CPUSET Removal: High‑frequency CPU throttling in PHP services (Swoole workers) was mitigated by randomizing worker timer start times; Golang services adopted automaxprocs so that GOMAXPROCS matches the container's CFS CPU quota rather than the host's core count.

```php
protected function funcWorkerStart()
{
    return function (\swoole_server $Server, $iWorkerID) {
        // Some services register a worker timer in init_callback to load config,
        // but too many concurrent reloads cause CPU throttling, so stagger each
        // worker's start with a random 10-500 ms sleep.
        mt_srand((microtime(true) + posix_getpid()) * 10000);
        usleep(mt_rand(10000, 500000));
        Main::init();

        // Reload the config file periodically.
        $config_reload_cb = function () use ($C, $Server, $iWorkerID) {
            app(Metrics::class)->flush();
            // load config...
            // config check...
            // reload worker...
        };

        // Register the timer (fires every 1000 ms).
        swoole_timer_tick(1000, $config_reload_cb);
    };
}
```
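On the Golang side, the article names automaxprocs, which reads the container's CFS quota at startup and caps GOMAXPROCS accordingly. The sketch below reimplements that idea against the cgroup v1 file paths as an assumption (the real library also handles cgroup v2 and other edge cases); it falls back to the current setting when no quota is present.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// setMaxProcsFromCgroup mirrors what go.uber.org/automaxprocs does:
// read the CFS quota and period from cgroup v1 and cap GOMAXPROCS at
// quota/period, so the runtime never schedules more parallel threads
// than the container may run — the mismatch that causes CFS throttling.
func setMaxProcsFromCgroup() int {
	quota := readInt("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
	period := readInt("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
	if quota <= 0 || period <= 0 {
		// No quota set (or not running in a container): leave GOMAXPROCS alone.
		return runtime.GOMAXPROCS(0)
	}
	procs := int(quota / period)
	if procs < 1 {
		procs = 1
	}
	runtime.GOMAXPROCS(procs)
	return procs
}

// readInt reads a single integer from a file, returning -1 on any error.
func readInt(path string) int64 {
	b, err := os.ReadFile(path)
	if err != nil {
		return -1
	}
	n, _ := strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
	return n
}

func main() {
	fmt.Println("GOMAXPROCS =", setMaxProcsFromCgroup())
}
```

In practice a blank import of `go.uber.org/automaxprocs` at the top of `main` achieves the same effect.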

Scenario Mapping: Each of the 30+ new features was classified into P0, P1, or P2 tiers based on DAU, revenue, and dependency criteria. A detailed scenario map captured service calls, interfaces, caches, databases, and message queues, providing a clear view of dependencies.

High‑Availability Architecture: Single points of failure were eliminated by multi‑instance deployments, distributed job scheduling (XXL‑JOB), and resource‑pool health checks. A dual‑active data‑center setup ensures rapid failover within minutes.
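The effect of combining multi‑instance deployment with health checks can be shown in a few lines: a failed replica is simply dropped from rotation rather than becoming a single point of failure. This is a generic sketch, not Bilibili's routing code; the `Instance` type and addresses are hypothetical.

```go
package main

import "fmt"

// Instance is one replica of a multi-instance deployment.
type Instance struct {
	Addr    string
	Healthy bool // result of the last resource-pool health check
}

// pickHealthy filters the routable set: callers send traffic only to
// replicas that passed their last check, so losing one instance
// shifts load instead of taking down the service.
func pickHealthy(all []Instance) []Instance {
	out := make([]Instance, 0, len(all))
	for _, in := range all {
		if in.Healthy {
			out = append(out, in)
		}
	}
	return out
}

func main() {
	pool := []Instance{
		{"10.0.0.1:8080", true},
		{"10.0.0.2:8080", false}, // failed its last check; traffic shifts away
		{"10.0.0.3:8080", true},
	}
	fmt.Println("routable replicas:", len(pickHealthy(pool)))
}
```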

High‑Online Adaptive Protection: To curb the traffic spike produced when large numbers of viewers exit live rooms at once (for example when a match ends), adaptive throttling, randomized request delays, and client‑side flow control were introduced, reducing unnecessary backend load.

Chaos Engineering: A custom chaos platform (built on ChaosBlade) allowed fine‑grained fault injection at interface and user levels, enabling red‑blue drills on core flows such as room entry, gifting, and chat.

Gateway Migration: The legacy Envoy‑based API gateway was replaced with a self‑developed Golang gateway supporting containerization, HPA, fine‑grained rate limiting, and blue‑green deployments. The project is open‑source at https://github.com/go-kratos/gateway .
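Fine‑grained rate limiting in such a gateway is typically a per‑route token bucket. The sketch below illustrates the mechanism only; go-kratos/gateway ships its own middleware, and this standalone implementation is not its code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// TokenBucket is a minimal per-route limiter: tokens refill at a fixed
// rate up to a burst ceiling, and each admitted request spends one.
type TokenBucket struct {
	mu     sync.Mutex
	tokens float64
	max    float64 // burst ceiling
	rate   float64 // tokens added per second
	last   time.Time
}

func NewTokenBucket(rate, burst float64) *TokenBucket {
	return &TokenBucket{tokens: burst, max: burst, rate: rate, last: time.Now()}
}

// Allow refills based on elapsed time, then spends one token if
// available; a caller that gets false would answer HTTP 429.
func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.max {
		b.tokens = b.max
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	tb := NewTokenBucket(10, 5) // 10 req/s with a burst of 5
	allowed := 0
	for i := 0; i < 20; i++ {
		if tb.Allow() {
			allowed++
		}
	}
	fmt.Println("allowed:", allowed) // the burst passes, the rest are limited
}
```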

Performance Testing: A multi‑stage load‑testing plan used the Melloi platform to orchestrate end‑to‑end scenarios, with full‑link traffic isolation to avoid polluting production data. Test results guided capacity planning and scaling decisions.

Scaling Best Practice: Post‑test scaling was calculated using a replica‑estimation formula (shown in the diagram) to ensure sufficient headroom.
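The original formula lives in a diagram that is not reproduced here, so the sketch below is an assumption about its common shape: size for the load‑test peak QPS at the same 40% per‑replica utilization target used during resource estimation, rounding up.

```go
package main

import (
	"fmt"
	"math"
)

// estimateReplicas derives a post-load-test replica count. Assumed
// form (the original diagram is unavailable): divide the measured peak
// QPS by what one replica handles at the 40% utilization target, and
// round up so headroom is never lost to truncation.
func estimateReplicas(peakQPS, perReplicaQPS float64) int {
	return int(math.Ceil(peakQPS / (perReplicaQPS * 0.4)))
}

func main() {
	// Hypothetical numbers: 120k QPS peak, 3k QPS per replica at full load.
	fmt.Println("replicas =", estimateReplicas(120000, 3000))
}
```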

Guarantee Plans: Business‑level plans focused on scenario‑specific mitigations, while technical plans covered gateway and service quota limiting, HPA, hybrid‑cloud spillover, and resource elasticity.

Observability: A unified monitoring dashboard refreshed every minute displayed PCU, SLO compliance, and application saturation with color‑coded health states (red, orange, green).
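The traffic-light coloring reduces to a threshold mapping over saturation. The 0.4 and 0.7 cut-offs below are illustrative assumptions, not the dashboard's actual thresholds.

```go
package main

import "fmt"

// healthColor maps application saturation (0–1) to the dashboard's
// traffic-light state. Cut-offs here are illustrative only.
func healthColor(saturation float64) string {
	switch {
	case saturation < 0.4:
		return "green" // comfortable headroom
	case saturation < 0.7:
		return "orange" // watch closely, consider scaling
	default:
		return "red" // saturated: trigger guarantee plans
	}
}

func main() {
	for _, s := range []float64{0.25, 0.55, 0.9} {
		fmt.Printf("saturation %.2f → %s\n", s, healthColor(s))
	}
}
```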

Alarm Coordination: An in‑house alarm‑collaboration platform provided scene management, subscription, filtering, aggregation, and workflow tracking to streamline incident response.

On‑Site Duty: Roles were defined for on‑site commanders, duty engineers, and cross‑functional responders (business, operations, R&D, infrastructure, SRE, DBA, network).

Conclusion & Outlook: Over five years at Bilibili, the team evolved from reactive scaling to proactive, automated resilience (kernel upgrades, CPUSET removal, resource pooling, containerized gateways, dual‑active sites, HPA). The S12 event proceeded without any forced throttling, degradation, or user‑impacting interventions, achieving the coveted "tea‑time" guarantee. Future work includes deeper cloud‑native adoption, expanded full‑link load testing, and continued refinement of middleware and multi‑active architectures.

Tags: High Availability, Resource Management, Performance Testing, SRE, Live Streaming, Chaos Engineering