How Bilibili Engineered a 1.2 B‑Viewer Live Stream for the LoL World Championship

This article details Bilibili's end‑to‑end technical planning, traffic‑estimation models, and concrete optimizations—including hotspot caching, traffic dispersion, long‑connection isolation, and automated fault‑injection—that enabled the S13 League of Legends finals to serve over 1.2 billion viewers with stable, low‑latency streaming.

Architect

Dec 13, 2023

Background and Objectives

The 13th League of Legends World Championship (S13) was streamed on Bilibili with a target of 1.2 billion cumulative viewers. The technical goal was to keep the service stable and performant under extreme peak‑concurrent‑user (PCU) loads, which peaked at >4.6 billion during the finals.

Business Scenario Map and Core Metric

All functional modules required for S13 (60+ items) were catalogued. The primary performance indicator is PCU – the maximum number of concurrent users in the main live‑room. Supporting scenarios include:

Activity page – promotional pages that drive user participation.

Traffic entry – slots that funnel users into the main room.

Main room – live video, interactive features (gift, chat, special effects) and long‑connection traffic.

Replay page – post‑event video and discussion.

Traffic Estimation Model

Business metrics are converted to technical QPS/TPS using the following relationship:

Target QPS = Exposure × Conversion₁ × … × Conversionₙ ÷ Distribution‑time
          = PCU × Conversion₁ × … × Conversionₙ ÷ Distribution‑time

The formula is applied to each scenario to predict request rates.
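As a rough illustration of the relationship above, the following Python sketch multiplies an exposure figure through a chain of conversion rates and spreads the result over a time window. The function name and the sample numbers are assumptions made for illustration; they are not figures from the S13 event.

```python
from math import prod


def target_qps(exposure: float, conversions: list[float], distribution_seconds: float) -> float:
    """Exposure x product of conversion rates, divided by the distribution window."""
    if distribution_seconds <= 0:
        raise ValueError("distribution_seconds must be positive")
    return exposure * prod(conversions) / distribution_seconds


# Hypothetical inputs: 1,000,000 exposures, a 20% click rate and a 50%
# follow-through rate, with the resulting requests arriving over 10 seconds.
estimate = target_qps(1_000_000, [0.2, 0.5], 10)
print(f"estimated QPS: {estimate:.0f}")
```

Shortening the distribution window raises the estimate proportionally, which is why the arrival window matters as much as the conversion rates.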

Technical Optimizations

Room‑Entry (Join) Scenario

When a user clicks a splash screen, push notification, or other entry point, the client requests the stream URL, room metadata, and historical chat. Total join QPS is the sum of the QPS from all entry channels. For a full push (a notification sent to the entire user base), the estimate is:

QPS = TotalUsers × DeliveryRate × ClickRate ÷ PushDuration

Rooms whose PCU exceeds a threshold are cached in a “hot‑room” memory store, raising cache‑hit rates and preventing hotspot overload.
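The article does not describe how the hot-room store is built. The sketch below is one possible shape, assuming an in-process dictionary with a short TTL placed in front of a slower loader. The class name `HotRoomCache`, the `loader` callback, the TTL value, and the `report_pcu` method are all hypothetical.

```python
import time


class HotRoomCache:
    """Serve rooms above a PCU threshold from process memory (illustrative only)."""

    def __init__(self, loader, pcu_threshold, ttl_seconds=1.0, clock=time.monotonic):
        self._loader = loader            # hypothetical fetch from the shared cache or DB
        self._threshold = pcu_threshold
        self._ttl = ttl_seconds
        self._clock = clock
        self._hot = set()                # room ids currently treated as hot
        self._local = {}                 # room_id -> (expires_at, value)

    def report_pcu(self, room_id, pcu):
        """Promote or demote a room based on its latest PCU reading."""
        if pcu > self._threshold:
            self._hot.add(room_id)
        else:
            self._hot.discard(room_id)
            self._local.pop(room_id, None)

    def get(self, room_id):
        if room_id not in self._hot:
            return self._loader(room_id)  # cold rooms bypass the local store
        now = self._clock()
        entry = self._local.get(room_id)
        if entry is not None and entry[0] > now:
            return entry[1]               # local hit: no downstream request
        value = self._loader(room_id)
        self._local[room_id] = (now + self._ttl, value)
        return value
```

With this layout, each process refreshes a hot room at most once per TTL, so downstream load for that room scales with the number of processes rather than the number of viewers.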

Lucky‑Moment (天选时刻)

A pop‑up invites users to participate. The initial QPS estimate is:

QPS = PCU × ParticipationClickRate

Because PCU can reach millions, the write path becomes a bottleneck. The mitigation spreads the pop‑up display over a configurable “dispersion time”, reducing instantaneous QPS to:

QPS = PCU × ParticipationClickRate ÷ DispersionTime

This flattens the spike into a steady request rate across the dispersion window, with little visible effect on user experience.
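One way to realize dispersion, assumed here rather than taken from the article, is for each client to wait a random delay drawn uniformly from the dispersion window before showing the pop-up. The small simulation below compares the busiest one-second bucket with and without that delay. The function names and the sample PCU and click rate are hypothetical.

```python
import random
from collections import Counter


def popup_delay(dispersion_seconds: float, rng: random.Random) -> float:
    """Random delay in seconds before a client shows the pop-up."""
    return rng.uniform(0.0, dispersion_seconds)


def peak_requests_per_second(pcu: int, click_rate: float,
                             dispersion_seconds: float, seed: int = 0) -> int:
    """Simulate participation requests and return the busiest one-second bucket."""
    rng = random.Random(seed)
    buckets = Counter()
    for _ in range(pcu):
        if rng.random() < click_rate:  # this user decides to participate
            buckets[int(popup_delay(dispersion_seconds, rng))] += 1
    return max(buckets.values(), default=0)


# Hypothetical room: 20,000 concurrent users, 30% participation.
print("no dispersion :", peak_requests_per_second(20_000, 0.3, 0))
print("10 s window   :", peak_requests_per_second(20_000, 0.3, 10))
```

The total number of writes is unchanged; only the peak per second falls, roughly in proportion to the window length.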

Long‑Connection

Real‑time chat and other interactive features rely on persistent connections. The pressure on edge nodes is:

Pressure = N × PCU

where N is the number of simultaneous broadcast events. Bilibili isolates main‑room traffic from other rooms, monitors each broadcast’s QPS and payload size, and applies dedicated rate limits to balance bandwidth cost against user experience.
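The article does not say which rate-limiting algorithm is used. As a hedged sketch, a token bucket per broadcast type is a common choice: the main room gets its own bucket, so a burst elsewhere cannot consume its budget. `TokenBucket`, `fanout_pressure`, and the rates in the usage example are illustrative assumptions.

```python
import time


def fanout_pressure(broadcast_events: int, pcu: int) -> int:
    """Messages pushed to clients: every broadcast event reaches every connected user."""
    return broadcast_events * pcu


class TokenBucket:
    """Classic token bucket: refills at rate_per_s, holds at most `burst` tokens."""

    def __init__(self, rate_per_s: float, burst: float, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Hypothetical isolation: separate budgets for the main room and all other rooms.
limiters = {"main_room": TokenBucket(50, 100), "other_rooms": TokenBucket(10, 20)}
if limiters["main_room"].allow():
    print("broadcast sent, fan-out =", fanout_pressure(1, 1_000_000))
```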

Scatter (Exit) Scenario

After a match, users either return to the entry page (click‑through) or swipe to another live room (swipe‑conversion). Both generate sharp spikes similar to a cinema audience exiting simultaneously. Mitigation disables automatic refresh on the entry page during overload and pre‑caches candidate next‑rooms based on recommendation results, improving cache‑hit rates.
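The two mitigations can be sketched as a pair of small helpers. This is a minimal illustration under assumed names: the overload ratio, the `loader` callback, and the dictionary cache are hypothetical and stand in for whatever the production services use.

```python
def should_auto_refresh(current_qps: float, capacity_qps: float,
                        overload_ratio: float = 0.8) -> bool:
    """Keep auto-refresh on only while load stays below a fraction of capacity."""
    return current_qps < capacity_qps * overload_ratio


def prewarm(recommended_room_ids, cache: dict, loader, limit: int = 3) -> None:
    """Load the top recommended next rooms into the cache before users swipe to them."""
    for room_id in recommended_room_ids[:limit]:
        if room_id not in cache:
            cache[room_id] = loader(room_id)


# Hypothetical usage just before a match ends.
cache = {}
prewarm([101, 102, 103, 104], cache, loader=lambda rid: {"room_id": rid})
print(sorted(cache), should_auto_refresh(current_qps=900, capacity_qps=1000))
```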

Global Traffic Monitoring

Because a downstream service may be invoked by multiple scenarios, traffic is monitored both at the interface level and the whole‑event level. This informs capacity planning and resource procurement before the event.
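To make the interface-level view concrete, the sketch below sums the estimated QPS that each scenario sends to each interface, so a shared downstream service is sized for its combined load. The scenario and interface names are hypothetical.

```python
from collections import defaultdict


def aggregate_interface_qps(scenario_calls: dict) -> dict:
    """Sum per-interface QPS across scenarios that may peak at the same time."""
    totals = defaultdict(float)
    for calls in scenario_calls.values():
        for interface, qps in calls.items():
            totals[interface] += qps
    return dict(totals)


# Hypothetical estimates: both scenarios call room.info.
estimates = {
    "join": {"room.info": 1000.0, "chat.history": 500.0},
    "scatter": {"room.info": 300.0},
}
print(aggregate_interface_qps(estimates))
```

Summing assumes the scenarios can peak simultaneously, which is the conservative choice for capacity planning.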

Technical‑Chain Mapping (Advisor)

Mapping a scenario by hand took about two days: enumerating user actions, identifying the request interfaces, and drilling down through the dependency chains. To speed this up, the Advisor platform captured packet traces, automatically generated dependency graphs, and calculated amplification factors. The visualized chain shows each interface’s QPS/TPS and amplification multiplier.
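Advisor's internals are not described in the article. The sketch below only illustrates what an amplification factor means: given a dependency graph in which each edge records how many downstream calls one upstream request triggers, the entry QPS is propagated to every interface. The graph format and interface names are assumptions, and the graph is assumed to be acyclic.

```python
from collections import defaultdict


def downstream_qps(graph: dict, entry: str, entry_qps: float) -> dict:
    """Propagate entry QPS through an acyclic call graph.

    graph maps caller -> list of (callee, calls_per_request).
    """
    totals = defaultdict(float)

    def visit(node: str, qps: float) -> None:
        totals[node] += qps
        for callee, calls_per_request in graph.get(node, []):
            visit(callee, qps * calls_per_request)

    visit(entry, entry_qps)
    return dict(totals)


# Hypothetical chain: one room entry triggers two chat-history reads,
# and each of those looks up ten user profiles.
graph = {
    "room.enter": [("room.info", 1), ("chat.history", 2)],
    "chat.history": [("user.profile", 10)],
}
load = downstream_qps(graph, "room.enter", 100.0)
amplification = {name: qps / 100.0 for name, qps in load.items()}
print(load)
print(amplification)
```

In this example `user.profile` has an amplification multiplier of 20, so it needs far more headroom than the entry interface itself.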

Fault Injection (Fault)

Fault injection distinguishes strong from weak dependencies. Strong dependencies receive detection mechanisms and fallback plans; weak dependencies are allowed to degrade gracefully. The process first isolates the client‑facing interfaces, then injects failures automatically through the Fault platform to validate the entire business‑scenario flow.
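The classification idea can be illustrated with a toy harness: fail one dependency at a time and check whether the scenario still completes. This is a sketch of the general technique, not of the Fault platform's actual interface. The `classify_dependencies` helper, the scenario function, and the dependency names are hypothetical.

```python
def classify_dependencies(scenario, dependencies: dict) -> dict:
    """Fail each dependency in turn; a scenario that still succeeds marks it weak."""
    result = {}
    for name in dependencies:
        def failing(*args, **kwargs):
            raise RuntimeError(f"injected fault in {name}")

        injected = {**dependencies, name: failing}  # replace only one dependency
        try:
            scenario(injected)
            result[name] = "weak"
        except Exception:
            result[name] = "strong"
    return result


def enter_room(deps: dict) -> dict:
    """Hypothetical scenario: the stream URL is required, chat history is optional."""
    page = {"stream": deps["stream_url"]()}
    try:
        page["history"] = deps["chat_history"]()
    except Exception:
        page["history"] = []  # degrade gracefully
    return page


deps = {"stream_url": lambda: "placeholder-url", "chat_history": lambda: ["hi"]}
print(classify_dependencies(enter_room, deps))
```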

Full‑Link Stress Test (Melloi)

Three typical bottlenecks were addressed:

Hotspot keys (room‑id/anchor‑id caches).

Cache‑penetration (cold users during the event).

Consumption backlog (high‑frequency reward‑triggered events).

Advisor‑derived traffic profiles fed the Melloi platform, enabling rapid dataset preparation and automated health‑check of each layer’s metrics after the test.

Pre‑Plan SOP

For each identified strong dependency, an SOP template defines detection (≤1 min), localization (≤5 min), and recovery (≤10 min) steps, and assigns a responsible owner to each.
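A template of this kind can be represented as a small record with one time budget per phase. The sketch below assumes the three limits apply to each phase separately, which the article does not state explicitly. The field names, dependency name, and owner are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SopTemplate:
    dependency: str
    owner: str
    detect_budget_s: int = 60     # detection within 1 minute
    locate_budget_s: int = 300    # localization within 5 minutes
    recover_budget_s: int = 600   # recovery within 10 minutes

    def breached_phases(self, detect_s: float, locate_s: float, recover_s: float) -> list:
        """Return the phases of a drill or incident that exceeded their budget."""
        observed = {
            "detect": (detect_s, self.detect_budget_s),
            "locate": (locate_s, self.locate_budget_s),
            "recover": (recover_s, self.recover_budget_s),
        }
        return [phase for phase, (took, budget) in observed.items() if took > budget]


sop = SopTemplate(dependency="room-info-service", owner="oncall-a")
print(sop.breached_phases(detect_s=45, locate_s=400, recover_s=500))
```

Running drills against the template shows which phase needs better tooling before the event.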

Change Control (ChangePilot)

During the event, the ChangePilot platform enforced strict change‑gate policies. Non‑critical changes required email approval with risk analysis; critical‑day changes used a sealed‑network policy with a green‑channel for emergencies.

In‑Event Monitoring

Real‑time dashboards based on SLOs displayed service availability, saturation, PCU, QPS, P90 latency, and rate‑limit status. Component‑level capacity water‑marks (cache, DB, MQ) were visualised with tiered thresholds to trigger early alerts.
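Tiered thresholds can be expressed as an ordered list of utilization levels checked from the most severe downward. The threshold values and level names below are placeholders, not the ones used during S13.

```python
# Hypothetical tiers, ordered from most to least severe.
DEFAULT_TIERS = ((0.90, "critical"), (0.75, "warning"), (0.60, "notice"))


def watermark_level(used: float, capacity: float, tiers=DEFAULT_TIERS) -> str:
    """Map a component's utilization to the highest alert tier it has reached."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    ratio = used / capacity
    for threshold, level in tiers:
        if ratio >= threshold:
            return level
    return "ok"


# Hypothetical readings for cache, DB, and MQ.
for component, used, capacity in [("cache", 95, 100), ("db", 80, 100), ("mq", 50, 100)]:
    print(component, watermark_level(used, capacity))
```

The lower tiers exist to give operators time to scale out or shed load before the critical level is reached.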

Outlook

Experience from S13 will be abstracted into a platform‑level framework for future large‑scale live events, covering traffic‑modeling, hotspot mitigation, automated chain extraction, fault‑injection, and SOP generation.

Conclusion

S13 achieved its business target of >1.2 billion viewers, with a peak‑concurrent‑user count exceeding 4.6 billion during the finals. The end‑to‑end planning, traffic‑modeling, hotspot‑caching, dispersion, isolation, fault‑injection, stress‑testing, SOPs, and change‑control together ensured a reliable massive live‑stream.
