How Bilibili SRE Guarantees Million‑User Live Events: Strategies, Tools, and Lessons

This article describes how Bilibili's SRE team supports large‑scale live events. It covers the background, activity scenarios, resource planning, performance testing, chaos‑engineering drills, technical safeguards (DCDN, SLB, WAF, PaaS, cache, and database), pre‑plan capabilities, post‑event review, and future outlook, and shows how systematic capacity management and automated resilience practices keep events with tens of millions of concurrent viewers stable.

Efficient Ops

Jun 19, 2022

Background

Bilibili hosts multiple large‑scale activities each year (e.g., New Year celebrations, "Most Beautiful Night", LOL World Finals, e‑commerce promotions). The most popular events attract tens of millions of concurrent viewers, requiring robust SRE support to maintain stability without infrastructure or service failures.

Activity Scenarios

Three representative cases illustrate typical challenges:

Unplanned promotional links from external apps (WeChat, Toutiao) can cause sudden traffic spikes that are hard to predict.

Planned in‑app push notifications to all users can overwhelm services when many users click the activity link at the same time.

Post‑event push notifications redirect viewers to non‑activity services, which can overload them and cause outages.

SRE therefore collaborates closely with operations and product teams to gather detailed activity information, including format, key scenes, estimated online users, external links, push plans, post‑event behavior, and timelines.

Resource Preparation

Basic Resources

Core infrastructure resources to be verified include DNS, dynamic CDN (DCDN), static CDN bandwidth, live‑chat bandwidth, DDoS protection, L4/L7 load balancers, WAF, IDC‑to‑cloud and inter‑IDC dedicated bandwidth, NAT bandwidth, network hardware bandwidth, and logging/monitoring.

Logging and monitoring are critical for real‑time incident response; network hardware bandwidth must be sufficient to handle sudden traffic bursts.

Business Resources

Application‑level resources include PaaS (container) capacity, IaaS (bare‑metal) capacity, cache, message queue, KV store, and database resources. Capacity management systems provide visibility into usage levels and buffer capacity, enabling rapid procurement or hybrid‑cloud scaling.
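The headroom check that such a capacity management system performs can be sketched as follows. This is a minimal illustration, not Bilibili's implementation: the function name, the 25% buffer ratio, and the sample numbers are all assumptions.

```python
def headroom_report(capacity: float, current_peak: float,
                    expected_peak: float, buffer_ratio: float = 0.25) -> dict:
    """Compare provisioned capacity with the expected event peak plus a buffer.

    All quantities share one unit (for example QPS or CPU cores).
    buffer_ratio is a hypothetical safety margin, not a value from the article.
    """
    required = expected_peak * (1 + buffer_ratio)
    shortfall = max(0.0, required - capacity)
    return {
        "utilization_now": current_peak / capacity,
        "required": required,
        "shortfall": shortfall,
        "needs_scaling": shortfall > 0,
    }


# Hypothetical pool: 1000 units provisioned, 400 used today, 900 expected.
report = headroom_report(capacity=1000, current_peak=400, expected_peak=900)
print(report)
```

A non-zero shortfall is the signal to start procurement or hybrid‑cloud scaling early, before the performance-testing rounds begin.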

Performance Testing & Drills

Each activity undergoes up to three rounds of performance testing:

Identify bottlenecks using existing resources.

Test after resource delivery and service optimizations to meet activity goals.

Validate final safeguards and capacity plans before launch (optional).

Key focus areas per round include testing tool stability, service bottlenecks, middleware limits, and end‑to‑end chain performance. SRE also coordinates cross‑team tests for shared services (search, payment, etc.).

Chaos‑Engineering Drills

Since 2019, Bilibili has used an internal ChaosBlade‑based platform to inject failures at the node, hardware, upstream/downstream, and middleware levels. More than 3,000 drills have uncovered over 200 hidden issues, improving overall service reliability.

Technical Guarantees

DCDN

Cacheable interfaces (e.g., live‑chat, gift lists) can be cached at DCDN to reduce origin load. DCDN also supports multi‑active, multi‑region traffic steering for same‑city active‑active services.
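The effect of edge caching on origin load can be illustrated with a small time-to-live cache. This is a sketch of the general idea only; the class name, the one-second TTL, and the key are assumptions and do not describe the DCDN product's configuration.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Serve repeated requests from memory until the entry expires."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}
        self.origin_fetches = 0  # how often the origin was actually called

    def get(self, key: str, fetch_from_origin: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: origin is not touched
        value = fetch_from_origin()
        self.origin_fetches += 1
        self._store[key] = (now + self.ttl, value)
        return value


cache = TTLCache(ttl_seconds=1.0)
for _ in range(1000):
    cache.get("gift_list", lambda: ["rose", "rocket"])
print(cache.origin_fetches)
```

Even a very short TTL collapses many identical requests into a single origin fetch per expiry interval, which is why interfaces whose content is the same for every viewer are good caching candidates.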

Layer‑7 SLB

A custom OpenResty‑based SLB provides global rate limiting and automatic failover for active‑active services, protecting API gateways and backend services from overload.
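The article does not say which rate-limiting algorithm the SLB uses. A token bucket is one common choice, sketched below in Python rather than OpenResty/Lua. The class name and parameters are assumptions, and a truly global limit would need the counter held in shared storage instead of in one process.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)  # start with a full bucket
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should reject, e.g. with HTTP 429


bucket = TokenBucket(rate=100, burst=200)
accepted = sum(bucket.allow() for _ in range(500))
print(accepted)
```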

WAF

The WAF applies per‑IP rate limiting and blocks malicious IPs based on request characteristics (User‑Agent, Referer) to mitigate abuse.
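A per‑IP limit combined with a simple request-characteristic check might look like the sketch below. The sliding-window approach, the class name, the thresholds, and the blocked User-Agent substring are illustrative assumptions, not the WAF's actual rules.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable


class PerIPLimiter:
    """Sliding-window request limit per client IP, plus a User-Agent blocklist."""

    def __init__(self, max_requests: int, window_seconds: float,
                 blocked_ua_substrings: Iterable[str] = ()):
        self.max_requests = max_requests
        self.window = window_seconds
        self.blocked = [s.lower() for s in blocked_ua_substrings]
        self.hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str, user_agent: str, now: float) -> bool:
        ua = user_agent.lower()
        if any(s in ua for s in self.blocked):
            return False  # blocked on request characteristics
        q = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while q and q[0] <= now - self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return False  # this IP exceeded its quota
        q.append(now)
        return True


limiter = PerIPLimiter(max_requests=2, window_seconds=10,
                       blocked_ua_substrings=["python-requests"])
print(limiter.allow("203.0.113.7", "Mozilla/5.0", now=0.0))
```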

PaaS (K8s)

Horizontal Pod Autoscaler (HPA) scales services based on CPU/Memory/GPU metrics for gradual traffic growth, while Vertical Pod Autoscaler (VPA) reclaims resources when overall pool usage is high. Hybrid‑cloud nodes can be provisioned within minutes for burst capacity.
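For reference, the Kubernetes documentation gives the HPA's core calculation as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default). The sketch below reproduces that formula; the function name, the replica bounds, and the sample values are assumptions for illustration.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 100, tolerance: float = 0.1) -> int:
    """Replica count an HPA-style controller would aim for."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical service: 10 pods at 90% CPU with a 60% target.
print(desired_replicas(10, current_metric=90, target_metric=60))
```

Because the calculation reacts to metrics that have already risen, it suits gradual growth. Sudden spikes such as push notifications are better handled by scaling out before the event.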

Cache

Capacity planning for Redis/Memcached includes pre‑expansion, hot‑key monitoring, and cautious slot migration during urgent scaling.
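Hot‑key monitoring can be approximated by sampling key accesses and flagging keys that take a disproportionate share of traffic. The sketch below assumes access samples are already collected in a list; the function name, the 20% threshold, and the key names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def find_hot_keys(accessed_keys: Iterable[str], threshold_share: float = 0.2,
                  min_total: int = 100) -> List[Tuple[str, int]]:
    """Return (key, count) pairs whose share of sampled accesses is too high."""
    counts = Counter(accessed_keys)
    total = sum(counts.values())
    if total < min_total:
        return []  # sample too small to draw conclusions
    return [(key, n) for key, n in counts.most_common()
            if n / total >= threshold_share]


# Hypothetical sample: two live rooms dominate an otherwise flat distribution.
sample = ["room:1"] * 60 + ["room:2"] * 25 + [f"user:{i}" for i in range(15)]
print(find_hot_keys(sample))
```

Keys flagged this way are candidates for local caching or for spreading across replicas, which avoids an emergency slot migration during the event.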

Database

Resilience measures include automatic read‑replica degradation when replication lag exceeds 120 seconds, SQL blacklisting, and cross‑region read load balancing.
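The replica degradation rule can be sketched as a filter over the read pool. Only the 120‑second threshold comes from the article. The data class, the function name, and the fallback to the primary when no replica is healthy are assumptions.

```python
from dataclasses import dataclass
from typing import List

MAX_LAG_SECONDS = 120  # threshold stated in the article


@dataclass
class Replica:
    name: str
    lag_seconds: float


def readable_targets(primary: str, replicas: List[Replica],
                     max_lag: float = MAX_LAG_SECONDS) -> List[str]:
    """Return the nodes that may serve reads right now."""
    healthy = [r.name for r in replicas if r.lag_seconds <= max_lag]
    # Assumed fallback: if every replica is lagging, read from the primary.
    return healthy or [primary]


pool = [Replica("replica-a", 3.0), Replica("replica-b", 340.0)]
print(readable_targets("primary", pool))
```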

Monitoring Dashboards

Unified dashboards display business metrics, infrastructure capacity, middleware health, and real‑time alerts, serving as the central view during live events.

Pre‑Plan Capability

Pre‑defined runbooks cover traffic shifting, service degradation, rate limiting, rollback, restart, and scaling. Prioritization considers failure probability, activation speed, impact, complexity, and idempotence.

Post‑Event Review

After each activity, SRE conducts a comprehensive post‑mortem covering goals, process review, data summary (online users, resource usage), problem analysis, and reflection. Checklists ensure recurring issues are addressed in future templates.

Outlook

Continuous improvement aims to automate manual steps (e.g., resource inventory) and integrate them into an activity‑assurance platform, further enhancing efficiency for future mega‑events such as the upcoming "Most Beautiful Night" and S12 finals.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
