
SRE Practices for Large‑Scale Event Assurance at Bilibili

Bilibili's SRE team prepares for large-scale online events by gathering activity details; provisioning DNS, CDN, network, and compute resources; running multi-stage performance tests and chaos-engineering drills; applying layered traffic controls; maintaining a checklist of historical issues; executing predefined contingency plans; and holding post-mortems that drive further automation and reliability work.

Bilibili Tech

Jun 14, 2022

Background : Bilibili hosts several large-scale online events each year (e.g., the New Year Gala, "Most Beautiful Night", the LoL World Finals, the 626 Shopping Festival, the 919 Flash Sale). The S11 finals drew more than ten million concurrent viewers and ran without infrastructure or service failures. The SRE team is responsible for keeping these events stable and for handling unexpected traffic spikes.

Event Scenarios : Real cases illustrate the risks: promotional links that suddenly appear in external apps, push notifications sent to all online users at once, and post-event traffic surges that overload services not covered by the activity plan. SRE therefore collects detailed information for each activity: its format, key scenarios, estimated online users, external links, push schedule, expected post-event user behavior, and timeline.

Resource Preparation :

Basic resources – DNS, dynamic CDN, static CDN bandwidth, live‑chat bandwidth, DDoS protection, L4/L7 load balancers, WAF, IDC and cloud bandwidth, NAT, network hardware bandwidth, logging/monitoring.

Business resources – PaaS (container) resources, IaaS (bare‑metal) resources, cache, MQ, KV store, DB storage.

A capacity management system predicts the required resources from historical data and the estimated number of online users.
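
The source does not describe how the capacity management system makes its prediction. As a minimal sketch of the idea, assuming peak request rate scales linearly with online users, the following Python function extrapolates from one past event. The function name, its parameters, the 50% headroom, and all sample numbers are hypothetical.

```python
import math


def estimate_instances(expected_online_users: int,
                       past_online_users: int,
                       past_peak_qps: float,
                       qps_per_instance: float,
                       headroom: float = 0.5) -> int:
    """Estimate the instance count for an upcoming event.

    Assumption (not from the source): peak QPS grows linearly with the
    number of online users, so a past event can be scaled up directly.
    """
    if past_online_users <= 0 or qps_per_instance <= 0:
        raise ValueError("past_online_users and qps_per_instance must be positive")
    # Scale the historical peak to the expected audience size.
    expected_peak_qps = expected_online_users * past_peak_qps / past_online_users
    # Add headroom for unplanned spikes, then round up to whole instances.
    return math.ceil(expected_peak_qps * (1 + headroom) / qps_per_instance)


# Hypothetical numbers: a past event with 5M users peaked at 100k QPS,
# each instance sustained 1,500 QPS in load tests, 10M users are expected.
needed = estimate_instances(10_000_000, 5_000_000, 100_000, 1_500)
print(needed)
```

A real system would likely fit per-service models over many events rather than a single ratio; the linear rule is only the simplest starting point.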

Performance Testing & Drills :

Three rounds of performance testing: (1) identify bottlenecks on existing resources, (2) test after resource provisioning and service optimization, (3) final validation after all safeguards are in place.

Focus areas across the rounds include the stability of the testing tools, bottleneck identification, middleware performance, rate-limit configuration, and verification of auto-scaling policies (HPA, VPA).

Chaos engineering drills (node failure, hardware contention, upstream/downstream service failures, middleware failures) are performed using an internal ChaosBlade‑based platform, with over 3,000 drills executed.
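
Verifying auto-scaling policies, one of the focus areas above, means knowing what replica count to expect at a given load. The sketch below uses the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current × currentMetric / targetMetric), with no change when the ratio is within a tolerance (0.1 by default). The function name, parameters, and sample numbers are illustrative, and the source does not state that Bilibili's platform applies exactly this rule.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 100,
                     tolerance: float = 0.1) -> int:
    """Replica count an HPA-style controller would aim for."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: no scaling action.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # The controller never leaves the configured [min, max] range.
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical load-test observation: 10 pods at 90% CPU, target 60%.
print(desired_replicas(10, 90.0, 60.0))
# Same load, but the policy caps the service at 12 pods.
print(desired_replicas(10, 90.0, 60.0, max_replicas=12))
```

During a load-test round, comparing this expected value with the replica count actually observed is one way to notice a missing policy or a `max_replicas` ceiling set too low for the event.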

Historical Review ("Learn from History") : A checklist of past issues (e.g., missed web‑chain testing, VPA disabled during events, missing HPA on critical services) is maintained to prevent recurrence.
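
A checklist like this is easier to enforce when it can be run automatically before each event. The following is a minimal sketch of that idea in Python. The `ServiceConfig` fields, the two checks, and the service names are hypothetical; the source does not describe how the checklist is stored or evaluated.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ServiceConfig:
    """Hypothetical snapshot of one service's pre-event state."""
    name: str
    critical: bool
    hpa_enabled: bool
    load_tested: bool


# Each checklist entry pairs a description with a predicate that
# returns True when the service violates the rule.
Check = Tuple[str, Callable[[ServiceConfig], bool]]

CHECKLIST: List[Check] = [
    ("critical service has no HPA",
     lambda s: s.critical and not s.hpa_enabled),
    ("service was not covered by load testing",
     lambda s: not s.load_tested),
]


def run_checklist(services: List[ServiceConfig]) -> List[str]:
    """Return one finding per violated checklist entry."""
    findings = []
    for svc in services:
        for description, violated in CHECKLIST:
            if violated(svc):
                findings.append(f"{svc.name}: {description}")
    return findings


services = [
    ServiceConfig("live-room", critical=True, hpa_enabled=False, load_tested=True),
    ServiceConfig("web-gateway", critical=True, hpa_enabled=True, load_tested=False),
    ServiceConfig("comment", critical=False, hpa_enabled=False, load_tested=True),
]
for finding in run_checklist(services):
    print(finding)
```

Adding a lesson from a new post-mortem then amounts to appending one entry to `CHECKLIST`.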

Technical Guarantees :

DCDN – caching and multi‑region traffic splitting.

Layer‑7 SLB – global rate limiting, automatic failover for multi‑active services.

WAF – IP rate limiting and malicious IP blocking.

PaaS – Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) for dynamic scaling; hybrid‑cloud fallback when on‑prem capacity is exhausted.

Cache – capacity planning, hot‑key monitoring.

DB – automatic degradation on high replication lag, SQL blacklist, cross-region read load balancing.

Monitoring dashboards – end‑to‑end visibility of business metrics, infrastructure capacity, and middleware health during the event.
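
Several of the layers above rely on rate limiting (global limits at the Layer-7 SLB, per-IP limits at the WAF). The source does not name the algorithm used. As an illustration only, here is a token bucket, a common choice for this kind of control, sketched in Python. The class name, parameters, and the injectable clock are assumptions made for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; return whether the request passes."""
        now = self.clock()
        # Refill according to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Deterministic demo with a fake clock: 2 requests/second, burst of 2.
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(3)]   # third request is rejected
fake_now[0] = 0.5                            # half a second refills one token
after_wait = [bucket.allow(), bucket.allow()]
print(burst, after_wait)
```

A per-IP limit would keep one bucket per client address, while a global limit would share a single bucket (or a distributed counter) across all instances of the load balancer.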

Pre‑plan (Contingency) Capabilities : Pre‑defined response plans (traffic shifting, degradation, rate limiting, rollback, restart, scaling) are prioritized by failure probability, activation speed, impact, complexity, and idempotence.
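
The source lists the criteria but not how they are combined. One simple possibility, shown here as a sketch, is to sort plans lexicographically: most likely failure first, then fastest to activate, then lowest impact and complexity, with idempotent plans preferred on ties. The `ContingencyPlan` fields, the 1-5 scales, the ordering of criteria, and the sample values are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContingencyPlan:
    """Hypothetical description of one predefined response plan."""
    name: str
    failure_probability: float  # chance the covered failure occurs (0..1)
    activation_seconds: int     # time needed to put the plan into effect
    impact: int                 # side effects of running the plan, 1 (low) to 5
    complexity: int             # operational difficulty, 1 (low) to 5
    idempotent: bool            # safe to execute more than once


def prioritize(plans: List[ContingencyPlan]) -> List[ContingencyPlan]:
    """Order plans so the ones to rehearse and reach for first come first."""
    return sorted(
        plans,
        key=lambda p: (-p.failure_probability,  # likelier failures first
                       p.activation_seconds,    # then faster activation
                       p.impact,                # then smaller side effects
                       p.complexity,            # then simpler execution
                       not p.idempotent),       # idempotent wins ties
    )


plans = [
    ContingencyPlan("rollback", 0.2, 300, 3, 3, False),
    ContingencyPlan("scale_out", 0.6, 120, 1, 2, True),
    ContingencyPlan("traffic_shift", 0.3, 30, 2, 2, True),
    ContingencyPlan("rate_limit", 0.6, 10, 2, 1, True),
]
ordered_names = [p.name for p in prioritize(plans)]
print(ordered_names)
```

A weighted score would be an equally reasonable alternative; the lexicographic order is used here only because it needs no invented weights.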

Post‑mortem & Improvement :

After each event, SRE conducts a structured post‑mortem covering goals, process review, data summary (online users, resource usage), problem analysis, and reflection.

Findings feed back into the checklist and activity templates to continuously improve the assurance workflow.

Conclusion & Outlook : Taking part in event assurance deepens the SRE team's understanding of the whole system, builds individual and team capability, and exposes work that should be automated (e.g., the manual resource inventory). Future work aims to move more of these processes into an activity-assurance platform to further increase efficiency.
