SRE Practices for Large‑Scale Event Assurance at Bilibili
Bilibili's SRE team safeguards large-scale online events through a repeatable workflow: gathering activity details; provisioning DNS, CDN, network and compute resources; running multi-stage performance tests and chaos-engineering drills; applying layered traffic controls; maintaining a checklist of historical issues; executing predefined contingency plans; and holding post-mortems that drive further automation and reliability work.
Background : Bilibili hosts multiple large‑scale online events each year (e.g., New Year Gala, "Most Beautiful Night", LOL World Finals, 626 Shopping Festival, 919 Flash Sale). The S11 finals attracted over ten million concurrent viewers and ran without infrastructure or service failures. The SRE team is responsible for ensuring stability and handling unexpected traffic spikes.
Event Scenarios : Several real cases illustrate the challenges: promotional links that suddenly appear in external apps, push notifications sent to all online users at once, and post-event traffic surges that overload services not covered by the activity plan. SRE therefore collects detailed activity information up front, including the event format, key scenarios, estimated online users, external links, push schedule, expected post-event user behavior, and timeline.
Resource Preparation :
Basic resources – DNS, dynamic CDN, static CDN bandwidth, live‑chat bandwidth, DDoS protection, L4/L7 load balancers, WAF, IDC and cloud bandwidth, NAT, network hardware bandwidth, logging/monitoring.
Business resources – PaaS (container) resources, IaaS (bare‑metal) resources, cache, MQ, KV store, DB storage.
A capacity management system predicts the required resources from historical data and the estimated number of online users.
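The prediction step can be illustrated with a minimal sketch. It assumes a simple linear model in which load per user at a past event carries over to the next one; the function name, parameters, and the 25% headroom are hypothetical and are not taken from Bilibili's capacity management system.

```python
import math


def estimate_instances(expected_users, past_peak_users, past_peak_qps,
                       qps_per_instance, headroom=0.25):
    """Estimate how many instances a service needs for an upcoming event.

    Assumption: QPS scales linearly with online users, so the per-user
    load observed at a past event is reused for the new estimate.
    """
    if past_peak_users <= 0 or qps_per_instance <= 0:
        raise ValueError("past_peak_users and qps_per_instance must be positive")
    qps_per_user = past_peak_qps / past_peak_users
    expected_qps = expected_users * qps_per_user
    # Add headroom so the estimate absorbs spikes the forecast missed.
    return math.ceil(expected_qps * (1 + headroom) / qps_per_instance)


# A past event peaked at 1M users and 250k QPS; the next one expects 2M users.
print(estimate_instances(2_000_000, 1_000_000, 250_000, qps_per_instance=500))
```

A real system would likely fit the estimate per service and per resource type (bandwidth, cache, DB), since not every dependency scales linearly with online users.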
Performance Testing & Drills :
Three rounds of performance testing: (1) identify bottlenecks on existing resources, (2) test after resource provisioning and service optimization, (3) final validation after all safeguards are in place.
Focus areas in each round include the stability of the load-testing tools themselves, bottleneck identification, middleware performance, rate-limit configuration, and verification of auto-scaling policies (HPA, VPA).
Chaos engineering drills (node failure, hardware contention, upstream/downstream service failures, middleware failures) are performed using an internal ChaosBlade‑based platform, with over 3,000 drills executed.
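When verifying auto-scaling policies during a performance test, it helps to know how many replicas the autoscaler should settle on for a given load. The sketch below reproduces the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler (ratio of current to target metric, a default 10% tolerance, and clamping to the configured bounds). Whether Bilibili's PaaS uses these exact settings is not stated in the source, so treat the parameter values as assumptions.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas, max_replicas, tolerance=0.1):
    """Expected replica count per the documented Kubernetes HPA formula:
    desired = ceil(current_replicas * current_metric / target_metric).
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Within tolerance the autoscaler leaves the replica count unchanged.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # The result is always clamped to the configured min/max bounds.
    return max(min_replicas, min(max_replicas, desired))


# 10 pods at 90% average CPU against a 60% target should scale to 15 pods.
print(hpa_desired_replicas(10, 90, 60, min_replicas=2, max_replicas=50))
```

Comparing this expected value with the replica count observed during the test is one way to catch a missing or misconfigured HPA before the event.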
Historical Review ("Learn from History") : A checklist of past issues (e.g., missed web‑chain testing, VPA disabled during events, missing HPA on critical services) is maintained to prevent recurrence.
Technical Guarantees :
DCDN – caching and multi‑region traffic splitting.
Layer‑7 SLB – global rate limiting, automatic failover for multi‑active services.
WAF – IP rate limiting and malicious IP blocking.
PaaS – Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) for dynamic scaling; hybrid‑cloud fallback when on‑prem capacity is exhausted.
Cache – capacity planning, hot‑key monitoring.
DB – automatic degradation when replication lag is high, SQL blacklist, cross-region read load balancing.
Monitoring dashboards – end‑to‑end visibility of business metrics, infrastructure capacity, and middleware health during the event.
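The per-IP rate limiting mentioned for the WAF layer can be sketched with a token bucket. This is a generic illustration, not Bilibili's WAF implementation: the class names, the rate and burst values, and the injectable `now` argument (used so the behavior can be checked deterministically) are all assumptions.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = now

    def allow(self, now):
        # Refill according to elapsed time, never exceeding capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerIpLimiter:
    """Keeps one independent token bucket per client IP."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        bucket = self.buckets.get(ip)
        if bucket is None:
            bucket = self.buckets[ip] = TokenBucket(self.rate, self.capacity, now)
        return bucket.allow(now)


limiter = PerIpLimiter(rate=1, capacity=2)  # 1 request/s, burst of 2
results = [limiter.allow("203.0.113.7", now=0.0) for _ in range(3)]
print(results)  # the third request in the same instant is rejected
```

A global limit at the Layer-7 SLB could reuse the same bucket with a single shared key instead of one key per IP; a production limiter would also need to evict idle buckets.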
Pre‑plan (Contingency) Capabilities : Pre‑defined response plans (traffic shifting, degradation, rate limiting, rollback, restart, scaling) are prioritized by failure probability, activation speed, impact, complexity, and idempotence.
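One way to turn these criteria into an ordering is a simple additive score. The source names the criteria but not how they are combined, so the 1 to 5 scales, the equal weights, the reading of "impact" as the side effect of executing the plan, and the example plans below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    failure_probability: int  # 1-5: likelihood of the failure this plan addresses
    activation_speed: int     # 1-5: higher means the plan takes effect faster
    impact: int               # 1-5: user-visible side effect of executing the plan
    complexity: int           # 1-5: higher means more steps or coordination
    idempotent: bool          # safe to execute repeatedly

    def score(self):
        # Assumed scoring: reward likely, fast, repeatable plans and
        # penalize plans that are disruptive or complex to execute.
        return (self.failure_probability + self.activation_speed
                - self.impact - self.complexity + (1 if self.idempotent else 0))


def prioritize(plans):
    """Order plans from highest to lowest score; ties break by name."""
    return sorted(plans, key=lambda p: (-p.score(), p.name))


plans = [
    Plan("rollback", 2, 3, 3, 3, False),
    Plan("scale_out", 4, 2, 1, 2, True),
    Plan("rate_limit", 4, 5, 2, 1, True),
]
print([p.name for p in prioritize(plans)])
```

In practice the weights would be tuned per event, and the ranking serves mainly to decide which plans to rehearse first and keep closest at hand during the event.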
Post‑mortem & Improvement :
After each event, SRE conducts a structured post‑mortem covering goals, process review, data summary (online users, resource usage), problem analysis, and reflection.
Findings feed back into the checklist and activity templates to continuously improve the assurance workflow.
Conclusion & Outlook : Taking part in event assurance deepens SRE engineers' understanding of the whole system, strengthens individual and team capabilities, and exposes work that should be automated (e.g., the manual resource inventory). Future work aims to embed more of these processes into an activity-assurance platform to further increase efficiency.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.
