How Bilibili SRE Guarantees Million‑User Live Events: Strategies, Tools, and Lessons
This article describes how Bilibili's SRE team supports large‑scale live events. It covers background, activity scenarios, resource planning, performance testing, chaos‑engineering drills, technical safeguards (DCDN, SLB, WAF, PaaS, cache, and database), pre‑plan capabilities, post‑mortem analysis, and future outlook. It shows how systematic capacity management and automated resilience practices keep events with tens of millions of concurrent users stable.
Background
Bilibili hosts multiple large‑scale activities each year (e.g., New Year celebrations, "Most Beautiful Night", LOL World Finals, e‑commerce promotions). The most popular events attract tens of millions of concurrent viewers, requiring robust SRE support to maintain stability without infrastructure or service failures.
Activity Scenarios
Three representative cases illustrate typical challenges:
Unexpected promotional links from external apps (WeChat, Toutiao) can cause sudden traffic spikes that are hard to predict.
Planned in‑app push notifications to all users risk overwhelming services if users click activity links.
Post‑event push notifications can overload non‑activity services, causing outages.
SRE therefore collaborates closely with operations and product teams to gather detailed activity information, including format, key scenes, estimated online users, external links, push plans, post‑event behavior, and timelines.
Resource Preparation
Basic Resources
Core infrastructure resources to be verified include DNS, dynamic CDN (DCDN), static CDN bandwidth, live‑chat bandwidth, DDoS protection, L4/L7 load balancers, WAF, IDC‑to‑cloud and inter‑IDC dedicated bandwidth, NAT bandwidth, network hardware bandwidth, and logging/monitoring.
Logging and monitoring are critical for real‑time incident response; network hardware bandwidth must be sufficient to handle sudden traffic bursts.
Business Resources
Application‑level resources include PaaS (container) capacity, IaaS (bare‑metal) capacity, cache, message queue, KV store, and database resources. Capacity management systems provide visibility into usage levels and buffer capacity, enabling rapid procurement or hybrid‑cloud scaling.
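The buffer‑capacity idea above can be made concrete with a small calculation. This is a minimal sketch, not Bilibili's capacity system: the function names, the per‑instance throughput figure, and the buffer ratio are all illustrative assumptions.

```python
import math


def required_instances(peak_qps: float, per_instance_qps: float,
                       buffer_ratio: float = 0.25) -> int:
    """Instances needed to serve peak_qps while keeping buffer_ratio
    of each instance's capacity unused (hypothetical planning rule)."""
    if per_instance_qps <= 0:
        raise ValueError("per_instance_qps must be positive")
    if not 0 <= buffer_ratio < 1:
        raise ValueError("buffer_ratio must be in [0, 1)")
    usable_qps = per_instance_qps * (1 - buffer_ratio)
    return math.ceil(peak_qps / usable_qps)


def capacity_gap(peak_qps: float, per_instance_qps: float,
                 current_instances: int, buffer_ratio: float = 0.25) -> int:
    """How many extra instances must be procured or scaled out (0 if none)."""
    needed = required_instances(peak_qps, per_instance_qps, buffer_ratio)
    return max(0, needed - current_instances)
```

For example, an estimated peak of 90,000 QPS on instances measured at 1,000 QPS each, with a 25% buffer, needs 120 instances; with 100 already provisioned, the gap to close before the event is 20.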
Performance Testing & Drills
Each activity undergoes up to three rounds of performance testing:
Round 1: Identify bottlenecks using existing resources.
Round 2: Test after resource delivery and service optimizations to confirm that activity goals are met.
Round 3 (optional): Validate final safeguards and capacity plans before launch.
Key focus areas per round include testing tool stability, service bottlenecks, middleware limits, and end‑to‑end chain performance. SRE also coordinates cross‑team tests for shared services (search, payment, etc.).
Chaos‑Engineering Drills
Since 2019, Bilibili has used an internal ChaosBlade‑based platform to inject failures at the node, hardware, upstream/downstream, and middleware levels. More than 3,000 drills have uncovered over 200 hidden issues, improving overall service reliability.
Technical Guarantees
DCDN
Cacheable interfaces (e.g., live‑chat, gift lists) can be cached at DCDN to reduce origin load. DCDN also supports multi‑active, multi‑region traffic steering for same‑city active‑active services.
Layer‑7 SLB
Custom OpenResty‑based SLB provides global rate limiting and automatic failover for active‑active services, protecting API gateways and services from overload.
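Rate limiting of this kind is commonly built on a token bucket. The sketch below is an illustrative single‑process version in Python, not the OpenResty implementation the article refers to; a truly global limit would need state shared across SLB nodes, which is out of scope here. The class name and parameters are assumptions.

```python
import time


class TokenBucket:
    """Token-bucket limiter: refills at `rate` tokens/second up to `burst`."""

    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume tokens if the request may pass."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A request that gets `False` would typically be answered with an HTTP 429 or a degraded response, so that the API gateway and the services behind it never see more than the configured rate.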
WAF
The WAF applies per‑IP rate limiting and blocks malicious IPs based on request characteristics (such as User‑Agent and Referer) to mitigate abuse.
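As a rough illustration of these two checks, the following sketch combines a sliding‑window per‑IP counter with a User‑Agent blocklist. It is a simplified assumption of how such a rule might look, not the WAF's actual rule engine; the class name, verdict strings, and thresholds are hypothetical.

```python
from collections import defaultdict, deque


class PerIpLimiter:
    """Sliding-window per-IP limiter with a simple User-Agent blocklist."""

    def __init__(self, max_requests: int, window_seconds: float,
                 blocked_ua_substrings=()):
        self.max_requests = max_requests
        self.window = window_seconds
        self.blocked = tuple(s.lower() for s in blocked_ua_substrings)
        self.hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def check(self, ip: str, user_agent: str, now: float) -> str:
        """Return 'block', 'throttle', or 'allow' for one request."""
        ua = user_agent.lower()
        if any(s in ua for s in self.blocked):
            return "block"
        q = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return "throttle"
        q.append(now)
        return "allow"
```

Passing `now` explicitly keeps the sketch deterministic; a production rule would read the clock itself and would also need to evict idle IPs to bound memory.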
PaaS (K8s)
Horizontal Pod Autoscaler (HPA) scales services based on CPU/Memory/GPU metrics for gradual traffic growth, while Vertical Pod Autoscaler (VPA) reclaims resources when overall pool usage is high. Hybrid‑cloud nodes can be provisioned within minutes for burst capacity.
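To see why HPA suits gradual growth rather than sudden spikes, it helps to look at the scaling rule documented for the Kubernetes HPA: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up. The sketch below reproduces that rule in Python for illustration; the replica bounds and metric values are made‑up examples, and the real controller also applies stabilization windows and readiness handling that are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 100, tolerance: float = 0.1) -> int:
    """Approximate the Kubernetes HPA rule:
    desired = ceil(current_replicas * current_metric / target_metric),
    skipped when the ratio is within `tolerance` of 1.0."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

With 10 replicas at 90% average CPU and a 60% target, the rule asks for 15 replicas. Because each step is driven by observed metrics, the reaction always trails the traffic, which is one reason events with known peaks are usually pre‑scaled rather than left to autoscaling alone.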
Cache
Capacity planning for Redis/Memcached includes pre‑expansion, hot‑key monitoring, and cautious slot migration during urgent scaling.
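Hot‑key monitoring can be approximated by counting key accesses over a sampling window and flagging keys that take a disproportionate share. The sketch below assumes the accessed keys have already been sampled into a list (for example from proxy or client logs); the function name and thresholds are hypothetical and not taken from Bilibili's tooling.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def find_hot_keys(accessed_keys: Iterable[str], top_n: int = 3,
                  min_share: float = 0.1) -> List[Tuple[str, int]]:
    """Return up to top_n (key, count) pairs whose share of all sampled
    accesses is at least min_share, most frequent first."""
    counts = Counter(accessed_keys)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(key, n) for key, n in counts.most_common(top_n)
            if n / total >= min_share]
```

A key flagged this way, such as a single popular live room, is a candidate for local caching or replication before it saturates one cache shard.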
Database
Automatic read‑replica degradation when replication lag exceeds 120 seconds, SQL black‑listing, and cross‑region read load balancing improve resilience.
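The lag‑based degradation rule can be sketched as a filter over read replicas. The 120‑second threshold comes from the text above; everything else, including the `Replica` type, the fallback to the primary when no replica qualifies, and the choice of the least‑lagged replica, is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List

MAX_LAG_SECONDS = 120  # threshold stated in the article


@dataclass
class Replica:
    name: str
    lag_seconds: float
    healthy: bool = True


def readable_replicas(replicas: List[Replica],
                      max_lag: float = MAX_LAG_SECONDS) -> List[Replica]:
    """Replicas still eligible for reads: healthy and not lagging past max_lag."""
    return [r for r in replicas if r.healthy and r.lag_seconds <= max_lag]


def pick_read_target(replicas: List[Replica], primary: str = "primary",
                     max_lag: float = MAX_LAG_SECONDS) -> str:
    """Choose the least-lagged eligible replica; fall back to the primary
    (an assumed policy) when every replica has been degraded."""
    candidates = readable_replicas(replicas, max_lag)
    if not candidates:
        return primary
    return min(candidates, key=lambda r: r.lag_seconds).name
```

Falling back to the primary protects read correctness but shifts load onto it, so in practice this would be paired with the rate limiting and SQL black‑listing mentioned above.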
Monitoring Dashboards
Unified dashboards display business metrics, infrastructure capacity, middleware health, and real‑time alerts, serving as the central view during live events.
Pre‑Plan Capability
Pre‑defined runbooks cover traffic shifting, service degradation, rate limiting, rollback, restart, and scaling. Prioritization considers failure probability, activation speed, impact, complexity, and idempotence.
Post‑Event Review
After each activity, SRE conducts a comprehensive post‑mortem covering goals, process review, data summary (online users, resource usage), problem analysis, and reflection. Checklists ensure recurring issues are addressed in future templates.
Outlook
Continuous improvement aims to automate manual steps (e.g., resource inventory) and integrate them into an activity‑assurance platform, further enhancing efficiency for future mega‑events such as the upcoming "Most Beautiful Night" and S12 finals.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
