Design and Implementation of Bilibili's Emergency Response Center for Incident Management
Bilibili’s Emergency Response Center (ERC) unifies incident detection, rapid response, precise scoping, and coordinated recovery through multi‑dimensional alerts, automated collaboration, standardized updates, and post‑mortem analysis. It targets 1‑minute detection, 3‑minute response, 5‑minute scoping, and 10‑minute recovery (the "1‑3‑5‑10" goals). Since launch, it has cut MTTR, achieved over 80% automatic recall accuracy, and reached more than 60% compliance with the 1‑3‑5‑10 targets.
Background
As the business grows and requirements iterate rapidly, even the best architecture and production systems cannot guarantee 100% availability. By Murphy's law, failures in production are inevitable. Locating and containing faults, and preventing failures with the same root cause from recurring, requires unified management of the full incident lifecycle.
Challenges of Stability Assurance
Typical incidents include releases of non‑core services that bring down core services, CDN interruptions, OOM crashes, and data anomalies. The main pain points are incomplete detection coverage, difficulty in scoping and locating faults, stale runbooks, and a lack of coordinated response.
Incident Lifecycle
The lifecycle consists of occurrence, detection, response, scoping, locating, damage‑control, and recovery. Scoping narrows down where the fault is. Locating has two levels: initial‑cause locating identifies the immediate trigger, and root‑cause locating uncovers the underlying reason.
ERC (Emergency Response Center) Design
ERC aims for “1‑minute detection, 3‑minute response, 5‑minute scoping, 10‑minute recovery.” It has evolved through three versions: 1.0 Issue Management (basic post‑mortem), 2.0 Event Collaboration (notification and post‑mortem modules), and 3.0 Full ERC (covering detection, handling, and recovery).
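To make the 1‑3‑5‑10 targets concrete, here is a minimal sketch of a per‑incident check. It assumes each stage is a deadline measured from the moment the fault occurred. That interpretation, along with the field names and the check_targets helper, is hypothetical and not taken from ERC itself.

```python
from datetime import datetime, timedelta

# Targets from the text, interpreted (as an assumption) as deadlines
# measured from the moment the fault occurred.
TARGETS = {
    "detected": timedelta(minutes=1),
    "responded": timedelta(minutes=3),
    "scoped": timedelta(minutes=5),
    "recovered": timedelta(minutes=10),
}


def check_targets(occurred: datetime, stamps: dict) -> dict:
    """Return, per stage, whether the stage finished within its target.

    A stage with no recorded timestamp counts as not meeting its target.
    """
    return {
        stage: stage in stamps and stamps[stage] - occurred <= limit
        for stage, limit in TARGETS.items()
    }


# Hypothetical incident: scoping took 7 minutes, everything else was on time.
t0 = datetime(2024, 1, 1, 12, 0, 0)
result = check_targets(t0, {
    "detected": t0 + timedelta(seconds=40),
    "responded": t0 + timedelta(minutes=2),
    "scoped": t0 + timedelta(minutes=7),
    "recovered": t0 + timedelta(minutes=9),
})
print(result)
```

Measuring every stage from the same start time keeps the check simple. A real system might instead measure each stage from the end of the previous one.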
Detection
Detection relies on multi‑dimensional alerts (system metrics, middleware health, application errors) and customer‑service feedback. To reduce alert fatigue, SREs prioritize alerts that are likely to cause user‑visible failures.
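One way to picture this prioritization is a simple sort that puts alerts likely to cause user‑visible failures first. The sketch below is illustrative only: the Alert fields, the user_visible flag, and the 1 to 3 severity scale are assumptions, not ERC's actual alert model.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    source: str         # e.g. "system", "middleware", "application"
    user_visible: bool  # hypothetical flag: likely to cause a user-visible failure?
    severity: int       # hypothetical scale: 1 (low) to 3 (high)


def prioritize(alerts):
    """Sort so user-visible alerts come first, then by descending severity."""
    return sorted(alerts, key=lambda a: (not a.user_visible, -a.severity))


alerts = [
    Alert("disk usage 80%", "system", user_visible=False, severity=2),
    Alert("playback 5xx spike", "application", user_visible=True, severity=3),
    Alert("redis latency", "middleware", user_visible=True, severity=2),
]
ordered = [a.name for a in prioritize(alerts)]
print(ordered)
```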
Handling
Key practices include:
Public incident announcement to improve transparency.
Clear command structure with a fault commander and frontline responders.
Automated collaboration groups triggered by ERC when a fault is recorded.
Standardized incident update templates (title, time, impact, status, description).
Mobile support enables SREs to respond from anywhere, ensuring timely decisions.
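The standardized update template listed above (title, time, impact, status, description) could be modeled as a small data class. This is a sketch under assumptions: the class name, the render method, and the output layout are hypothetical, and only the five field names come from the text.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class IncidentUpdate:
    # The five fields named in the text; types are an assumption.
    title: str
    time: datetime
    impact: str
    status: str
    description: str

    def render(self) -> str:
        """Format the update as plain text, one field per line."""
        return "\n".join([
            f"Title: {self.title}",
            f"Time: {self.time:%Y-%m-%d %H:%M}",
            f"Impact: {self.impact}",
            f"Status: {self.status}",
            f"Description: {self.description}",
        ])


# Hypothetical example values.
update = IncidentUpdate(
    title="Playback errors on mobile",
    time=datetime(2024, 1, 1, 12, 5),
    impact="Some users cannot start videos",
    status="Mitigating",
    description="Rolling back the most recent release.",
)
message = update.render()
print(message)
```

Keeping the fields fixed means every update in a collaboration group reads the same way, so responders who join late can catch up quickly.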
Damage‑Control (Mitigation)
Rollback of recent changes.
Degradation of non‑critical features.
Traffic shifting (multi‑active deployment).
Additional measures: rate‑limiting, auto‑scaling, service restart.
Recovery
After mitigation, services are restored to their pre‑incident state. ERC tracks MTTR and its sub‑metrics:
MTTI – Mean Time to Identify (detection).
MTTK – Mean Time to Know (scoping).
MTTF – Mean Time to Fix (repair); not to be confused with the more common expansion, mean time to failure.
MTTV – Mean Time to Verify (validation).
Optimizing each stage reduces overall MTTR.
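As a worked illustration, the sub‑metrics can be computed from per‑incident timestamps. The sketch below assumes each stage starts where the previous one ends, so that MTTR is the sum of the four sub‑metrics. Those stage boundaries and the field names (occurred, detected, scoped, fixed, verified) are assumptions, not ERC's definitions.

```python
from datetime import datetime, timedelta
from statistics import mean

# (metric, start field, end field); the boundaries are an assumption.
STAGES = [
    ("MTTI", "occurred", "detected"),
    ("MTTK", "detected", "scoped"),
    ("MTTF", "scoped", "fixed"),
    ("MTTV", "fixed", "verified"),
]


def stage_means(incidents):
    """Mean duration of each stage in minutes, plus MTTR as their sum."""
    out = {
        metric: mean((i[end] - i[start]).total_seconds() / 60 for i in incidents)
        for metric, start, end in STAGES
    }
    out["MTTR"] = sum(out.values())
    return out


def incident(t0, *minutes):
    """Build a hypothetical incident from cumulative minute offsets."""
    fields = ["detected", "scoped", "fixed", "verified"]
    rec = {"occurred": t0}
    rec.update({f: t0 + timedelta(minutes=m) for f, m in zip(fields, minutes)})
    return rec


t0 = datetime(2024, 1, 1, 12, 0)
metrics = stage_means([
    incident(t0, 1, 4, 9, 10),   # stage durations: 1, 3, 5, 1 minutes
    incident(t0, 3, 8, 15, 18),  # stage durations: 3, 5, 7, 3 minutes
])
print(metrics)
```

Breaking MTTR down this way shows which stage dominates. In the hypothetical data above, fixing takes longest, so that is where optimization would pay off first.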
Post‑Incident Review
ERC generates a timeline of the incident phases and provides a templated post‑mortem covering impact analysis, solution, reflection, loss quantification, and improvement actions with owners and deadlines.
Data Operations
Metrics are continuously analyzed to evaluate prevention effectiveness, recall accuracy, 1‑3‑5‑10 compliance, MTTR trends, fault cause distribution, and change‑induced incidents. These insights drive platform enhancements.
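A few of these indicators reduce to simple aggregations over incident records. The following sketch computes a fault cause distribution, the share of change‑induced incidents, and a 1‑3‑5‑10 compliance rate. The record fields and cause labels are hypothetical and stand in for whatever ERC actually stores.

```python
from collections import Counter

# Hypothetical incident records; field names and cause labels are assumptions.
incidents = [
    {"cause": "change", "met_1_3_5_10": True},
    {"cause": "capacity", "met_1_3_5_10": False},
    {"cause": "change", "met_1_3_5_10": True},
    {"cause": "dependency", "met_1_3_5_10": False},
]


def summarize(records):
    """Aggregate cause distribution, change-induced share, and compliance rate."""
    if not records:
        raise ValueError("no incident records to summarize")
    total = len(records)
    causes = Counter(r["cause"] for r in records)
    return {
        "cause_distribution": dict(causes),
        "change_induced_share": causes["change"] / total,
        "compliance_rate": sum(r["met_1_3_5_10"] for r in records) / total,
    }


summary = summarize(incidents)
print(summary)
```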
Results and Outlook
Since launch, ERC has achieved >80% automatic recall accuracy, >60% compliance with the 1‑3‑5‑10 targets, and a noticeable reduction in MTTR. Future work includes automated root‑cause recommendation, expanded SLO quantification, infrastructure‑level incident recall, and fault‑specific operations.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.