How Bilibili Built an Emergency Response Center to Slash MTTR and Boost System Stability
This article explains how Bilibili designed and implemented an Emergency Response Center (ERC) to manage the full fault lifecycle—detection, response, delimitation, localization, mitigation and recovery—using alert rules, automated recall, integrated customer feedback, clear role assignments, mobile support, and data‑driven post‑mortems, ultimately reducing MTTR and improving service reliability.
Background
Rapid business growth and frequent releases make 100% availability impossible. A unified fault‑lifecycle management system is required to detect, delimit, locate, mitigate, and recover from incidents.
Fault Lifecycle and ERC Goals
The fault lifecycle consists of occurrence → detection → response → delimitation → localization → mitigation → recovery. Bilibili targets 1‑minute detection, 3‑minute response, 5‑minute delimitation, and 10‑minute recovery (the "1‑3‑5‑10" targets). The Emergency Response Center (ERC) centralizes these phases and provides a single platform for incident handling.
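A minimal sketch of how the 1‑3‑5‑10 targets could be checked for a single incident. The phase names, the `check_1_3_5_10` function, and the choice to measure every target from the moment of fault occurrence are assumptions for illustration; the source does not say how ERC measures compliance.

```python
from datetime import datetime, timedelta

# Targets from the text. Measuring each one from fault occurrence is an assumption.
TARGETS = {
    "detected": timedelta(minutes=1),
    "responded": timedelta(minutes=3),
    "delimited": timedelta(minutes=5),
    "recovered": timedelta(minutes=10),
}


def check_1_3_5_10(occurred: datetime, stamps: dict) -> dict:
    """Return, per phase, whether the phase completed within its target.

    `stamps` maps a phase name to the time that phase completed.
    A phase with no timestamp counts as not meeting its target.
    """
    return {
        phase: phase in stamps and stamps[phase] - occurred <= limit
        for phase, limit in TARGETS.items()
    }
```

An incident counts as fully compliant only when every value in the returned mapping is true.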
Fault Detection
Detection relies on a comprehensive set of alert rules covering:
System dimensions: CPU, memory, I/O, etc.
Middleware: MySQL replication lag, slow queries, message‑queue backlog, consumer failures.
Application metrics: error logs, error counts, request latency, business‑level KPIs such as core‑app availability, login failures, recharge failures, empty search results.
Customer‑service feedback is also ingested; repeated complaints above a threshold automatically create a fault entry.
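The complaint threshold could work roughly like a sliding‑window counter. The sketch below is hypothetical: the `ComplaintCounter` class, the topic keys, and the window length are illustrative and are not taken from ERC.

```python
from collections import deque


class ComplaintCounter:
    """Signal a fault once complaints on a topic reach a threshold within a window."""

    def __init__(self, threshold: int, window_s: float):
        self.threshold = threshold
        self.window_s = window_s
        self._seen = {}  # topic -> deque of complaint timestamps (seconds)

    def record(self, topic: str, ts: float) -> bool:
        """Record one complaint; return True if a fault entry should be created."""
        q = self._seen.setdefault(topic, deque())
        q.append(ts)
        # Drop complaints that have fallen out of the window.
        while q and ts - q[0] > self.window_s:
            q.popleft()
        return len(q) >= self.threshold
```

A real system would also need to avoid opening a second fault for the same topic while the first is still active.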
Fault Handling
When a fault is detected, ERC automatically:
Publishes a transparent fault announcement.
Creates a collaboration group based on service‑tree mappings.
Pulls relevant stakeholders (owners, on‑call engineers, product owners).
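One way to picture the service‑tree mapping is a lookup that walks from the affected node up to the root and collects owners along the way. The tree layout, node names, and `stakeholders` function below are assumptions for illustration, not ERC's actual data model.

```python
# Hypothetical service tree: dotted path -> metadata with a list of owners.
SERVICE_TREE = {
    "main": {"owners": ["sre-lead"]},
    "main.account": {"owners": ["account-pm"]},
    "main.account.login": {"owners": ["login-dev", "login-oncall"]},
}


def stakeholders(node: str, tree: dict) -> list:
    """Collect owners from `node` up to the root, nearest first, without duplicates."""
    people = []
    parts = node.split(".")
    while parts:
        entry = tree.get(".".join(parts), {})
        for person in entry.get("owners", []):
            if person not in people:
                people.append(person)
        parts.pop()  # move one level up the tree
    return people
```

The resulting list is what would be pulled into the collaboration group for that service.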
Escalation policy:
If no response within 1 minute, a phone call is placed to the on‑call SRE.
If still no response after 3 minutes, the SRE leader is called.
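The escalation policy above can be sketched as a small table of delays and call targets. The role labels and the `escalation_targets` function are hypothetical; only the 1‑minute and 3‑minute thresholds come from the text.

```python
# (seconds without acknowledgement, who gets a phone call) from the policy above.
ESCALATION_STEPS = [
    (60, "oncall_sre"),
    (180, "sre_leader"),
]


def escalation_targets(seconds_unacked: float) -> list:
    """Return everyone who should have been phoned after this long without a response."""
    return [who for delay, who in ESCALATION_STEPS if seconds_unacked >= delay]
```

Keeping the steps in a table makes it easy to give each emergency scenario its own escalation strategy.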
Automation and Fault Recall
Automatic recall opens an incident automatically when selected business‑impact metrics cross defined thresholds. Each emergency scenario is configured with:
Scenario name.
Associated business (linked to the service‑tree).
Maintainer (SRE owner).
Fixed collaboration group (pre‑created chat).
On‑call linkage (integration with the on‑call platform).
Escalation strategy (1‑minute phone, 3‑minute leader call).
Rule expression: metric OP threshold FOR duration (e.g., core_app_availability < 95% FOR 2m).
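To make the rule form `metric OP threshold FOR duration` concrete, here is a small parser and evaluator. It is only a sketch: the grammar (four comparison operators, an optional `%`, durations in `s` or `m`) and the "held continuously up to the latest sample" semantics are assumptions, and `parse_rule` and `should_fire` are hypothetical names.

```python
import operator
import re

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}
RULE_RE = re.compile(r"^\s*(\w+)\s*(<=|>=|<|>)\s*([\d.]+)%?\s+FOR\s+(\d+)([sm])\s*$")


def parse_rule(expr: str):
    """Parse 'metric OP threshold FOR duration' into (metric, op, threshold, seconds)."""
    m = RULE_RE.match(expr)
    if not m:
        raise ValueError(f"unrecognized rule: {expr!r}")
    metric, op, threshold, n, unit = m.groups()
    return metric, OPS[op], float(threshold), int(n) * (60 if unit == "m" else 1)


def should_fire(expr: str, samples: list) -> bool:
    """samples: time-ordered (timestamp_seconds, value) pairs for the rule's metric.

    Fires when the condition has held on every sample for at least the
    rule's duration, ending at the most recent sample.
    """
    _, op, threshold, duration = parse_rule(expr)
    breach_start = None
    for ts, value in samples:
        if op(value, threshold):
            if breach_start is None:
                breach_start = ts
        else:
            breach_start = None  # condition cleared, restart the clock
    return breach_start is not None and samples[-1][0] - breach_start >= duration
```

The `FOR` clause is what keeps a single bad sample from opening an incident: the breach has to persist before recall fires.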
Recall suppression mechanisms include:
Converging alerts from the same business domain within a time window.
Manual merging of duplicate incidents.
Limits are set to avoid "cry‑wolf" false alarms; rate limiting prevents excessive recall from high‑frequency alerts, such as those raised by throttling.
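Convergence within a time window could look like the sketch below, where alerts from the same business domain attach to an already open incident instead of creating a new one. The `RecallConverger` class and its fixed window measured from incident creation are assumptions for illustration.

```python
class RecallConverger:
    """Attach alerts from one business domain to a single incident per time window."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self._open = {}  # domain -> (incident_id, opened_ts)
        self._next_id = 1

    def on_alert(self, domain: str, ts: float):
        """Return (incident_id, created); created is False when the alert was converged."""
        current = self._open.get(domain)
        if current is not None and ts - current[1] <= self.window_s:
            return current[0], False
        incident_id = self._next_id
        self._next_id += 1
        self._open[domain] = (incident_id, ts)
        return incident_id, True
```

Manual merging covers the cases this misses, for example one root cause that surfaces in two different business domains.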
Collaboration Model
ERC adopts a two‑role model inspired by the Incident Command System on which Google's incident management is based:
Fault Commander: Assembles the team, assigns tasks, makes decisions, and escalates if needed.
Frontline Engineers: Diagnose, contain, and resolve the issue.
The first responder becomes the commander; escalation to higher‑level leaders occurs for complex or wide‑impact incidents.
Mobile Support
A mobile client enables SREs to receive alerts, join collaboration groups, and execute mitigation actions from any location, ensuring 24/7 responsiveness without a dedicated NOC.
Post‑mortem and Data Operations
After recovery, ERC generates a default timeline covering all lifecycle stages and provides a structured post‑mortem template with fields:
Title, time, impact, status.
Description, solution, reflection.
Improvement items, responsibility, metrics.
Key performance indicators tracked per incident:
MTTI (Mean Time to Identify): time from fault occurrence to detection.
MTTK (Mean Time to Know): time from detection to delimitation/localization.
MTTF (Mean Time to Fix): time from delimitation to mitigation.
MTTV (Mean Time to Verify): time from mitigation to verification of full recovery.
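Given the timeline ERC generates, the four indicators reduce to differences between consecutive timestamps, averaged over incidents. The sketch below assumes a timeline dictionary with hypothetical keys (`occurred`, `detected`, `delimited`, `mitigated`, `verified`); the real ERC schema is not described in the source.

```python
from datetime import datetime
from statistics import mean


def incident_kpis(t: dict) -> dict:
    """Per-incident durations in seconds, following the definitions above."""
    return {
        "MTTI": (t["detected"] - t["occurred"]).total_seconds(),
        "MTTK": (t["delimited"] - t["detected"]).total_seconds(),
        "MTTF": (t["mitigated"] - t["delimited"]).total_seconds(),
        "MTTV": (t["verified"] - t["mitigated"]).total_seconds(),
    }


def mean_kpis(timelines: list) -> dict:
    """Average each indicator across incidents (the 'mean' in each name)."""
    per_incident = [incident_kpis(t) for t in timelines]
    return {k: mean(p[k] for p in per_incident) for k in ("MTTI", "MTTK", "MTTF", "MTTV")}
```

Because the four intervals are contiguous, their sum for one incident equals the time from occurrence to verified recovery.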
Platform Evolution
Three major releases:
1.0 Problem Management (2019): Basic issue recording, limited post‑mortem support.
2.0 Event Collaboration: Added real‑time fault announcements, subscription‑based notifications, and improved post‑mortem management.
3.0 Emergency Response Center: Full fault‑lifecycle platform with 1‑3‑5‑10 targets, automatic recall, mobile client, and integrated service‑tree mapping.
Key architectural components include:
Service‑tree metadata for automatic stakeholder mapping.
OpenAPI for third‑party integration (e.g., customer‑service systems).
Dashboard visualizing fault lifecycle and metrics.
Results and Outlook
Since launch:
Automatic recall accuracy and precision > 80%.
Overall 1‑3‑5‑10 compliance > 60%.
Significant reduction in MTTR.
Future work focuses on:
Pre‑plan platform with root‑cause recommendations.
SLO‑driven ERC recall.
Infrastructure‑fault integration (network, switch failures).
Fault sub‑operations and deeper automation.
dbaplus Community
