
Design and Implementation of Bilibili's Emergency Response Center for Incident Management

Bilibili’s Emergency Response Center (ERC) unifies incident detection, rapid response, precise scoping, and coordinated recovery through multi‑dimensional alerts, automated collaboration, standardized updates, and post‑mortem analysis. It targets 1‑minute detection, 3‑minute response, 5‑minute scoping, and 10‑minute recovery. Since launch it has cut MTTR, achieved over 80% automatic recall accuracy, and reached more than 60% compliance with the 1‑3‑5‑10 targets.

Bilibili Tech

Aug 16, 2024


Background

As the business grows and requirements iterate daily, even the best architecture and production system cannot guarantee 100% availability. By Murphy's law, failures in production are inevitable. To locate and contain failures quickly, and to keep the same root causes from recurring, the full incident lifecycle needs unified management.

Challenges of Stability Assurance

Typical incidents include releases of non‑core services that take down core services, CDN interruptions, OOM crashes, and data anomalies. The main pain points are incomplete detection coverage, difficulty in scoping and locating faults, stale runbooks, and a lack of coordinated response.

Incident Lifecycle

The lifecycle consists of seven phases: occurrence, detection, response, scoping, locating, damage control, and recovery. Scoping determines where the fault lies. Locating has two levels: initial‑cause locating identifies the immediate trigger, and root‑cause locating uncovers the fundamental reason.

ERC (Emergency Response Center) Design

ERC aims for “1‑minute detection, 3‑minute response, 5‑minute scoping, 10‑minute recovery.” It has evolved through three versions: 1.0 Issue Management (basic post‑mortem), 2.0 Event Collaboration (notification and post‑mortem modules), and 3.0 Full ERC (covering detection, handling, and recovery).
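To make the 1-3-5-10 goal concrete, the following minimal sketch checks one incident against the four targets. The `IncidentTimes` record and the `check_1_3_5_10` function are hypothetical names, and measuring every target as an offset from the moment the fault occurred is an assumption of this sketch; the article does not say how ERC measures each stage.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Targets from the article. Treating them as cumulative offsets from the
# moment the fault occurred is an assumption of this sketch.
TARGETS = {
    "detected": timedelta(minutes=1),
    "responded": timedelta(minutes=3),
    "scoped": timedelta(minutes=5),
    "recovered": timedelta(minutes=10),
}


@dataclass
class IncidentTimes:
    """Hypothetical record of the key timestamps of one incident."""
    occurred: datetime
    detected: datetime
    responded: datetime
    scoped: datetime
    recovered: datetime


def check_1_3_5_10(t: IncidentTimes) -> dict[str, bool]:
    """Return, per stage, whether the stage finished within its target."""
    return {
        stage: getattr(t, stage) - t.occurred <= limit
        for stage, limit in TARGETS.items()
    }


start = datetime(2024, 1, 1, 12, 0, 0)
example = IncidentTimes(
    occurred=start,
    detected=start + timedelta(seconds=45),
    responded=start + timedelta(minutes=2),
    scoped=start + timedelta(minutes=7),   # misses the 5-minute target
    recovered=start + timedelta(minutes=9),
)
result = check_1_3_5_10(example)
```

In this example only the scoping stage misses its target, which is the kind of per-stage signal that can show where response time is being lost.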

Detection

Detection relies on multi‑dimensional alerts (system metrics, middleware health, application errors) and customer‑service feedback. To reduce alert fatigue, SREs prioritize the alerts most likely to indicate user‑visible failures.
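One way to picture this prioritization is a simple scoring function over incoming alerts. This is only an illustrative sketch: the `Alert` fields, the weights, and the 5% error-rate threshold are assumptions, not ERC's actual rules.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert record; field names are illustrative."""
    source: str         # "system", "middleware", "application", "customer_service"
    service: str
    core_service: bool  # whether the service is on a user-facing critical path
    error_rate: float   # fraction of failing requests, 0.0 to 1.0


def priority(alert: Alert) -> int:
    """Score an alert by how likely it is to reflect user-visible impact."""
    score = 0
    if alert.core_service:
        score += 2  # core services affect users directly
    if alert.source == "customer_service":
        score += 2  # users are already reporting a problem
    if alert.error_rate >= 0.05:
        score += 1  # assumed threshold for a noticeable error rate
    return score


def triage(alerts: list[Alert]) -> list[Alert]:
    """Order alerts from most to least likely to be user-visible."""
    return sorted(alerts, key=priority, reverse=True)
```

A rule-based score like this is easy to audit and tune; whatever the real mechanism, the point is that low-impact alerts should not compete for attention with those on user-facing paths.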

Handling

Key practices include:

Public incident announcement to improve transparency.

Clear command structure with a fault commander and frontline responders.

Automated collaboration groups triggered by ERC when a fault is recorded.

Standardized incident update templates (title, time, impact, status, description).

Mobile support enables SREs to respond from anywhere, ensuring timely decisions.
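The standardized update template listed above can be sketched as a small data structure. The five fields come from the article; the `IncidentUpdate` class, the `render` method, and the output layout are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class IncidentUpdate:
    """One status update with the five fields named in the article."""
    title: str
    time: datetime
    impact: str
    status: str
    description: str

    def render(self) -> str:
        """Format the update as plain text (layout is an assumption)."""
        return "\n".join([
            f"[{self.status}] {self.title}",
            f"Time: {self.time:%Y-%m-%d %H:%M}",
            f"Impact: {self.impact}",
            f"Description: {self.description}",
        ])


update = IncidentUpdate(
    title="Playback errors",
    time=datetime(2024, 1, 1, 12, 5),
    impact="Some users cannot start playback",
    status="Mitigating",
    description="Rolling back the latest release",
)
message = update.render()
```

Keeping every update in the same shape lets readers in the collaboration group scan for status and impact without rereading the whole thread.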

Damage‑Control (Mitigation)

Rollback of recent changes.

Degradation of non‑critical features.

Traffic shifting (multi‑active deployment).

Additional measures: rate‑limiting, auto‑scaling, service restart.
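For the rollback option in particular, a responder first needs to know which changes landed shortly before the incident. The sketch below lists such candidates; the `Change` record, the `rollback_candidates` function, and the 30-minute look-back window are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    """Hypothetical record of a production change."""
    service: str
    deployed_at: datetime


def rollback_candidates(
    changes: list[Change],
    incident_start: datetime,
    window: timedelta = timedelta(minutes=30),  # assumed look-back window
) -> list[Change]:
    """Return changes deployed within `window` before the incident, newest first."""
    recent = [
        c for c in changes
        if timedelta(0) <= incident_start - c.deployed_at <= window
    ]
    return sorted(recent, key=lambda c: c.deployed_at, reverse=True)
```

Changes deployed after the incident started, or long before it, are excluded, so the list stays short enough to act on during mitigation.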

Recovery

After mitigation, services are restored to their pre‑incident state. ERC tracks MTTR and its sub‑metrics:

MTTI – Mean Time to Identify (detection).

MTTK – Mean Time to Know (scoping).

MTTF – Mean Time to Fix (repair).

MTTV – Mean Time to Verify (validation).

Optimizing each stage reduces overall MTTR.
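As a worked example, the sketch below averages each stage over a set of incidents and sums the stage means. Treating MTTR as the sum of the four sub-metrics is an assumption of this sketch, and the durations (in minutes) are made-up sample data.

```python
from statistics import mean

# Per-incident stage durations in minutes (made-up sample data).
incidents = [
    {"identify": 1.0, "know": 4.0, "fix": 6.0, "verify": 2.0},
    {"identify": 3.0, "know": 2.0, "fix": 10.0, "verify": 4.0},
]


def mttr_breakdown(incidents: list[dict[str, float]]) -> dict[str, float]:
    """Average each stage, then sum the stage means into MTTR."""
    stages = ("identify", "know", "fix", "verify")
    # "identify" -> "MTTI", "know" -> "MTTK", "fix" -> "MTTF", "verify" -> "MTTV"
    out = {f"MTT{s[0].upper()}": mean(i[s] for i in incidents) for s in stages}
    out["MTTR"] = sum(out.values())  # assumed additive decomposition
    return out


breakdown = mttr_breakdown(incidents)
# breakdown == {"MTTI": 2.0, "MTTK": 3.0, "MTTF": 8.0, "MTTV": 3.0, "MTTR": 16.0}
```

Here the fix stage accounts for half of the total, so it would be the first place to look for improvement.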

Post‑Incident Review

ERC generates a timeline of the incident phases and provides a templated post‑mortem covering impact analysis, solution, reflection, loss quantification, and improvement actions with owners and deadlines.

Data Operations

Metrics are continuously analyzed to evaluate prevention effectiveness, recall accuracy, 1‑3‑5‑10 compliance, MTTR trends, fault cause distribution, and change‑induced incidents. These insights drive platform enhancements.
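A few of these indicators can be computed with a simple aggregation over incident records. The record fields and the `summarize` function below are hypothetical, and the sample data is made up; recall accuracy is left out because the article does not define how it is calculated.

```python
from collections import Counter

# Made-up incident records; field names are illustrative.
incidents = [
    {"cause": "change", "met_1_3_5_10": True},
    {"cause": "capacity", "met_1_3_5_10": False},
    {"cause": "change", "met_1_3_5_10": True},
    {"cause": "dependency", "met_1_3_5_10": False},
]


def summarize(incidents: list[dict]) -> dict:
    """Compute 1-3-5-10 compliance, cause distribution, and change-induced share."""
    if not incidents:
        raise ValueError("no incidents to summarize")
    total = len(incidents)
    causes = Counter(i["cause"] for i in incidents)
    return {
        "compliance_rate": sum(i["met_1_3_5_10"] for i in incidents) / total,
        "cause_distribution": dict(causes),
        "change_induced_share": causes["change"] / total,
    }


summary = summarize(incidents)
```

In this sample, half of the incidents met the 1-3-5-10 targets and half were caused by changes.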

Results and Outlook

Since launch, ERC has achieved >80% automatic recall accuracy, >60% compliance with the 1‑3‑5‑10 targets, and a noticeable reduction in MTTR. Future work includes automated root‑cause recommendation, expanded SLO quantification, infrastructure‑level incident recall, and fault‑specific operations.



