How Bilibili Built an Emergency Response Center to Slash MTTR and Boost System Stability

This article explains how Bilibili designed and implemented an Emergency Response Center (ERC) to manage the full fault lifecycle—detection, response, delimitation, localization, mitigation and recovery—using alert rules, automated recall, integrated customer feedback, clear role assignments, mobile support, and data‑driven post‑mortems, ultimately reducing MTTR and improving service reliability.

dbaplus Community

Sep 13, 2024

Background

Rapid business growth and frequent releases make 100% availability impossible. A unified fault‑lifecycle management system is required to detect, delimit, locate, mitigate, and recover from incidents.

Fault Lifecycle and ERC Goals

The fault lifecycle consists of occurrence → detection → response → delimitation → localization → mitigation → recovery. Bilibili targets 1‑minute detection, 3‑minute response, 5‑minute delimitation, and 10‑minute recovery (the "1‑3‑5‑10" targets). The Emergency Response Center (ERC) centralizes these phases in a single platform for incident handling.
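
To make the 1‑3‑5‑10 targets concrete, here is a minimal Python sketch of a compliance check for one incident. It assumes every target is measured from the moment the fault occurred, which the article does not state explicitly; the milestone names and the `check_targets` helper are hypothetical and not part of ERC.

```python
from datetime import datetime, timedelta

# Targets from the text. Assumption: each limit is counted from fault occurrence.
TARGETS = {
    "detected": timedelta(minutes=1),
    "responded": timedelta(minutes=3),
    "delimited": timedelta(minutes=5),
    "recovered": timedelta(minutes=10),
}


def check_targets(occurred, milestones):
    """Return {milestone: met?} for every milestone that has a timestamp."""
    return {
        name: (milestones[name] - occurred) <= limit
        for name, limit in TARGETS.items()
        if name in milestones
    }


# Illustrative timeline (made-up timestamps, not real incident data).
occurred = datetime(2024, 9, 13, 10, 0, 0)
milestones = {
    "detected": occurred + timedelta(seconds=45),
    "responded": occurred + timedelta(minutes=2),
    "delimited": occurred + timedelta(minutes=6),   # misses the 5-minute target
    "recovered": occurred + timedelta(minutes=9),
}
result = check_targets(occurred, milestones)
```

In this example only the delimitation target is missed, so the incident would not count as fully 1‑3‑5‑10 compliant.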

Fault Detection

Detection relies on a comprehensive set of alert rules covering:

System dimensions: CPU, memory, I/O, etc.

Middleware: MySQL replication lag, slow queries, message‑queue backlog, consumer failures.

Application metrics: error logs, error counts, request latency, business‑level KPIs such as core‑app availability, login failures, recharge failures, empty search results.

Customer‑service feedback is also ingested; repeated complaints above a threshold automatically create a fault entry.
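
The feedback threshold can be pictured as a sliding‑window counter. The sketch below is one possible shape, not ERC's implementation: the class name, the default threshold of 5 complaints, and the 5‑minute window are all assumptions chosen for illustration.

```python
from collections import deque


class FeedbackTrigger:
    """Counts complaints per topic inside a sliding time window (hypothetical helper)."""

    def __init__(self, threshold=5, window_seconds=300):
        self.threshold = threshold            # assumed value, not from the article
        self.window_seconds = window_seconds  # assumed value, not from the article
        self._times = {}                      # topic -> deque of complaint timestamps

    def report(self, topic, ts):
        """Record one complaint at time `ts` (seconds).

        Returns True when the number of complaints in the window reaches the
        threshold, i.e. when a fault entry should be created.
        """
        times = self._times.setdefault(topic, deque())
        times.append(ts)
        # Drop complaints that have fallen out of the window.
        while times and ts - times[0] > self.window_seconds:
            times.popleft()
        return len(times) >= self.threshold


trigger = FeedbackTrigger(threshold=3, window_seconds=60)
first = trigger.report("login_failure", 0)
second = trigger.report("login_failure", 10)
third = trigger.report("login_failure", 20)   # third complaint within 60 s
```

A real system would also need to avoid opening a second fault entry while the first is still active; that deduplication is left out here.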

Fault Handling

When a fault is detected, ERC automatically:

Publishes a transparent fault announcement.

Creates a collaboration group based on service‑tree mappings.

Pulls relevant stakeholders (owners, on‑call engineers, product owners).
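
As a rough illustration of service‑tree mapping, the sketch below collects stakeholders by walking from the affected service up through its ancestors. The tree layout, the dotted paths, the role names, and the people are all invented for the example; the article does not describe the real service‑tree schema.

```python
# Hypothetical service tree: dotted path -> people attached at that node.
SERVICE_TREE = {
    "main": {"product_owners": ["po_main"]},
    "main.account": {"owners": ["alice"], "oncall": ["sre_bob"]},
    "main.account.login": {"owners": ["carol"]},
}

ROLES = ("owners", "oncall", "product_owners")


def stakeholders_for(service_path):
    """Collect people from the service node and all of its ancestors.

    The closest node is visited first, so the most specific owners
    appear at the front of the list. Duplicates are skipped.
    """
    members = []
    parts = service_path.split(".")
    for depth in range(len(parts), 0, -1):
        node = SERVICE_TREE.get(".".join(parts[:depth]), {})
        for role in ROLES:
            for person in node.get(role, []):
                if person not in members:
                    members.append(person)
    return members


group_members = stakeholders_for("main.account.login")
```

The resulting list is what a platform like this could use to populate the collaboration group automatically.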

Escalation policy:

If no response within 1 minute, a phone call is placed to the on‑call SRE.

If still no response after 3 minutes, the SRE leader is called.
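
The escalation policy amounts to a small lookup from "time without acknowledgement" to "who gets called". The sketch below assumes both timers start when the incident is created, which the article leaves ambiguous; the target labels are placeholders.

```python
# (seconds without acknowledgement, who to phone). Assumption: both delays
# are measured from incident creation, not from the previous call.
ESCALATION_STEPS = [
    (60, "oncall_sre"),
    (180, "sre_leader"),
]


def escalation_targets(seconds_without_ack):
    """Return everyone who should have been phoned by now, in call order."""
    return [
        target
        for delay, target in ESCALATION_STEPS
        if seconds_without_ack >= delay
    ]


after_30s = escalation_targets(30)
after_90s = escalation_targets(90)
after_4min = escalation_targets(240)
```

Keeping the steps in a table rather than in nested conditionals makes it straightforward to vary the policy per scenario.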

Automation and Fault Recall

Automatic recall, meaning that ERC opens an incident on its own without waiting for a human report, is triggered when selected business‑impact metrics cross defined thresholds. Each emergency scenario is configured with:

Scenario name.

Associated business (linked to the service‑tree).

Maintainer (SRE owner).

Fixed collaboration group (pre‑created chat).

On‑call linkage (integration with the on‑call platform).

Escalation strategy (1‑minute phone, 3‑minute leader call).

Rule expression: metric OP threshold FOR duration (e.g., core_app_availability < 95% FOR 2m).
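
The rule expression above can be parsed and evaluated in a few lines. This is a sketch of one plausible interpretation, not ERC's actual grammar: the supported operators, the `s`/`m`/`h` duration units, and the requirement that every sample in the window breach the threshold are assumptions. Metric values are taken to be in the same unit as the threshold (percent in the example).

```python
import operator
import re

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}

# metric OP threshold[%] FOR <n><unit>, e.g. "core_app_availability < 95% FOR 2m"
RULE_RE = re.compile(
    r"^\s*(?P<metric>\w+)\s*(?P<op><=|>=|<|>)\s*(?P<value>\d+(?:\.\d+)?)%?"
    r"\s+FOR\s+(?P<n>\d+)(?P<unit>[smh])\s*$"
)


def parse_rule(text):
    """Parse a rule string into a dict; raise ValueError if it does not match."""
    m = RULE_RE.match(text)
    if m is None:
        raise ValueError(f"unrecognized rule: {text!r}")
    return {
        "metric": m["metric"],
        "op": m["op"],
        "threshold": float(m["value"]),
        "for_seconds": int(m["n"]) * UNIT_SECONDS[m["unit"]],
    }


def should_recall(rule, samples, now):
    """samples: list of (timestamp_seconds, value).

    True only if the data reaches back to the start of the FOR window and
    every sample inside the window breaches the threshold.
    """
    compare = OPS[rule["op"]]
    start = now - rule["for_seconds"]
    window = [v for t, v in samples if start <= t <= now]
    covers_window = any(t <= start for t, _ in samples)
    return covers_window and bool(window) and all(
        compare(v, rule["threshold"]) for v in window
    )


rule = parse_rule("core_app_availability < 95% FOR 2m")
breached = should_recall(rule, [(0, 94.0), (60, 93.5), (120, 94.2)], now=120)
recovered_midway = should_recall(rule, [(0, 94.0), (60, 96.0), (120, 94.2)], now=120)
```

Requiring the whole window to breach is what the `FOR` clause buys: a single bad sample does not open an incident.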

Recall suppression mechanisms include:

Converging alerts from the same business domain within a time window.

Manual merging of duplicate incidents.

Limits are set to avoid "cry‑wolf" alert fatigue; for example, rate‑limiting keeps high‑frequency alerts such as throttling from triggering excessive recalls.
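
Convergence by business domain can be sketched as follows. The class and its 10‑minute default window are hypothetical, and the window is assumed to start when the incident is created; the article does not say whether later alerts extend it.

```python
class RecallConverger:
    """Attach alerts from the same business domain to one incident (illustrative)."""

    def __init__(self, window_seconds=600):
        self.window_seconds = window_seconds  # assumed default, not from the article
        self._open = {}                       # domain -> (incident_id, created_ts)
        self._next_id = 1

    def route(self, domain, ts):
        """Return (incident_id, is_new) for an alert in `domain` at time `ts`."""
        current = self._open.get(domain)
        if current is not None and ts - current[1] <= self.window_seconds:
            # Same domain, still inside the window: converge into the open incident.
            return current[0], False
        incident_id = self._next_id
        self._next_id += 1
        self._open[domain] = (incident_id, ts)
        return incident_id, True


converger = RecallConverger(window_seconds=600)
routes = [
    converger.route("payments", 0),     # opens incident 1
    converger.route("payments", 300),   # converges into incident 1
    converger.route("search", 310),     # different domain, opens incident 2
    converger.route("payments", 700),   # window expired, opens incident 3
]
```

Manual merging, the second mechanism listed above, would cover the cases this time‑window heuristic misses.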

Collaboration Model

ERC adopts a two‑role model inspired by the Incident Command System, as adapted in Google's SRE incident‑management practice:

Fault Commander: Assembles the team, assigns tasks, makes decisions, and escalates if needed.

Frontline Engineers: Diagnose, contain, and resolve the issue.

The first responder becomes the commander; escalation to higher‑level leaders occurs for complex or wide‑impact incidents.

Mobile Support

A mobile client enables SREs to receive alerts, join collaboration groups, and execute mitigation actions from any location, ensuring 24/7 responsiveness without a dedicated NOC.

Post‑mortem and Data Operations

After recovery, ERC generates a default timeline covering all lifecycle stages and provides a structured post‑mortem template with fields:

Title, time, impact, status.

Description, solution, reflection.

Improvement items, responsibility, metrics.

Key performance indicators tracked per incident:

MTTI (Mean Time to Identify): time from fault occurrence to detection.

MTTK (Mean Time to Know): time from detection to delimitation/localization.

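MTTF (Mean Time to Fix): time from delimitation to mitigation. Here the acronym does not carry its more common reliability‑engineering meaning of Mean Time To Failure.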

MTTV (Mean Time to Verify): time from mitigation to verification of full recovery.
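
Because the four stages are contiguous, their durations add up to the total time from occurrence to verified recovery. The sketch below computes them from a timeline; the milestone keys, helper names, and sample timestamps are invented for illustration and are not ERC data.

```python
from datetime import datetime


def stage_durations(t):
    """t maps milestone name -> datetime. Returns seconds spent in each stage."""
    return {
        "TTI": (t["detected"] - t["occurred"]).total_seconds(),
        "TTK": (t["delimited"] - t["detected"]).total_seconds(),
        "TTF": (t["mitigated"] - t["delimited"]).total_seconds(),
        "TTV": (t["verified"] - t["mitigated"]).total_seconds(),
    }


def mean_by_stage(incidents):
    """Average each stage over a non-empty list of incident timelines."""
    per_incident = [stage_durations(t) for t in incidents]
    return {
        f"M{stage}": sum(d[stage] for d in per_incident) / len(per_incident)
        for stage in ("TTI", "TTK", "TTF", "TTV")
    }


def at(hms):
    # Helper for the made-up example: a time of day on the article's date.
    return datetime.strptime(f"2024-09-13 {hms}", "%Y-%m-%d %H:%M:%S")


incidents = [
    {"occurred": at("10:00:00"), "detected": at("10:01:00"),
     "delimited": at("10:04:00"), "mitigated": at("10:08:00"),
     "verified": at("10:10:00")},
    {"occurred": at("12:00:00"), "detected": at("12:00:30"),
     "delimited": at("12:02:30"), "mitigated": at("12:05:30"),
     "verified": at("12:06:30")},
]
means = mean_by_stage(incidents)
total_mean_seconds = sum(means.values())
```

For these two sample timelines the stage means are 45 s, 150 s, 210 s and 90 s, which sum to 495 s, the mean occurrence‑to‑verification time.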

Platform Evolution

Three major releases:

1.0 Problem Management (2019): Basic issue recording, limited post‑mortem support.

2.0 Event Collaboration: Added real‑time fault announcements, subscription‑based notifications, and improved post‑mortem management.

3.0 Emergency Response Center : Full fault‑lifecycle platform with 1‑3‑5‑10 targets, automatic recall, mobile client, and integrated service‑tree mapping.

Key architectural components include:

Service‑tree metadata for automatic stakeholder mapping.

OpenAPI for third‑party integration (e.g., customer‑service systems).

Dashboard visualizing fault lifecycle and metrics.

Results and Outlook

Since launch:

Automatic recall accuracy and precision > 80%.

Overall 1‑3‑5‑10 compliance > 60%.

Significant reduction in MTTR.

Future work focuses on:

Pre‑plan platform with root‑cause recommendations.

SLO‑driven ERC recall.

Infrastructure‑fault integration (network, switch failures).

Fault sub‑operations and deeper automation.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
