
Bilibili's SRE Practice for Business Stability: Theory, Metrics, and Operational Implementation

Bilibili's SRE team combines stability theory, detailed fault-stage and operational metrics, and a unified emergency-response platform, spanning on-call scheduling, Fault-Command incident commanders, automated fault portraits, and rapid post-mortems, to transform frequent incidents into data-driven, collaborative recoveries and lay the groundwork for AI-assisted self-healing.

Bilibili Tech

Introduction

With Bilibili's rapid growth in recent years, the scale of its business has expanded dramatically, iteration has accelerated, and system complexity has increased; daily online incidents have become more frequent and harder to diagnose. To keep services stable at a high baseline, Bilibili established a dedicated SRE team that approaches stability from two directions, theoretical grounding and capability building, covering emergency response, incident operations, disaster-recovery drills, and cultural awareness.

Theoretical Guidance

The article first defines stability theory and explains why theory is essential: it moves practitioners from merely knowing what happens to understanding why it happens, enabling more strategic capability building.

Key concepts introduced include:

2.1 Business Stability Operation

Business

In software, a business is a set of interrelated services or applications that together accomplish a shared goal.

Stability

Stability, as defined by Wikipedia, means a system produces bounded output for bounded input. In practice, a service (e.g., Bilibili’s “like” feature) is stable if user actions produce the expected result.

Operation

Operation refers to the planned, organized, and controlled management activities that SRE performs to prevent or reduce instability.

2.2 Incident

Borrowing from ITIL v4, an incident records any significant change to a resource (hardware, software, or configuration). Bilibili aggregates alerts, changes, public complaints, and On‑Call tickets under this umbrella to enable unified analysis and rapid problem identification.
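To make the aggregation concrete, here is a minimal sketch of what such a unified incident record could look like; the field names and enum values are illustrative, not Bilibili's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class IncidentSource(Enum):
    """The four signal sources the article says are aggregated together."""
    ALERT = "alert"
    CHANGE = "change"
    COMPLAINT = "public_complaint"
    ONCALL_TICKET = "oncall_ticket"


@dataclass
class Incident:
    """One normalized incident record, whatever the signal's origin."""
    incident_id: str
    source: IncidentSource
    resource: str                # the hardware, software, or config item affected
    description: str
    occurred_at: datetime
    related_ids: list[str] = field(default_factory=list)  # links to correlated records
```

Normalizing all four sources into one record type is what lets a single pipeline correlate them, for example noticing that an alert fired minutes after a change to the same resource.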

2.3 Fault

Both escalated incidents and outright disasters are treated as faults: any deviation from expected behavior is a fault, and an incident that goes unresolved and keeps expanding in impact escalates into one.

2.4 Emergency Response

Derived from GB/T 24363-2009, China's information-security emergency-response standard, emergency response for stability rests on three core elements: people, process, and platform.

People – the responders whose mindset and expertise are critical.

Process – standardized procedures for consistent handling.

Platform – tools that support people and processes, measure each stage, and drive continuous improvement.

2.5 Fault Lifecycle

Faults can be divided by stage (pre-fault, in-fault, post-fault) or by workflow (prevention, occurrence, response, localization, recovery, review). This segmentation guides the design of response processes and platform features.
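As a small illustration of how this segmentation can drive platform design, the workflow stages could be encoded as an enum that every tool and metric references; the naming here is a sketch, not Bilibili's code.

```python
from enum import Enum


class FaultStage(Enum):
    """The six workflow stages of the fault lifecycle described above."""
    PREVENTION = "prevention"
    OCCURRENCE = "occurrence"
    RESPONSE = "response"
    LOCALIZATION = "localization"
    RECOVERY = "recovery"
    REVIEW = "review"


# Each platform feature can declare the stage it serves: the fault portrait
# (section 3.4) targets LOCALIZATION, the post-mortem tooling targets REVIEW.
```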

2.6 Stability Metrics

Fault‑Stage Metrics

The primary indicators are MTBF (Mean Time Between Failures) and MTTR (Mean Time To Recovery). Google further splits MTTR into:

MTTI – Mean Time To Identify: how long it takes to detect that a fault is occurring.

MTTK – Mean Time To Know: how long it takes to locate the root cause.

MTTF – Mean Time To Fix: how long it takes to repair the fault.

MTTV – Mean Time To Verify: how long it takes to confirm that the fix has restored service.
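These four stages partition the total recovery time, so for a single fault TTR = TTI + TTK + TTF + TTV, and averaging over many faults yields the mean values. A minimal sketch of the decomposition, with hypothetical stage timestamps as inputs:

```python
from datetime import datetime


def mttr_breakdown(occurred: datetime, detected: datetime, located: datetime,
                   fixed: datetime, verified: datetime) -> dict[str, float]:
    """Split one fault's recovery time (in minutes) into the four MTTR stages.
    The timestamps must be in chronological order."""
    def minutes(start: datetime, end: datetime) -> float:
        return (end - start).total_seconds() / 60

    return {
        "tti": minutes(occurred, detected),  # time to identify the fault
        "ttk": minutes(detected, located),   # time to know the root cause
        "ttf": minutes(located, fixed),      # time to fix it
        "ttv": minutes(fixed, verified),     # time to verify the fix
        "ttr": minutes(occurred, verified),  # total time to recovery
    }


# A fault at 10:00, detected 10:05, located 10:25, fixed 10:40, verified 10:45
# yields tti=5, ttk=20, ttf=15, ttv=5, and ttr=45: the stages sum to the total.
```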

Operational Metrics

Additional metrics include the ratio of manually reported to automatically detected faults, the effectiveness of impact assessment, the incident-to-fault conversion rate, the incident hand-off rate, and the completion and recurrence rates of improvement tasks.
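Most of these reduce to simple ratios over a reporting window. A hedged sketch, with plain counts as inputs and metric names that are illustrative rather than Bilibili's exact definitions:

```python
def operational_metrics(total_faults: int, auto_detected: int,
                        incidents: int, escalated_to_fault: int,
                        improvement_tasks: int, completed_tasks: int) -> dict[str, float]:
    """Compute the operational ratios named above from plain counts."""
    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    return {
        "auto_detection_rate": ratio(auto_detected, total_faults),
        "manual_report_rate": ratio(total_faults - auto_detected, total_faults),
        "incident_to_fault_rate": ratio(escalated_to_fault, incidents),
        "improvement_completion_rate": ratio(completed_tasks, improvement_tasks),
    }
```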

Operational Practice

3.1 Emergency Principles

The first principle is “stop loss, then locate”. When a fault occurs, immediate mitigation (e.g., rollback, restart, scaling) should precede root‑cause analysis.

Operations triad: Restart, Rollback, Scale‑out.

Service‑governance tools: Circuit‑breaker, Rate‑limiting, Degradation.
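Of the governance tools, the circuit breaker illustrates the "stop loss, then locate" principle most directly: it cuts off a failing dependency before anyone has diagnosed it. A minimal sketch of the pattern (illustrative, not Bilibili's implementation):

```python
import time


class CircuitBreaker:
    """After `max_failures` consecutive errors the circuit opens and calls
    fail fast for `reset_after` seconds; stopping the loss is decoupled
    from finding the root cause."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```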

3.2 Organization & People

Effective incident handling means getting the right people involved at the right time. Bilibili built an On‑Call system and a Fault‑Command IC (Incident Commander) role to map businesses, organizations, and personnel to one another.

On‑Call System

The system provides calendar‑based duty scheduling, API access, and real‑time notifications, solving problems such as “cannot find the owner” and “being disturbed outside duty hours”.
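At its core, the duty lookup the platform exposes over its API is a point-in-time query against a calendar. A simplified sketch, with made-up names:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DutyShift:
    engineer: str
    start: datetime
    end: datetime


def who_is_on_call(schedule: list[DutyShift], at: datetime) -> str | None:
    """Resolve the on-duty engineer at a given moment. Returning None (no
    shift covers `at`) is exactly the 'cannot find the owner' gap the
    calendar is meant to close."""
    for shift in schedule:
        if shift.start <= at < shift.end:
            return shift.engineer
    return None


# Example: with a 09:00-21:00 shift for alice and a 21:00-09:00 shift for bob,
# who_is_on_call(schedule, datetime(2024, 5, 1, 23, 0)) returns "bob".
```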

Views

Two hierarchical views are maintained:

Function view – Organization → Function → Coverage (service). Duty tables attach to coverage nodes.

Business view – Organization → Business → Function. Duty tables attach to functions.

Both views share the same underlying data, ensuring consistency.
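One way to realize "two views, one source of truth" is to let both hierarchies resolve to the same duty-table store, as in this toy sketch (node names are hypothetical):

```python
# Duty tables live in one store, keyed by the node that owns the schedule.
duty_tables = {"live/checkout": "schedule-42"}

# Function view: Organization -> Function -> Coverage (service).
function_view = {"org-a": {"checkout": ["live/checkout"]}}
# Business view: Organization -> Business -> Function.
business_view = {"org-a": {"live": ["live/checkout"]}}

# Both paths land on the same entry, so editing the schedule once
# updates it in every view.
assert (duty_tables[function_view["org-a"]["checkout"][0]]
        == duty_tables[business_view["org-a"]["live"][0]])
```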

Fault‑Command IC

A virtual team that leads major incidents, clarifies responsibilities, and ensures precise information flow. After resolution, the IC drives post‑mortem, improvement tracking, and follow‑up.

Awareness

Regular internal sharing and cultural initiatives raise stability awareness among all engineers.

3.3 Efficient Collaboration

Key collaboration features include:

Clear role display on the incident detail page.

One‑click invitation of additional responders.

Automatic creation of emergency collaboration groups with incident briefs.
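The group-creation step is a natural automation target: when an incident is opened, the platform can call the chat system's API with the incident brief pre-filled. A sketch under the assumption of a generic REST chat endpoint; the URL and payload shape are placeholders, not a real API.

```python
import json
from urllib import request


def open_war_room(incident_id: str, brief: str, responders: list[str]) -> None:
    """Create an emergency collaboration group seeded with the incident brief."""
    payload = {
        "name": f"incident-{incident_id}",
        "members": responders,
        "first_message": brief,  # everyone joining sees the brief immediately
    }
    req = request.Request(
        "https://chat.example.internal/api/groups",  # hypothetical endpoint
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    request.urlopen(req)
```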

3.4 Fault Portrait (Root‑Cause Assistance)

Faults are modeled in three layers – hardware, infrastructure, and business – to generate a topological “portrait” that links related incidents, changes, and alerts, accelerating diagnosis.
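A fault portrait of this kind is naturally a layered dependency graph, and "assistance" means walking it from a symptom toward candidate causes. A toy sketch with hypothetical node names:

```python
from collections import defaultdict

# Edges point downward through the three layers:
# business -> infrastructure -> hardware.
edges: dict[str, set[str]] = defaultdict(set)
edges["like-service"].add("k8s-cluster-1")     # business depends on infra
edges["k8s-cluster-1"].add("rack-17-node-03")  # infra runs on hardware


def candidate_causes(symptom: str) -> set[str]:
    """Collect everything the symptomatic node depends on: a first cut at
    linking a business alert to a likely lower-layer root cause."""
    seen: set[str] = set()
    stack = [symptom]
    while stack:
        node = stack.pop()
        for dep in edges.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


# candidate_causes("like-service") -> {"k8s-cluster-1", "rack-17-node-03"}
```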

3.5 Effective Post‑Mortem

Post‑mortems are required within 24 hours for normal faults and 48 hours for major faults. The process includes timeline reconstruction, root‑cause analysis (technical, organizational, procedural), and actionable improvement tasks. Automation links incident data to post‑mortem documents, standardizes formats, and generates reports for management.
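The automation described here can be as simple as stamping a standardized skeleton with the incident's data and its deadline. A sketch; the section headings are illustrative, not Bilibili's exact template:

```python
from datetime import datetime, timedelta


def postmortem_skeleton(incident_id: str, occurred_at: datetime,
                        is_major: bool) -> str:
    """Generate a post-mortem stub with the deadline pre-filled:
    48 hours for major faults, 24 hours for normal ones."""
    due = occurred_at + timedelta(hours=48 if is_major else 24)
    return "\n".join([
        f"Post-mortem for {incident_id} (due {due:%Y-%m-%d %H:%M})",
        "1. Timeline",           # reconstructed from incident records
        "2. Root cause",         # technical, organizational, procedural
        "3. Improvement tasks",  # each with an owner and a deadline
    ])
```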

Challenges

4.1 Unified Metadata

Lack of a single source of truth for services, business units, and personnel made incident notification and fault‑portrait construction difficult. Bilibili rebuilt the service tree and On‑Call mappings to resolve this.

4.2 Change in Work Mode

Shifting from manual, ad‑hoc coordination to system‑driven workflows required cultural adaptation, UI/UX refinements, and continuous training.

Conclusion & Outlook

Bilibili’s systematic SRE implementation has linked organization, process, and platform, achieving data‑driven stability assessment and faster incident recovery. Future work will explore AI‑assisted fault localization, early‑warning of hidden risks, and automated self‑healing scenarios.
