Bilibili's SRE Practice for Business Stability: Theory, Metrics, and Operational Implementation
Bilibili's SRE team combines stability theory, fault-stage and operational metrics, and a unified emergency-response platform to turn frequent incidents into data-driven, collaborative recoveries. The platform covers on-call scheduling, a Fault-Command IC (Incident Commander) role, automated fault portraits, and rapid post-mortems, and it lays the groundwork for AI-assisted self-healing.
Introduction
With Bilibili's rapid growth in recent years, its business has expanded dramatically, iteration has accelerated, and system complexity has increased. Online incidents have become more frequent and harder to diagnose. To keep services stable at a high baseline, Bilibili established a dedicated SRE team that approaches stability from two directions, theoretical grounding and capability building, and covers emergency response, incident operation, disaster recovery drills, and cultural awareness.
Theoretical Guidance
The article first defines stability theory and explains why theory is essential: it moves practitioners from merely knowing what happens to understanding why it happens, enabling more strategic capability building.
Key concepts introduced include:
2.1 Business Stability Operation
Business
In software, a business is a set of inter‑related services or applications that together achieve a goal.
Stability
Stability, as defined by Wikipedia, means a system produces bounded output for bounded input. In practice, a service (e.g., Bilibili’s “like” feature) is stable if user actions produce the expected result.
Operation
Operation refers to the planned, organized, and controlled management activities that SRE performs to prevent or reduce instability.
2.2 Incident (事态)
Borrowing from ITIL v4, an incident (事态) records any significant change to a resource (hardware, software, configuration). Bilibili aggregates alerts, changes, public complaints, and On‑Call tickets under this umbrella to enable unified analysis and rapid problem identification.
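As a rough illustration of what "under this umbrella" could mean in practice, the sketch below normalizes the four sources into one record type and queries them together. The field names, the `Source` enum, and the 30-minute window are assumptions made for this example, not Bilibili's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Source(Enum):
    ALERT = "alert"
    CHANGE = "change"
    COMPLAINT = "complaint"
    ONCALL_TICKET = "oncall_ticket"


@dataclass(frozen=True)
class Incident:
    source: Source
    service: str           # hypothetical service identifier
    occurred_at: datetime
    summary: str


def related(incidents, service, around, window=timedelta(minutes=30)):
    """Return incidents on `service` within `window` of `around`, oldest first."""
    hits = [
        i for i in incidents
        if i.service == service and abs(i.occurred_at - around) <= window
    ]
    return sorted(hits, key=lambda i: i.occurred_at)
```

With every source in one shape, a question such as "what changed on this service shortly before the alert fired?" becomes a single filter instead of a search across four systems.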
2.3 Fault
Both incidents and disasters are treated as “faults”. Any deviation from expected behavior is a fault, and unresolved incidents that expand in impact become faults.
2.4 Emergency Response
Derived from the GB/T 24363-2009 security-incident standard, emergency response for stability rests on three core elements: people, process, and platform.
People – the responders whose mindset and expertise are critical.
Process – standardized procedures for consistent handling.
Platform – tools that support people and processes, measure each stage, and drive continuous improvement.
2.5 Fault Lifecycle
Faults can be divided by stage (pre-, during, post-) or by workflow (prevention, occurrence, response, localization, recovery, review). This segmentation guides the design of response processes and platform features.
2.6 Stability Metrics
Fault‑Stage Metrics
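The primary indicators are MTBF (Mean Time Between Failures) and MTTR (Mean Time To Recovery). Google further splits MTTR into:
MTTI – Mean Time To Identify: from fault occurrence to detection.
MTTK – Mean Time To Know: from detection to locating the root cause.
MTTF – Mean Time To Fix: from locating the cause to applying the fix. This is not the reliability-engineering MTTF (Mean Time To Failure), which shares the abbreviation.
MTTV – Mean Time To Verify: from the fix to confirmation that the service is healthy.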
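If the four stages are treated as contiguous and non-overlapping, MTTR is simply MTTI + MTTK + MTTF + MTTV. The sketch below computes each stage mean from per-fault timestamps. The timestamp keys (`occurred`, `detected`, `located`, `fixed`, `verified`) are hypothetical names chosen for this example.

```python
from datetime import datetime
from statistics import mean

# (metric, start timestamp key, end timestamp key); keys are assumed names.
STAGES = [
    ("MTTI", "occurred", "detected"),
    ("MTTK", "detected", "located"),
    ("MTTF", "located", "fixed"),
    ("MTTV", "fixed", "verified"),
]


def stage_means(faults):
    """Mean minutes spent in each stage; each fault maps a key to a datetime."""
    result = {}
    for name, start, end in STAGES:
        result[name] = mean(
            (f[end] - f[start]).total_seconds() / 60 for f in faults
        )
    # Stages are contiguous, so the stage means add up to the overall MTTR.
    result["MTTR"] = sum(result[name] for name, _, _ in STAGES)
    return result
```

Breaking MTTR down this way shows which stage dominates recovery time, which is what decides whether to invest in detection, diagnosis, or mitigation tooling.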
Operational Metrics
Additional metrics include the ratio of manually reported to automatically detected faults, the effectiveness of impact assessment, the conversion rate from incident to fault, the incident hand-off rate, and the completion and recurrence rates of improvement tasks.
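Several of these operational metrics are plain ratios over fault records. The following is a minimal sketch under assumed definitions: the `FaultRecord` fields are hypothetical, and hand-off is counted per fault here for simplicity, which may differ from how Bilibili counts it.

```python
from dataclasses import dataclass


@dataclass
class FaultRecord:
    auto_detected: bool   # found by monitoring rather than reported by a person
    handed_off: bool      # transferred to another responder at least once
    tasks_total: int      # improvement tasks opened in the post-mortem
    tasks_done: int       # improvement tasks completed


def ratio(numerator, denominator):
    """Safe division that returns 0.0 when there is nothing to count."""
    return numerator / denominator if denominator else 0.0


def operational_metrics(faults, incident_count):
    n = len(faults)
    return {
        "auto_detection_rate": ratio(sum(f.auto_detected for f in faults), n),
        "handoff_rate": ratio(sum(f.handed_off for f in faults), n),
        "incident_to_fault_rate": ratio(n, incident_count),
        "task_completion_rate": ratio(
            sum(f.tasks_done for f in faults),
            sum(f.tasks_total for f in faults),
        ),
    }
```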
Operational Practice
3.1 Emergency Principles
The first principle is “stop loss, then locate”. When a fault occurs, immediate mitigation (e.g., rollback, restart, scaling) should precede root‑cause analysis.
Operations triad: Restart, Rollback, Scale‑out.
Service‑governance tools: Circuit‑breaker, Rate‑limiting, Degradation.
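To make one of the service-governance tools concrete, here is a minimal circuit-breaker sketch that degrades to a fallback while the dependency is failing. It is a generic illustration, not Bilibili's implementation. The thresholds and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Open after `max_failures` consecutive failures; probe again after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: skip the dependency and degrade
            # Half-open: fall through and let one trial call probe the dependency.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip, or re-trip after a failed probe
            return fallback()
        # Success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The breaker follows the "stop loss, then locate" principle: callers get a degraded answer immediately, and the failing dependency is shielded from traffic while responders investigate.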
3.2 Organization & People
Effective incident handling requires the right people at the right time. Bilibili built an On‑Call system and a Fault‑Command IC (Incident Commander) to map business, organization, and personnel.
On‑Call System
The system provides calendar‑based duty scheduling, API access, and real‑time notifications, solving problems such as “cannot find the owner” and “being disturbed outside duty hours”.
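The core lookup behind calendar-based scheduling can be sketched as follows. The `Shift` type and the half-open interval convention are assumptions for illustration. The real system also exposes an API and notifications, which are not modeled here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Shift:
    engineer: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def on_call(shifts, at):
    """Return the engineer on duty at `at`, or None if the slot is uncovered."""
    for shift in shifts:
        if shift.start <= at < shift.end:
            return shift.engineer
    return None
```

Routing notifications through a lookup like this addresses both problems named above: the owner is always resolvable from the schedule, and engineers who are off duty are not paged.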
Views
Two hierarchical views are maintained:
Function view – Organization → Function → Coverage (service). Duty tables attach to coverage nodes.
Business view – Organization → Business → Function. Duty tables attach to functions.
Both views share the same underlying data, ensuring consistency.
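One way to picture two views over shared data is to derive both trees from the same flat rows, so they cannot drift apart. The rows, field names, and the exact relationship between the views below are assumptions made for this sketch.

```python
# Single source of truth: one row per duty table. All values are hypothetical.
ROWS = [
    {"org": "Platform", "business": "Live", "function": "SRE",
     "coverage": "live-gateway", "duty": "sre-live"},
    {"org": "Platform", "business": "Live", "function": "DBA",
     "coverage": "live-mysql", "duty": "dba-live"},
    {"org": "Platform", "business": "Video", "function": "SRE",
     "coverage": "video-api", "duty": "sre-video"},
]


def function_view(rows):
    """Organization -> Function -> Coverage, with a duty table on each coverage node."""
    tree = {}
    for r in rows:
        tree.setdefault(r["org"], {}).setdefault(r["function"], {})[r["coverage"]] = r["duty"]
    return tree


def business_view(rows):
    """Organization -> Business -> Function, with duty tables on each function."""
    tree = {}
    for r in rows:
        functions = tree.setdefault(r["org"], {}).setdefault(r["business"], {})
        functions.setdefault(r["function"], set()).add(r["duty"])
    return tree
```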
Fault‑Command IC
The Fault-Command IC is a virtual team that leads the response to major incidents, clarifies responsibilities, and keeps information flowing accurately. After resolution, the IC drives the post-mortem, improvement tracking, and follow-up.
Awareness
Regular internal sharing and cultural initiatives raise stability awareness among all engineers.
3.3 Efficient Collaboration
Key collaboration features include:
Clear role display on the incident detail page.
One‑click invitation of additional responders.
Automatic creation of emergency collaboration groups with incident briefs.
3.4 Fault Portrait (Root‑Cause Assistance)
Faults are modeled in three layers – hardware, infrastructure, and business – to generate a topological “portrait” that links related incidents, changes, and alerts, accelerating diagnosis.
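A simplified way to build such a portrait is to walk the dependency topology from the faulty business service down through infrastructure to hardware, collecting recent incidents on every node reached. The graph, node names, incident shape, and 30-minute window below are all hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta

# Hypothetical edges: business -> infrastructure -> hardware.
DEPENDS_ON = {
    "like-service": ["redis-cluster-a", "k8s-node-17"],
    "redis-cluster-a": ["rack-b12"],
    "k8s-node-17": ["rack-b12"],
    "rack-b12": [],
}


def portrait(root, depends_on, incidents, fault_time, window=timedelta(minutes=30)):
    """Breadth-first walk from `root`, keeping incidents in the window before the fault."""
    seen, queue, result = {root}, deque([root]), {}
    while queue:
        node = queue.popleft()
        hits = [
            i for i in incidents
            if i["node"] == node
            and timedelta(0) <= fault_time - i["at"] <= window
        ]
        if hits:
            result[node] = hits
        for dep in depends_on.get(node, []):
            if dep not in seen:  # shared dependencies are visited once
                seen.add(dep)
                queue.append(dep)
    return result
```

In this toy graph, a change on `rack-b12` shortly before the fault would surface in the portrait of `like-service` even though the two are two hops apart, which is the kind of link that is easy to miss by hand.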
3.5 Effective Post‑Mortem
Post‑mortems are required within 24 hours for normal faults and 48 hours for major faults. The process includes timeline reconstruction, root‑cause analysis (technical, organizational, procedural), and actionable improvement tasks. Automation links incident data to post‑mortem documents, standardizes formats, and generates reports for management.
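The deadline rule is easy to automate. The sketch below assumes the clock starts when the fault is recovered, which the text does not specify. The severity labels are likewise placeholders.

```python
from datetime import datetime, timedelta

# Review windows from the text; the "normal"/"major" keys are assumed labels.
REVIEW_WINDOW = {
    "normal": timedelta(hours=24),
    "major": timedelta(hours=48),
}


def postmortem_due(recovered_at, severity):
    """Deadline for the post-mortem, counted from recovery (an assumption)."""
    return recovered_at + REVIEW_WINDOW[severity]


def is_overdue(recovered_at, severity, now):
    return now > postmortem_due(recovered_at, severity)
```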
Challenges
4.1 Unified Metadata
Lack of a single source of truth for services, business units, and personnel made incident notification and fault‑portrait construction difficult. Bilibili rebuilt the service tree and On‑Call mappings to resolve this.
4.2 Change in Work Mode
Shifting from manual, ad‑hoc coordination to system‑driven workflows required cultural adaptation, UI/UX refinements, and continuous training.
Conclusion & Outlook
Bilibili’s systematic SRE implementation has linked organization, process, and platform, achieving data‑driven stability assessment and faster incident recovery. Future work will explore AI‑assisted fault localization, early‑warning of hidden risks, and automated self‑healing scenarios.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.