
How Bilibili Scaled Server Fault Management with Automated Detection and Repair

This article details Bilibili's evolving server fault management architecture, covering fault classification, the shortcomings of manual processes, and the design of an automated detection and repair system that combines in‑band and out‑of‑band data collection, rule‑based alerts, and end‑to‑end repair automation.

Cognitive Technology Team

1. Background

With Bilibili’s rapid business growth, the server fleet has expanded dramatically, and fault management has become increasingly challenging at scale:

Low efficiency of manual handling: traditional manual troubleshooting cannot keep up with the massive server fleet.

Fragmented toolchains: diverse hardware requires different tools, increasing complexity and time cost.

Efficient server fault management is therefore critical for platform stability and user experience.

2. Server Faults

2.1 Fault Classification

Faults are divided into soft faults (software errors, service exceptions) and hard faults (disk, NIC, GPU hardware failures). Additionally, faults can be categorized by repair mode: online (non‑impacting) and offline (requires downtime).

Fault classification diagram

We currently focus on detecting and handling hard faults and file‑system faults.

2.2 Shortcomings of Traditional Fault Management

Delayed fault discovery: reliance on manual checks or user feedback.

Low investigation efficiency: manual root‑cause analysis is time‑consuming.

High communication cost during business migration: manual notifications increase response time.

Insufficient automation in the repair process: lack of systematic audit and traceability.

Thus we decided to fully automate fault detection and repair.

2.3 Goals

Eliminate delayed fault discovery and low investigation efficiency by implementing an automated detection pipeline (information collection → rule matching → alert).

Reduce communication cost and improve process automation through an automated repair workflow.

3. Automated Fault Detection Solution

The overall architecture consists of five core components: an in‑band Agent, a log platform, a detection service, a rule database, and a fault‑management platform.

Agent: lightweight component on each server that collects hardware status (disk, NIC, GPU) and reports to the detection service.

Log platform: receives system logs via rsyslog for analysis.

Detection service: includes in‑band modules (processing Agent data and dmesg logs) and out‑of‑band modules (processing SNMP traps and Redfish API data).

DB: stores detection rules and generated alerts.

Fault‑management platform: visualizes alerts for users.

Automated detection architecture
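The interplay of these components can be sketched as a minimal rule-driven detection loop. The Report/Alert shapes, the rule table, and the severity values below are illustrative assumptions, not Bilibili's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Report:
    host: str
    source: str   # "agent", "dmesg", "snmp_trap", "redfish"
    payload: dict

@dataclass
class Alert:
    host: str
    fault_code: str
    severity: str

# Hypothetical rule library: fault_code -> (expected source, predicate)
RULES = {
    "NVME_LIFE_LOW": ("agent", lambda p: p.get("nvme_life_pct", 100) < 10),
    "NIC_LINK_DOWN": ("dmesg", lambda p: "Link is Down" in p.get("line", "")),
}

def detect(report: Report) -> list[Alert]:
    """Match one incoming report against the rule library and emit alerts,
    which the fault-management platform would then surface to users."""
    alerts = []
    for code, (source, pred) in RULES.items():
        if report.source == source and pred(report.payload):
            alerts.append(Alert(report.host, code, "P1"))
    return alerts
```

In the real system the rules live in the DB rather than in code, so they can be updated without redeploying the detection service.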

3.1 Information Collection Methods

We identified two main collection methods: in‑band and out‑of‑band.

In‑band Collection

Relies on the operating system and software tools, providing rich system data and fine‑grained monitoring, but stops working when the server crashes.

Out‑of‑band Collection

Uses independent management hardware (BMC) to gather data even when the server is down, focusing on hardware‑level metrics.

Combining both methods yields comprehensive monitoring in large data centers.

3.1.1 In‑band Collection Details

Disk Module monitors storage devices (NVMe, SSD, HDD) and controllers, collecting health metrics such as remaining life percentage and bad block count. The Agent invokes tools like dmidecode and lspci, structures the data, and reports it to detection and asset services.

Disk collection architecture
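A minimal sketch of the collection step, assuming the Agent shells out to lspci (as the article describes) and structures the result for upload; the JSON field names and the omitted upload call are assumptions:

```python
import json
import subprocess

def list_nvme_controllers(lspci_output: str) -> list[dict]:
    """Parse plain `lspci` output and keep NVMe controller lines.
    Typical line: '01:00.0 Non-Volatile memory controller: Samsung ...'"""
    devices = []
    for line in lspci_output.splitlines():
        addr, _, desc = line.partition(" ")
        if "Non-Volatile memory controller" in desc:
            devices.append({"pci_addr": addr, "model": desc.split(": ", 1)[-1]})
    return devices

def collect_disk_report(host: str) -> str:
    """Run lspci on this host and serialize a structured report for the
    detection and asset services (the HTTP upload itself is omitted)."""
    out = subprocess.run(["lspci"], capture_output=True, text=True).stdout
    return json.dumps({"host": host, "disks": list_nvme_controllers(out)})
```

Health metrics such as remaining life and bad-block counts would come from device-specific tools layered on top of this inventory step.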

GPU Module monitors core and memory utilization, ECC errors, power consumption, etc. The Agent wraps DCGM and DCMI interfaces to support NVIDIA, Huawei, and other GPUs, normalizing health data for the detection service.

GPU collection architecture
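The article's Agent wraps DCGM and DCMI; as a simpler stand-in, here is a hedged sketch that normalizes the CSV form of `nvidia-smi --query-gpu=...` output into per-GPU health records. The exact field list is an assumption:

```python
import csv
import io

# Assumed query fields, matching the metrics named above
FIELDS = ["index", "utilization.gpu", "utilization.memory",
          "ecc.errors.uncorrected.volatile.total", "power.draw"]

def parse_gpu_csv(text: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`
    output into one normalized dict per GPU for the detection service."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        rows.append(dict(zip(FIELDS, (v.strip() for v in row))))
    return rows
```

A vendor-neutral record shape like this is what lets the same detection rules cover NVIDIA, Huawei, and other accelerators.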

3.1.2 Out‑of‑band Collection Details

When a server crashes, out‑of‑band collection gathers critical fault information via Redfish API or SNMP traps.

Compared with Redfish polling alone, SNMP traps bring two advantages:

Higher accuracy: OID‑based mapping provides fine‑grained fault classification, reducing false positives.

Higher flexibility: subscription mechanisms allow selective monitoring per vendor and model.

Because OID definitions vary across vendors, we supplement SNMP with Redfish health checks for coverage.

3.2 Fault Rule Management

A unified rule library standardizes fault identification, impact assessment, and handling procedures. Key fields include fault code, description, type (e.g., NIC, NVMe, GPU), severity level (P0‑P2), and rule expression (log keywords, metric thresholds).

Fault rule library example

4. Automated Repair Solution

4.1 Business Up/Down Automation

Previously, fault notifications relied on manual WeChat messages, leading to delays. The new automated workflow triggers repair tasks, generates callbacks, and closes the repair loop, improving efficiency.

Repair automation flow

The process runs fault detection → task generation → callback confirmation → repair execution, and supports both online (no downtime) and offline (downtime required) repairs.
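One way to picture this closed loop is as a small state machine. The state and event names here are hypothetical, but the transitions mirror the flow above, including re-entering repair when the post-repair check fails:

```python
# Hypothetical repair-task lifecycle:
# detection -> task generation -> callback confirmation -> repair -> verify.
TRANSITIONS = {
    "detected": {"create_task": "task_created"},
    "task_created": {"callback_confirmed": "confirmed"},
    "confirmed": {"start_repair": "repairing"},
    "repairing": {"repair_done": "verifying"},
    "verifying": {"health_ok": "closed", "health_fail": "repairing"},
}

def advance(state: str, event: str) -> str:
    """Apply one event; invalid events raise, so the workflow cannot
    silently skip steps such as callback confirmation."""
    try:
        return TRANSITIONS[state][event]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state!r}")
```

Making every transition explicit is also what gives the repair process the audit trail and traceability that the manual workflow lacked.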

4.2 Repair Process Automation

Key functions:

Email or API notifications to relevant parties after task creation.

Automatic asset updates after hardware replacement (disk, NIC, GPU, etc.).

Automated server status transitions (e.g., "repair requested", "delivered").

Post‑repair health checks to verify restored operation.

Repair process automation
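The asset update and post-repair health check might compare snapshots taken before and after the replacement; the field names below are illustrative, not the platform's actual schema:

```python
def post_repair_check(before: dict, after: dict) -> dict:
    """Compare asset snapshots around a hardware replacement: a changed
    serial number confirms the swap happened, and 'healthy' must be True
    before the server transitions back to the 'delivered' status.
    Field names ('serial', 'smart_status') are illustrative."""
    return {
        "replaced": before.get("serial") != after.get("serial"),
        "healthy": after.get("smart_status") == "PASSED",
    }
```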

5. Summary and Outlook

5.1 Summary

The architecture integrates hardware monitoring, log analysis, and fault management to achieve full‑lifecycle server fault handling. Combined in‑band and out‑of‑band monitoring yields:

Coverage: 99%

Accuracy: 99%

Recall: 95%

Real‑time monitoring, proactive alerts, and periodic inspections ensure most faults are detected and resolved promptly.

5.2 Outlook

Intelligent monitoring: Apply machine learning and AI to detect early fault signs and provide predictive alerts.

More efficient fault localization: New technologies will streamline diagnosis and accelerate resolution, reducing maintenance costs.

Enhanced security and reliability: Future maintenance will emphasize hardware/software security to prevent data leaks and unauthorized access.

Tags: monitoring · operations · server fault management · in‑band collection · out‑of‑band collection